
STT WebSocket API

Real-time Speech-to-Text transcription via WebSocket streaming with support for 39 languages (including code-switched Indic+English) and telephony integration. Powered by 60db STT v01 (a non-hallucinating, multi-backend speech recognition stack).

🚀 Quick Start (Copy & Paste)

const WebSocket = require('ws');

// 1. Your API key
const API_KEY = 'sk_live_your_api_key';

// 2. Connect
const ws = new WebSocket(`wss://api.60db.ai/ws/stt?apiKey=${API_KEY}`);

// 3. Handle messages
ws.on('message', (data) => {
  const msg = JSON.parse(data);

  // Authenticated? Start session!
  if (msg.connection_established) {
    console.log('✅ Authenticated');
    ws.send(JSON.stringify({
      type: 'start',
      languages: ['en'],
      config: { encoding: 'mulaw', sample_rate: 8000, continuous_mode: true }
    }));
  }

  // Session ready? Send audio!
  if (msg.type === 'connected') {
    console.log('✅ Ready! Send audio now');

    // Send dummy audio (480 bytes every 60ms)
    let count = 0;
    const interval = setInterval(() => {
      ws.send(Buffer.alloc(480, 0xff));
      if (++count >= 83) {  // 5 seconds
        clearInterval(interval);
        ws.send(JSON.stringify({ type: 'stop' }));
      }
    }, 60);
  }

  // Got text!
  if (msg.type === 'transcription' && msg.is_final) {
    console.log('📝', msg.text);
  }

  // Done!
  if (msg.type === 'session_stopped') {
    console.log('✅ Complete! Cost:', msg.billing_summary.total_cost);
    ws.close();
  }
});
That’s it! You’ll see:
  • ✅ Authenticated
  • ✅ Ready! Send audio now
  • 📝 Hello world (transcribed text)
  • ✅ Complete! Cost: $0.000043

📖 How It Works (5 Simple Steps)

  1. Connect with your API key
  2. Send { type: "start", ... } to begin session
  3. Stream audio data (binary chunks)
  4. Receive text transcriptions in real-time
  5. Stop with { type: "stop" } when done

Endpoint

wss://api.60db.ai/ws/stt

Authentication

Pass your API key as the apiKey query parameter:
wss://api.60db.ai/ws/stt?apiKey=sk_live_your_api_key

Connection Details

Property       | Value
Protocol       | WebSocket (RFC 6455)
Frame types    | Binary (telephony) or Text/JSON (browser)
Ping/keepalive | Server sends WebSocket pings every 30s (timeout 10s)

Session Lifecycle

Client                          Server
  │                               │
  │──── TCP/TLS connect ─────────►│
  │◄─── {"connecting": true} ─────│  authenticating...
  │◄─── connection_established ──│  authenticated
  │                               │
  │──── {"type":"start", ...} ───►│
  │◄─── {"type":"connected"} ─────│  session ready
  │                               │
  │──── audio frames / messages ─►│
  │◄─── {"type":"speech_started"} │  VAD detected voice
  │◄─── {"type":"transcription"}  │  is_final=true, speech_final=false (first emit, context only)
  │◄─── {"type":"transcription"}  │  is_final=true, speech_final=true  (canonical answer)
  │                               │
  │──── {"type":"stop"} ─────────►│
  │◄─── {"type":"session_stopped"}│
Two-phase finals (context-gated LLM refinement). When you supply a context object on start, every utterance produces two transcription events sharing a sentence_id:
  1. First emit (is_final: true, speech_final: false) — fast dict-corrected text. Use it for low-latency UI paint and barge-in.
  2. Canonical (is_final: true, speech_final: true) — the definitive LLM-refined answer. Always arrives.
When context is omitted, every utterance produces a single transcription event with is_final: true, speech_final: true (no first emit). Simple consumers can gate exclusively on speech_final: true and ignore the rest — that gives them exactly one canonical event per utterance regardless of whether refinement is on.

Client → Server Messages

start — Begin session

Sent once after connection is established. Must be sent before any audio.
{
  "type": "start",
  "languages": ["en", "hi"],
  "context": {
    "general": [
      { "key": "domain",  "value": "Healthcare" },
      { "key": "doctor",  "value": "Dr. Martha Smith" }
    ],
    "text":  "Routine diabetes follow-up consultation.",
    "terms": ["Celebrex", "Zyrtec", "Metformin", "HbA1c"]
  },
  "config": {
    "encoding": "mulaw",
    "sample_rate": 8000,
    "utterance_end_ms": 500,
    "continuous_mode": true,
    "interim_results_frequency": 300,
    "diarize": false
  }
}
Parameters:

audio — JSON audio chunk (browser mode)

{
  "type": "audio",
  "audio": "<base64-encoded Int16 PCM or μ-law bytes>",
  "encoding": "linear",
  "sample_rate": 48000,
  "timestamp": 1700000000000
}
Fields:
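In browser mode, the audio payload can be built from Web Audio Float32 samples. A minimal sketch, using Node's Buffer for the base64 step (in an actual browser, base64-encode a Uint8Array via btoa instead); the helper names are ours, not part of the API:

```javascript
// Convert Float32 samples (Web Audio range [-1, 1]) to base64 Int16 PCM.
function float32ToBase64Pcm(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;        // scale to Int16 range
  }
  return Buffer.from(pcm.buffer).toString('base64');
}

// Wrap one chunk in the JSON audio message shown above.
function sendAudioChunk(ws, samples, sampleRate) {
  ws.send(JSON.stringify({
    type: 'audio',
    audio: float32ToBase64Pcm(samples),
    encoding: 'linear',
    sample_rate: sampleRate,
    timestamp: Date.now(),
  }));
}
```

Clamping before scaling avoids Int16 overflow on samples that exceed full scale.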

Binary frame — raw μ-law audio (telephony mode)

Send a raw WebSocket binary frame with μ-law bytes, no JSON wrapper. The server auto-detects this as telephony mode on the first binary frame.
Recommended chunk size: 480 bytes = 60ms at 8kHz
Twilio default: 160 bytes = 20ms — batch 3 chunks into 60ms before sending
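Following the batching advice above, here is a sketch of a 160-byte-to-480-byte accumulator. The makeBatcher/flush shape is illustrative; only the 480-byte target comes from this doc:

```javascript
const TARGET = 480; // bytes = 60ms of 8kHz mulaw

// Accumulate small frames and emit fixed 480-byte chunks via onChunk.
function makeBatcher(onChunk) {
  let pending = [];
  let size = 0;
  return {
    push(frame) {
      pending.push(frame);
      size += frame.length;
      while (size >= TARGET) {
        const buf = Buffer.concat(pending, size);
        onChunk(buf.subarray(0, TARGET));
        const rest = buf.subarray(TARGET);
        pending = rest.length ? [rest] : [];
        size = rest.length;
      }
    },
    // Send any partial tail (call right before {"type":"stop"}).
    flush() {
      if (size > 0) onChunk(Buffer.concat(pending, size));
      pending = [];
      size = 0;
    },
  };
}
```

Usage: `const batcher = makeBatcher((chunk) => ws.send(chunk));` then call `batcher.push(Buffer.from(media.payload, 'base64'))` for each Twilio media event and `batcher.flush()` before stopping.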

config — Change language mid-session

{
  "type": "config",
  "languages": ["hi"],
  "continuous_mode": true
}
Both languages and continuous_mode are optional; include only fields you want to change. Send "languages": null to revert to auto-detect.

stop — End session

{
  "type": "stop"
}
Server processes any remaining audio buffer, sends session_stopped, then closes.

test — Ping / latency check

{
  "type": "test",
  "message": "ping",
  "timestamp": 1700000000000
}
Server echoes test_response with the same timestamp for round-trip measurement.
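The echo makes round-trip measurement simple to wire up. A sketch, assuming an already-authenticated ws handle (measureLatency is our name):

```javascript
// Measure WebSocket round-trip time via the test/test_response echo.
function measureLatency(ws) {
  return new Promise((resolve) => {
    const sent = Date.now();
    const onMessage = (data) => {
      const msg = JSON.parse(data);
      if (msg.type === 'test_response' && msg.timestamp === sent) {
        ws.off('message', onMessage);
        resolve(Date.now() - sent); // round trip in milliseconds
      }
    };
    ws.on('message', onMessage);
    ws.send(JSON.stringify({ type: 'test', message: 'ping', timestamp: sent }));
  });
}

// Usage: measureLatency(ws).then((ms) => console.log(`RTT ${ms} ms`));
```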

Server → Client Messages

connecting — Authentication in progress

{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}

connection_established — Authentication successful

{
  "connection_established": {
    "service": "stt",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
Fields:
  service (string) - Service name: "stt"
  user_id (integer) - Your user ID
  credit_balance (number) - Available credits
  workspace (string) - Workspace name

connected — After start message is processed

{
  "type": "connected",
  "server_info": {
    "server_type": "60db STT",
    "device": "cuda",
    "model": "60db-stt-v01",
    "processing_mode": "sentence_based_modular",
    "supported_languages": { "en": "English", "hi": "Hindi" },
    "total_languages": 40,
    "features": {
      "sentence_based_processing": true,
      "real_time_streaming": true,
      "telephony_support": true,
      "unicode_support": true,
      "mixed_language_support": true
    }
  }
}

speech_started — VAD detected voice activity

{
  "type": "speech_started",
  "timestamp": 1700000000.123
}
Use this for barge-in: interrupt TTS playback when this arrives. Fired after 2 consecutive VAD-positive chunks (~64ms of confirmed speech).
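A barge-in handler can be as small as this sketch (makeBargeInHandler and the playback callback are our names; tts.stop() is a placeholder for your own playback control):

```javascript
// Interrupt TTS playback as soon as the caller starts talking.
function makeBargeInHandler(stopPlayback) {
  return (data) => {
    const msg = JSON.parse(data);
    if (msg.type === 'speech_started') stopPlayback(); // cut playback immediately
  };
}

// Usage: ws.on('message', makeBargeInHandler(() => tts.stop()));
```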

transcription — Transcription result

All results (interim and final) share the same transcription type; differentiate with the flags below.

Final result (is_final=true, speech_final=true):
{
  "type": "transcription",
  "text": "Hello, how are you?",
  "confidence": 0.87,
  "language": "en",
  "language_name": "EN",
  "is_final": true,
  "speech_final": true,
  "is_partial": false,
  "sentence_id": 3,
  "processing_mode": "sentence_complete",
  "duration": 1.82,
  "latency": 0.43,
  "timestamp": 1700000000.456,
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.94 },
    { "word": "how",   "start": 0.35, "end": 0.52, "confidence": 0.92 }
  ],
  "utterance_end_ms": 1820
}
Empty speech_final signal (text="", is_final=true, speech_final=true): Sent when audio was detected but transcription was rejected (silence, hallucination, low confidence, wrong language). Client should reset its state on this message and not treat it as an error.
{
  "type": "transcription",
  "text": "",
  "confidence": 0.0,
  "is_final": true,
  "speech_final": true,
  "processing_mode": "speech_end_no_result",
  "timestamp": 1700000000.789
}
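A sketch of the recommended handling, separating the empty signal from real finals (the callback names are ours):

```javascript
// Separate the empty speech-end signal from a real canonical final.
function onFinalTranscription(msg, { resetUtteranceState, handleFinal }) {
  if (msg.type !== 'transcription' || !msg.is_final || !msg.speech_final) return;
  if (msg.text === '') {
    resetUtteranceState(); // rejected audio: reset state, do not treat as error
  } else {
    handleFinal(msg);      // canonical text, safe to render/route
  }
}
```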
Interim result (is_final=false, speech_final=false) — only sent when interim_results_frequency is set:
{
  "type": "transcription",
  "text": "Hello how",
  "confidence": 0.72,
  "language": "en",
  "is_final": false,
  "speech_final": false,
  "is_partial": true
}
Use interims only for barge-in word-count checks. Never send interim text to the LLM — a final with is_final=true, speech_final=true will follow.

Response Fields:
  text (string) - Transcribed text. Empty string = speech-end-no-result signal.
  confidence (number) - 0.0–1.0. Telephony typically 0.35–0.75; browser 0.55–0.95.
  language (string) - Detected language code, e.g. "en".
  language_name (string) - Uppercase language code, e.g. "EN".
  is_final (boolean) - true = end of speech reached. May still be followed by a canonical upgrade if LLM refinement is active.
  speech_final (boolean) - true = canonical answer, will not be revised. When LLM refinement is on, one is_final: true, speech_final: false event is followed by one is_final: true, speech_final: true. When refinement is off, every final is speech_final: true. See Canonical-answer semantics.
  is_partial (boolean) - true for interim results only.
  sentence_id (integer) - Monotonically increasing counter per session.
  duration (number) - Duration (seconds) of the audio segment transcribed.
  latency (number) - Seconds from processing start to result ready (excludes queue time).
  words (array) - Word-level timestamps [{word, start, end, confidence}]. Note: the field is confidence, not probability. Present on finals; empty on interims.
  utterance_end_ms (integer) - Timestamp (ms) of last word in the utterance.
  processing_mode (string) - Internal mode string, useful for debugging.

Canonical-answer semantics: speech_final

is_final and speech_final are NOT identical when LLM refinement is active — they split into two distinct meanings:
is_final | speech_final | Meaning
false    | false        | Interim partial — text may still change as more audio arrives.
true     | false        | End-of-speech reached, dict-corrected text. The LLM is still processing. A follow-up event with speech_final: true will arrive shortly with the canonical text. Only emitted when refinement is active for this utterance.
true     | true         | Canonical answer. Definitive, will not be revised. LLM-refined text when context was supplied, otherwise the original ASR text.
The same sentence_id is echoed across both phases so clients can reconcile. Canonical event example (after LLM refinement):
{
  "type": "transcription",
  "sentence_id": 3,
  "text": "डॉक्टर साहब, मेरा sugar level बहुत high है। Metformin की dose बढ़ाओ।",
  "confidence": 0.87,
  "language": "hi",
  "language_name": "HI",
  "is_final": true,
  "speech_final": true,
  "is_partial": false,
  "duration": 1.82,
  "words": [
    { "word": "डॉक्टर", "start": 0.0, "end": 0.32, "confidence": 0.94 }
  ],
  "speakers": null,
  "timestamp": 1700000000.789
}
Guarantees:
  • Exactly one canonical event per utterance. When refinement is on, you get two transcription events per utterance (first emit + canonical). When refinement is off, you get one (speech_final: true). Never zero, never three.
  • Same sentence_id across both phases. Reconcile on that key.
  • The canonical always arrives. Consumers waiting on speech_final: true never hang.
  • sentence_id ordering is preserved per session, but canonicals are NOT guaranteed to arrive in sentence_id order when LLM is on — two utterances finalizing close in time may complete refinement out of order. Key on sentence_id, not arrival order.
  • words[] corresponds to the original ASR output on both phases — the LLM does not realign tokens. Use words[] for word-level timing, text for display.
Recommended client patterns: Simplest — don’t care about the first-emit optimization:
function onMessage(msg) {
  if (msg.type === 'transcription' && msg.speech_final) {
    // Canonical — render and forget
    render(msg);
  }
  // Ignore is_final && !speech_final  (intermediate, will be replaced)
  // Ignore is_partial                 (interim, handle separately if needed)
}
With fast first-emit (UX-aware):
const livePainted = new Map();   // sentence_id → line slot

function onMessage(msg) {
  if (msg.type !== 'transcription') return;
  const sid = msg.sentence_id;

  if (msg.is_partial) { renderPartial(msg); return; }

  let entry = livePainted.get(sid);
  if (!entry) {
    entry = createLine();
    livePainted.set(sid, entry);
  }
  entry.text    = msg.text;
  entry.pending = msg.is_final && !msg.speech_final;   // dim while LLM runs
  render(entry);

  if (msg.speech_final) {
    finalize(entry);
    livePainted.delete(sid);
  }
}
For voicebot NLU routing: feed the first-emit text (speech_final: false) to NLU immediately for fast intent dispatch — don’t wait for canonical. If your NLU benefits from proper-noun accuracy (name-spelling slots, drug-name lookup), run a second-pass call on the canonical (speech_final: true) text and reconcile on sentence_id.
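That routing pattern can be sketched as follows, reconciled on sentence_id; fastIntent and refineSlots stand in for your own NLU calls and are not part of this API:

```javascript
// Two-phase NLU routing, reconciled on sentence_id.
const pendingIntents = new Map(); // sentence_id -> fast-pass intent result

function routeToNlu(msg) {
  if (msg.type !== 'transcription' || !msg.is_final) return;
  if (!msg.speech_final) {
    // First emit: dispatch immediately for low-latency intent handling.
    pendingIntents.set(msg.sentence_id, fastIntent(msg.text));
  } else {
    // Canonical: re-run proper-noun-sensitive slots, reconcile with the fast pass.
    const fast = pendingIntents.get(msg.sentence_id);
    pendingIntents.delete(msg.sentence_id);
    refineSlots(msg.text, fast); // fast is undefined when refinement was off
  }
}
```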
Legacy refined event. Earlier builds emitted a separate refined event ~400 ms after the final instead of a second transcription. The 60db /ws/stt proxy transparently handles both shapes — if you’re still seeing refined events in the wire trace, upstream workers haven’t been restarted onto the two-phase build yet. New client code should target the two-phase flow only; refined is accepted but deprecated.

language_changed — After config message changes language

{
  "type": "language_changed",
  "language": "Multi-language: HI",
  "language_code": ["hi"]
}

mode_changed — After config message changes continuous_mode

{
  "type": "mode_changed",
  "continuous_mode": true,
  "mode_name": "continuous",
  "silence_threshold": 0.5
}

session_stopped — After stop is processed

{
  "type": "session_stopped",
  "billing_summary": {
    "total_duration_seconds": 5.2,
    "total_cost": 0.000043,
    "characters_transcribed": 42
  }
}

error — Processing error

{
  "type": "error",
  "error": "Audio processing error: ...",
  "timestamp": 1700000000.0
}

test_response — Reply to test ping

{
  "type": "test_response",
  "message": "pong - Sentence-based STT ready",
  "timestamp": 1700000000000,
  "processing_mode": "sentence_based"
}

Complete Example

Audio Requirements

Property    | Telephony (μ-law) | Browser (PCM)
Encoding    | mulaw (8-bit)     | linear (16-bit)
Sample Rate | 8000 Hz           | 16000, 24000, 44100, 48000 Hz
Chunk Size  | 480 bytes (60ms)  | 960-1920 bytes (60-120ms)
Channels    | Mono (1 channel)  | Mono (1 channel)

Supported Languages

39 languages total, backed by Parakeet-TDT (25 European), Vaani-FastConformer (13 Indic + Hinglish), and FC-Arabic (MSA). Fetch the full catalog from GET /stt/languages.

Not supported (explicit rejection, no silent aliasing): ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he. Arabic dialect tags (ar-eg, ar-lv, …) return dialect_not_supported — pass ar for best-effort MSA.

Common languages:

Code | Language  | Code | Language
en   | English   | hi   | Hindi
bn   | Bengali   | es   | Spanish
fr   | French    | de   | German
gu   | Gujarati  | ta   | Tamil
te   | Telugu    | kn   | Kannada
ml   | Malayalam | mr   | Marathi
pa   | Punjabi   | ar   | Arabic

Pricing

  • Rate: $0.00000833 per second
  • Minimum: $0.01 per session
  • Billing: Per second of audio processed
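A back-of-envelope cost estimate from the stated rate (a sketch; actual billing may round differently, and how the $0.01 session minimum interacts with the metered rate is not specified here):

```javascript
// Estimate metered cost for a session from the published per-second rate.
const RATE_PER_SECOND = 0.00000833; // USD

function estimateCost(seconds) {
  return seconds * RATE_PER_SECOND;
}

// 5.2 s of audio (the session_stopped example) works out to roughly $0.0000433,
// consistent with that example's total_cost of 0.000043.
```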