STT WebSocket

Real-time Speech-to-Text transcription via WebSocket streaming with support for 39 languages (including code-switched Indic+English) and telephony integration. Powered by 60db STT v01 (a non-hallucinating, multi-backend speech recognition stack).

Endpoint

Authentication

Query parameter authentication: Examples:

ws://api.60db.ai/ws/stt?apiKey=sk_live_your_api_key
ws://api.60db.ai/ws/stt?token=eyJ...&workspace_id=24

The WebSocket connection checks workspace wallet balance before starting a session. If the workspace has insufficient credits, the connection is closed with a 1008 status code and an INSUFFICIENT_CREDITS error.

Connection Details

Property	Value
Protocol	WebSocket (RFC 6455)
Frame types	Binary (telephony) or Text/JSON (browser)
Ping/keepalive	Server sends WebSocket pings every 30s (timeout 10s)

The same port accepts plain HTTP GET requests and responds 200 ok — safe for load-balancer health checks.

Session Lifecycle

Client                                    Server
  │                                         │
  │──── TCP/TLS connect ───────────────────►│
  │◄─── {"type":"connecting", ...} ─────────│  authenticating...
  │◄─── {"type":"connected", server_info} ──│  proxy wired to upstream
  │◄─── {"type":"connection_established"} ──│  ready — safe to send `start`
  │                                         │
  │──── {"type":"start", ...} ─────────────►│
  │◄─── {"type":"session_started", ...} ────│  safe to send audio
  │                                         │
  │──── audio frames / JSON audio ─────────►│
  │◄─── {"type":"speech_started"} ──────────│  VAD detected voice
  │◄─── {"type":"transcription", ...} ──────│  is_final=true, speech_final=false (first emit, context only)
  │◄─── {"type":"transcription", ...} ──────│  is_final=true, speech_final=true  (canonical answer)
  │                                         │
  │──── {"type":"stop"} ────────────────────►│
  │◄─── {"type":"session_stopped"} ─────────│

Two-phase finals (context-gated LLM refinement). When you supply a context object on start, every utterance produces two transcription events sharing a sentence_id:

First emit — is_final: true, speech_final: false — fast dict-corrected text. Use for low-latency UI paint and barge-in.
Canonical — is_final: true, speech_final: true — definitive LLM-refined answer. Always arrives.

When context is omitted, every utterance produces a single transcription event with is_final: true, speech_final: true (no first emit). Simple consumers can gate exclusively on speech_final: true regardless of whether refinement is on.

Do not send the start message until connection_established is received. The backend proxy attaches its client-message listener only after authenticating and opening the upstream connection. Messages sent earlier will be silently dropped. Likewise, do not send audio messages until session_started is received — the upstream returns unknown message type: audio if it arrives before the session is ready.

Client → Server Messages

`start` — Begin session

Sent once after connection is established. Must be sent before any audio.

{
  "type": "start",
  "languages": ["en", "hi"],
  "context": {
    "general": [
      { "key": "domain",  "value": "Healthcare" },
      { "key": "doctor",  "value": "Dr. Martha Smith" }
    ],
    "text":  "Routine diabetes follow-up consultation.",
    "terms": ["Celebrex", "Zyrtec", "Metformin", "HbA1c"]
  },
  "config": {
    "encoding": "linear",
    "sample_rate": 48000,
    "utterance_end_ms": 500,
    "continuous_mode": true,
    "interim_results_frequency": 300,
    "audio_enhancement": "adaptive",
    "diarize": false
  }
}

Parameters:

Never send languages: "auto" or languages: ["auto"]. The auto-detect entry in GET /stt/languages is a convenience for the REST /stt form-upload flow only. On WebSocket, the server only accepts real ISO codes and uses null as the auto-detect signal. Sending "auto" returns language 'auto' is not in the v1 supported. The 60db WebSocket proxy (/ws/stt) strips the string "auto" from incoming start and config messages as a safety net, but your client should send null directly.

`audio` — JSON audio chunk (browser mode)

{
  "type": "audio",
  "audio": "<base64-encoded Int16 PCM or μ-law bytes>",
  "encoding": "linear",
  "sample_rate": 48000,
  "timestamp": 1700000000000
}

Fields:

Binary frame — raw μ-law audio (telephony mode)

Send a raw WebSocket binary frame with μ-law bytes, no JSON wrapper. The server auto-detects this as telephony mode on the first binary frame.

Recommended chunk size: 480 bytes = 60ms at 8kHz
Twilio default: 160 bytes = 20ms — batch 3 chunks into 60ms before sending

`config` — Change language mid-session

{
  "type": "config",
  "languages": ["hi"],
  "continuous_mode": true
}

Both languages and continuous_mode are optional; include only fields you want to change. Send "languages": null to revert to auto-detect.

`stop` — End session

{
  "type": "stop"
}

Server processes any remaining audio buffer, sends session_stopped, then closes.

`test` — Ping / latency check

{
  "type": "test",
  "message": "ping",
  "timestamp": 1700000000000
}

Server echoes test_response with the same timestamp for round-trip measurement.

Server → Client Messages

`connecting` — Authentication in progress

{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}

`connection_established` — Authentication successful

{
  "connection_established": {
    "service": "stt",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}

Fields:

service

string

Service name: "stt"

user_id

integer

Your user ID

credit_balance

number

Available credits

workspace

string

Workspace name

`connected` — Proxy wired to upstream STT server

Sent by the backend proxy after it has opened the upstream connection and attached its client-message listener. After this point, client messages are no longer dropped.

{
  "type": "connected",
  "server_info": {
    "server_type": "60db STT",
    "ready": true,
    "total_languages": 40,
    "features": {
      "vad_segmentation": true,
      "multi_language_per_session": true,
      "max_languages_per_session": 5,
      "telephony_mulaw": true,
      "browser_pcm": true,
      "continuous_mode": true,
      "interim_results": true,
      "speaker_diarization": true,
      "code_switching_indic_english": true,
      "non_hallucinating": true
    },
    "timing": {
      "min_utterance_end_ms": 300,
      "default_utterance_end_ms": 500,
      "max_utterance_seconds": 30
    }
  }
}

`session_started` — `start` message processed, audio is now accepted

Sent by the upstream STT server after a start message is received and validated. It is safe to begin sending audio frames immediately after this event.

{
  "type": "session_started",
  "session_id": "sess_8c3d1a9f4b7e2c51",
  "language": "Multi-language: EN, HI",
  "languages": ["en", "hi"],
  "device": "cuda:0",
  "model": "60db-stt-v01",
  "processing_mode": "sentence_based_continuous",
  "continuous_mode": true,
  "interim_frequency": 300,
  "diarize": false
}

`speech_started` — VAD detected voice activity

{
  "type": "speech_started",
  "timestamp": 1700000000.123
}

Use this for barge-in: interrupt TTS playback when this arrives. Fired after 2 consecutive VAD-positive chunks (~64ms of confirmed speech).

`transcription` — Transcription result

All results (interim and final) share the same transcription type — differentiate with flags. Final result (is_final=true, speech_final=true):

{
  "type": "transcription",
  "text": "Hello, how are you?",
  "confidence": 0.87,
  "language": "en",
  "language_name": "EN",
  "is_final": true,
  "speech_final": true,
  "is_partial": false,
  "sentence_id": 3,
  "duration": 1.82,
  "latency": 0.43,
  "timestamp": 1700000000.456,
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.94 },
    { "word": "how",   "start": 0.35, "end": 0.52, "confidence": 0.92 }
  ],
  "speakers": [
    { "speaker": "SPEAKER_00", "start": 0.0, "end": 1.82 }
  ]
}

Empty speech_final signal (text="", is_final=true, speech_final=true): Sent when audio was detected but transcription was rejected (silence, hallucination, low confidence, wrong language). Client should reset its state on this message and not treat it as an error.

{
  "type": "transcription",
  "text": "",
  "confidence": 0.0,
  "is_final": true,
  "speech_final": true,
  "processing_mode": "speech_end_no_result",
  "timestamp": 1700000000.789
}

Interim result (is_final=false, speech_final=false) — only sent when interim_results_frequency is set:

{
  "type": "transcription",
  "text": "Hello how",
  "confidence": 0.72,
  "language": "en",
  "is_final": false,
  "speech_final": false,
  "is_partial": true
}

Use interims only for barge-in word-count checks. Never send interim text to the LLM — a final with is_final=true, speech_final=true will follow. Response Fields:

text

string

Transcribed text. Empty string = speech-end-no-result signal.

confidence

number

0.0–1.0. Telephony typically 0.35–0.75; browser 0.55–0.95.

language

string

Detected language code e.g. "en".

language_name

string

Uppercase language code (e.g. "EN") in the WS shape. Note: REST /stt returns the full English name ("English") here — WS preserves the legacy uppercase-code shape for client compatibility.

is_final

boolean

true = end of speech reached. May still be followed by a canonical upgrade if LLM refinement is active.

speech_final

boolean

true = canonical answer, will not be revised. When LLM refinement is on, one is_final: true, speech_final: false event is followed by one is_final: true, speech_final: true. When refinement is off, every final is speech_final: true. See Canonical-answer semantics.

is_partial

boolean

true for interim results only.

sentence_id

integer

Monotonically increasing counter per session.

duration

number

Duration (seconds) of the audio segment transcribed.

latency

number

Seconds from processing start to result ready (excludes queue time).

words

array

Word-level timestamps [{word, start, end, confidence, boosted?, original?}]. Note: the field name is confidence, not probability (60db STT convention — different from legacy Whisper docs). Present on finals; empty on interims.When the keyword/context-terms boost replaced a word, the entry includes boosted: true and original (the pre-boost word):

{ "word": "Acme", "start": 1.5, "end": 2.0, "confidence": 0.85,
  "boosted": true, "original": "akmie" }

The segment-level text is already rebuilt from boosted words upstream — no client-side stitching required. Recommended UI: subtle underline on boosted: true words, with original shown on hover.

speakers

array

List of [{speaker, start, end}] diarization turns when config.diarize=true. Omitted or null otherwise. Raw speaker IDs look like SPEAKER_00, SPEAKER_01; clients typically re-label these as “Speaker 1”, “Speaker 2” in order of first appearance.

processing_mode

string

Marker for utterances that should be skipped by the consumer (no useful text, never billed). Omitted on ordinary finals.

Value	Meaning
`speech_end_no_result`	Speech ended but the recognizer produced no text.
`speech_end_too_short`	Utterance below the minimum duration to recognize.
`hallucination_rejected`	Word-rate guard rejected output as likely hallucination.
`low_snr_dropped`	Audio dropped before LID/ASR; SNR below floor. No credits charged. Treat as “skip this utterance — no useful text”.

snr_db

number

Audio signal-to-noise ratio (dB) for this utterance, when measured. Optional. Surface as a “good / fair / poor” badge: >= 15 good, 0–15 fair, < 0 poor.

Canonical-answer semantics: `speech_final`

is_final and speech_final are NOT identical when LLM refinement is active — they split into two distinct meanings:

`is_final`	`speech_final`	Meaning
`false`	`false`	Interim partial — text may still change as more audio arrives.
`true`	`false`	End-of-speech reached, dict-corrected text. The LLM is still processing. A follow-up event with `speech_final: true` will arrive shortly with the canonical text. Only emitted when refinement is active for this utterance.
`true`	`true`	Canonical answer. Definitive, will not be revised. LLM-refined text when context was supplied, otherwise the original ASR text.

The same sentence_id is echoed across both phases so clients can reconcile. Canonical event example (after LLM refinement):

{
  "type": "transcription",
  "sentence_id": 3,
  "text": "डॉक्टर साहब, मेरा sugar level बहुत high है। Metformin की dose बढ़ाओ।",
  "confidence": 0.87,
  "language": "hi",
  "language_name": "HI",
  "is_final": true,
  "speech_final": true,
  "is_partial": false,
  "duration": 1.82,
  "words": [
    { "word": "डॉक्टर", "start": 0.0, "end": 0.32, "confidence": 0.94 }
  ],
  "speakers": null,
  "timestamp": 1700000000.789
}

Guarantees:

Exactly one canonical event per utterance. When refinement is on, you get two transcription events per utterance (first emit + canonical). When refinement is off, you get one (speech_final: true). Never zero, never three.
Same sentence_id across both phases. Reconcile on that key.
The canonical always arrives. Consumers waiting on speech_final: true never hang.
sentence_id ordering is preserved per session, but canonicals are NOT guaranteed to arrive in sentence_id order when LLM is on — two utterances finalizing close in time may complete refinement out of order. Key on sentence_id, not arrival order.
words[] corresponds to the original ASR output on both phases — the LLM does not realign tokens. Use words[] for word-level timing, text for display.

Recommended client patterns: Simplest — don’t care about the first-emit optimization:

function onMessage(msg) {
  if (msg.type === 'transcription' && msg.speech_final) {
    // Canonical — render and forget
    render(msg);
  }
  // Ignore is_final && !speech_final  (intermediate, will be replaced)
  // Ignore is_partial                 (interim, handle separately if needed)
}

With fast first-emit (UX-aware):

const livePainted = new Map();   // sentence_id → line slot

function onMessage(msg) {
  if (msg.type !== 'transcription') return;
  const sid = msg.sentence_id;

  if (msg.is_partial) { renderPartial(msg); return; }

  let entry = livePainted.get(sid);
  if (!entry) {
    entry = createLine();
    livePainted.set(sid, entry);
  }
  entry.text    = msg.text;
  entry.pending = msg.is_final && !msg.speech_final;   // dim while LLM runs
  render(entry);

  if (msg.speech_final) {
    finalize(entry);
    livePainted.delete(sid);
  }
}

For voicebot NLU routing: feed the first-emit text (speech_final: false) to NLU immediately for fast intent dispatch — don’t wait for canonical. If your NLU benefits from proper-noun accuracy (name-spelling slots, drug-name lookup), run a second-pass call on the canonical (speech_final: true) text and reconcile on sentence_id.

Legacy refined event. Earlier builds emitted a separate refined event ~400 ms after the final instead of a second transcription. The 60db /ws/stt proxy transparently handles both shapes — if you’re still seeing refined events in the wire trace, upstream workers haven’t been restarted onto the two-phase build yet. New client code should target the two-phase flow only; refined is accepted but deprecated.

`language_changed` — After `config` message changes language

{
  "type": "language_changed",
  "language": "Multi-language: HI",
  "language_code": ["hi"]
}

`mode_changed` — After `config` message changes `continuous_mode`

{
  "type": "mode_changed",
  "continuous_mode": true,
  "mode_name": "continuous",
  "silence_threshold": 0.5
}

`session_stopped` — After `stop` is processed

{
  "type": "session_stopped",
  "billing_summary": {
    "total_duration_seconds": 12.40,
    "total_cost": 0.000620,
    "characters_transcribed": 188,
    "client_estimated_seconds": 12.42
  }
}

billing_summary fields:

Field	Type	Notes
`total_duration_seconds`	number	The audio actually billed — sum of canonical finals that passed the billable predicate. Sessions that previously double-billed when the client sent `stop` are now billed correctly; expect ~50% lower numbers for affected sessions vs historical values.
`total_cost`	number	USD cost for the session.
`characters_transcribed`	number	Total characters across canonical transcriptions.
`client_estimated_seconds`	number	Diagnostic only — rough estimate of audio the client sent. Useful for debugging duration drift; never display as a billed number.

`error` — Processing error

{
  "type": "error",
  "error": "Audio processing error: ...",
  "timestamp": 1700000000.0
}

Concurrency-limit error frame

When a user has reached their per-user STT session cap (counted across REST + WS combined), the server sends an error frame and closes with code 1008:

{
  "type": "error",
  "error": "Too many concurrent STT sessions for this user",
  "error_code": "STT_CONCURRENCY_LIMIT",
  "details": { "limit": 8 }
}

Existing sessions are unaffected; the cap releases when an in-flight session ends. Do not auto-reconnect on 1008 — the limit only frees when an in-flight session completes. The frontend should distinguish concurrency-limit closes from auth/network failures so the UI message is correct.

`test_response` — Reply to `test` ping

{
  "type": "test_response",
  "message": "pong - Sentence-based STT ready",
  "timestamp": 1700000000000,
  "processing_mode": "sentence_based"
}

Complete Example

Audio Requirements

Property	Telephony (μ-law)	Browser (PCM)
Encoding	`mulaw` (8-bit)	`linear` (16-bit)
Sample Rate	8000 Hz	16000, 24000, 44100, 48000 Hz
Chunk Size	480 bytes (60ms)	960-1920 bytes (60-120ms)
Channels	Mono (1 channel)	Mono (1 channel)

Supported Languages

39 transcription languages total (25 European, 13 Indic with Hinglish code-switching, and Arabic MSA). Fetch the full catalog from GET /stt/languages.

Code	Language	Code	Language
`en`	English	`hi`	Hindi
`es`	Spanish	`bn`	Bengali
`fr`	French	`mr`	Marathi
`de`	German	`pa`	Punjabi
`it`	Italian	`gu`	Gujarati
`pt`	Portuguese	`ta`	Tamil
`nl`	Dutch	`te`	Telugu
`pl`	Polish	`kn`	Kannada
`ru`	Russian	`ml`	Malayalam
`uk`	Ukrainian	`or`	Odia
`cs`	Czech	`as`	Assamese
`sv`	Swedish	`ne`	Nepali
`ar`	Arabic (MSA)	`sa`	Sanskrit

Code-switching (Indic + English): hi+en, bn+en, mr+en, pa+en, gu+en, or+en, as+en, ne+en, te+en, kn+en, ta+en, ml+en — collapses to the fast path when both languages share the same pipeline. Not supported (explicit rejection): ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he. These return an unsupported_language error — there is no silent aliasing. Arabic dialect tags (ar-eg, ar-lv, ar-gu, ar-ma) return dialect_not_supported — pass ar for best-effort MSA transcription of dialectal audio.

Limitations and Best Practices

Handshake ordering (most common mistake) Do not send start in ws.onopen. Wait for the proxy’s connection_established message first — it marks the point at which the proxy has attached its client-message listener. Likewise, wait for session_started before sending audio frames, otherwise the upstream server returns unknown message type: audio. utterance_end_ms clamp Values below 300 ms are silently clamped to 300 ms. Sub-300 ms would fragment utterances and cause the non-hallucinating backends to drop short segments. Language count Max 5 languages per session. Cross-backend multi-language (e.g. ["en","ar"]) runs per-utterance LID which adds ~20–50 ms per utterance. Same-backend multi-language (e.g. ["en","hi"]) collapses to a single backend’s fast path with zero LID overhead — use this whenever possible. Telephony confidence 8 kHz μ-law → 16 kHz resampling reduces backend confidence by ~0.10–0.15 compared to native wideband input. Use a client-side threshold of 0.35 for telephony vs 0.55 for browser. Buffer limits

Pre-speech ring buffer: 1.0 s (captures first word before VAD fires)
Minimum coalesce before VAD: 160 ms (μ-law 1280 bytes, linear sample_rate × 2 × 160 / 1000 bytes)
Maximum utterance duration: 30 s (anything longer force-finalizes)

VAD

Speech start: Silero probability > STT_VAD_THRESHOLD (default 0.5, server-configurable)
Silence → utterance end: utterance_end_ms of consecutive sub-threshold audio
No separate continuation threshold

Non-hallucinating architecture The 60db STT backend uses non-hallucinating architectures that do not output no_speech_prob — the CTC/RNN-T topology emits blank tokens on non-speech. Hallucination guard is a word-rate sanity check (> 5 words/second → rejected). Diarization diarize=true requires HF_TOKEN on the server plus gated model approval for pyannote/speaker-diarization-3.1. Without both, the request silently falls back to a deterministic mock. Check session_started.diarize to confirm the request was accepted.

Pricing

Rate: $0.00000833 per second
Minimum: $0.01 per session
Billing: Per second of audio processed

Error Codes

Close code	`error_code` (in preceding error frame)	Description
1008	`UNAUTHENTICATED`	Authentication failed
1008	`INSUFFICIENT_CREDITS`	Workspace wallet has no funds
1008	`STT_CONCURRENCY_LIMIT`	Per-user concurrency cap reached. `details.limit` carries the active cap. Do not auto-reconnect; cap releases when an in-flight session ends.
1011	—	Internal server error
1006	—	Connection lost

Testing

# Install wscat
npm install -g wscat

# Test STT
wscat -c "ws://api.60db.ai/ws/stt?apiKey=sk_live_your_key"

Then send:

{"type":"start","languages":["en"],"config":{"encoding":"mulaw","sample_rate":8000,"continuous_mode":true}}

TTS WebSocket - Text-to-Speech endpoint
WebSocket API Reference - Complete documentation
Voices API - Get available voices

​STT WebSocket

​Endpoint

​Authentication

​Connection Details

​Session Lifecycle

​Client → Server Messages

​start — Begin session

​audio — JSON audio chunk (browser mode)

​Binary frame — raw μ-law audio (telephony mode)

​config — Change language mid-session

​stop — End session

​test — Ping / latency check

​Server → Client Messages

​connecting — Authentication in progress

​connection_established — Authentication successful

​connected — Proxy wired to upstream STT server

​session_started — start message processed, audio is now accepted

​speech_started — VAD detected voice activity

​transcription — Transcription result

​Canonical-answer semantics: speech_final

​language_changed — After config message changes language

​mode_changed — After config message changes continuous_mode

​session_stopped — After stop is processed

​error — Processing error

​Concurrency-limit error frame

​test_response — Reply to test ping

​Complete Example

​Audio Requirements

​Supported Languages

​Limitations and Best Practices

​Pricing

​Error Codes

​Testing

​Related

STT WebSocket

Endpoint

Authentication

Connection Details

Session Lifecycle

Client → Server Messages

`start` — Begin session

`audio` — JSON audio chunk (browser mode)

Binary frame — raw μ-law audio (telephony mode)

`config` — Change language mid-session

`stop` — End session

`test` — Ping / latency check

Server → Client Messages

`connecting` — Authentication in progress

`connection_established` — Authentication successful

`connected` — Proxy wired to upstream STT server

`session_started` — `start` message processed, audio is now accepted

`speech_started` — VAD detected voice activity

`transcription` — Transcription result

Canonical-answer semantics: `speech_final`

`language_changed` — After `config` message changes language

`mode_changed` — After `config` message changes `continuous_mode`

`session_stopped` — After `stop` is processed

`error` — Processing error

Concurrency-limit error frame

`test_response` — Reply to `test` ping

Complete Example

Audio Requirements

Supported Languages

Limitations and Best Practices

Pricing

Error Codes

Testing

Related