STT WebSocket API
Real-time Speech-to-Text transcription via WebSocket streaming, with support for 39 languages (including code-switched Indic+English) and telephony integration. Powered by 60db STT v01 (a non-hallucinating, multi-backend speech recognition stack).

🚀 Quick Start (Copy & Paste)
- ✅ Authenticated
- ✅ Ready! Send audio now
- 📝 Hello world (transcribed text)
- ✅ Complete! Cost: $0.000043
📖 How It Works (5 Simple Steps)
1. Connect with your API key
2. Send `{ type: "start", ... }` to begin the session
3. Stream audio data (binary chunks)
4. Receive text transcriptions in real-time
5. Stop with `{ type: "stop" }` when done
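The five steps above can be sketched as client-side message builders. A minimal sketch in Python, using only the `type`, `languages`, and `stop` fields shown in this document; the full start schema and endpoint URL should come from the sections below and your account docs:

```python
import json

def start_message(languages=None):
    """Step 2: sent once after connecting, before any audio."""
    msg = {"type": "start"}
    if languages is not None:
        msg["languages"] = languages  # e.g. ["en", "hi"]; omit for auto-detect
    return json.dumps(msg)

def stop_message():
    """Step 5: sent when the client is done streaming."""
    return json.dumps({"type": "stop"})

# Typical flow (ws is a connected WebSocket, authenticated via API key):
#   ws.send(start_message(["en"]))     # step 2
#   ws.send(audio_chunk)               # step 3: binary frames
#   ...handle transcription events...  # step 4
#   ws.send(stop_message())            # step 5
```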
Endpoint
Authentication
Query parameter authentication: pass your API key as a query parameter on the WebSocket URL.

Connection Details
| Property | Value |
|---|---|
| Protocol | WebSocket (RFC 6455) |
| Frame types | Binary (telephony) or Text/JSON (browser) |
| Ping/keepalive | Server sends WebSocket pings every 30s (timeout 10s) |
Session Lifecycle
When a context object is supplied on start, every utterance produces two transcription events sharing a sentence_id:
- First emit — `is_final: true, speech_final: false` — fast dict-corrected text. Use it for low-latency UI paint and barge-in.
- Canonical — `is_final: true, speech_final: true` — the definitive LLM-refined answer. Always arrives.

When context is omitted, every utterance produces a single transcription event with `is_final: true, speech_final: true` (no first emit). Simple consumers can gate exclusively on `speech_final: true` and ignore the rest — that gives them exactly one canonical event per utterance regardless of whether refinement is on.
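For such simple consumers the gate can be a one-line predicate. A sketch, assuming events arrive as parsed JSON dicts with a `type` field:

```python
def is_canonical(event: dict) -> bool:
    """True for the single definitive transcription per utterance.

    Holds whether or not LLM refinement is active: refinement adds an
    extra first-emit event, but exactly one event per utterance carries
    speech_final=True.
    """
    return event.get("type") == "transcription" and event.get("speech_final") is True
```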
Client → Server Messages
start — Begin session
Sent once after connection is established. Must be sent before any audio.
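An illustrative start payload, assembled from fields mentioned elsewhere in this document (`languages`, `continuous_mode`, `context`, `interim_results_frequency`). The value shapes, especially for `context`, are assumptions, not a verbatim schema:

```json
{
  "type": "start",
  "languages": ["en", "hi"],
  "continuous_mode": true,
  "interim_results_frequency": 1,
  "context": { "domain": "healthcare" }
}
```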
audio — JSON audio chunk (browser mode)
Binary frame — raw μ-law audio (telephony mode)
Send a raw WebSocket binary frame with μ-law bytes, no JSON wrapper. The server auto-detects this as telephony mode on the first binary frame.

config — Change language mid-session
languages and continuous_mode are optional; include only fields you want to change.
Send "languages": null to revert to auto-detect.
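Illustrative config payloads under that rule (field names from this section; exact shapes assumed). The first switches language, the second reverts to auto-detect:

```json
{ "type": "config", "languages": ["de"] }
```

```json
{ "type": "config", "languages": null }
```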
stop — End session
The server replies with session_stopped, then closes the connection.
test — Ping / latency check
The server replies with test_response carrying the same timestamp, for round-trip measurement.
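A round-trip measurement sketch, assuming the ping carries a millisecond timestamp in a field named `timestamp` that the server echoes back (the exact field name is an assumption):

```python
import json
import time

def test_message(now_ms=None):
    """Client ping; the server echoes the timestamp in test_response."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return json.dumps({"type": "test", "timestamp": now_ms})

def round_trip_ms(response: dict, now_ms: int) -> int:
    """Latency = local receive time minus the echoed send timestamp."""
    return now_ms - response["timestamp"]
```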
Server → Client Messages
connecting — Authentication in progress
connection_established — Authentication successful
The payload includes:
- Service name: "stt"
- Your user ID
- Available credits
- Workspace name
connected — After start message is processed
speech_started — VAD detected voice activity
transcription — Transcription result
All results (interim and final) share the same transcription type — differentiate with flags.
Final result (`is_final=true`, `speech_final=true`).
Speech-end-no-result (`text=""`, `is_final=true`, `speech_final=true`): sent when audio was detected but transcription was rejected (silence, hallucination, low confidence, wrong language). The client should reset its state on this message and not treat it as an error.
Interim result (`is_final=false`, `speech_final=false`) — only sent when `interim_results_frequency` is set. A final with `is_final=true, speech_final=true` will follow.
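A dispatch sketch over these result shapes, assuming parsed-dict events (the empty-text check must precede the `speech_final` check, since no-result events also carry `speech_final=true`):

```python
def classify_result(event: dict) -> str:
    """Map a transcription event onto the result shapes above."""
    if not event.get("is_final"):
        return "interim"      # text may still change
    if event.get("text") == "":
        return "no_result"    # reset state; not an error
    if event.get("speech_final"):
        return "final"        # canonical for this utterance
    return "first_emit"       # fast text; the canonical will follow
```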
Response Fields:
- `text` — Transcribed text. Empty string = speech-end-no-result signal.
- `confidence` — 0.0–1.0. Telephony typically 0.35–0.75; browser 0.55–0.95.
- Detected language code, e.g. `"en"`, plus its uppercase form, e.g. `"EN"`.
- `is_final` — `true` = end of speech reached. May still be followed by a canonical upgrade if LLM refinement is active.
- `speech_final` — `true` = canonical answer, will not be revised. When LLM refinement is on, one `is_final: true, speech_final: false` event is followed by one `is_final: true, speech_final: true`. When refinement is off, every final is `speech_final: true`. See Canonical-answer semantics.
- A flag that is `true` for interim results only.
- `sentence_id` — Monotonically increasing counter per session.
- Duration (seconds) of the audio segment transcribed.
- Processing latency: seconds from processing start to result ready (excludes queue time).
- `words` — Word-level timestamps `[{word, start, end, confidence}]`. Note: the field is `confidence`, not `probability`. Present on finals; empty on interims.
- Timestamp (ms) of the last word in the utterance.
- Internal mode string — useful for debugging.
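A small helper over `words[]`, assuming the per-word `start`/`end` values are in seconds, consistent with the per-segment duration field (that unit is an assumption):

```python
def utterance_span(words):
    """(start, end) of the utterance in seconds from words[] timestamps.

    words[] is present on finals and empty on interims, so callers
    must handle the None case.
    """
    if not words:
        return None
    return words[0]["start"], words[-1]["end"]
```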
Canonical-answer semantics: speech_final
is_final and speech_final are NOT identical when LLM refinement is active — they split into two distinct meanings:
| is_final | speech_final | Meaning |
|---|---|---|
| false | false | Interim partial — text may still change as more audio arrives. |
| true | false | End-of-speech reached, dict-corrected text. The LLM is still processing. A follow-up event with speech_final: true will arrive shortly with the canonical text. Only emitted when refinement is active for this utterance. |
| true | true | Canonical answer. Definitive, will not be revised. LLM-refined text when context was supplied, otherwise the original ASR text. |
sentence_id is echoed across both phases so clients can reconcile.
Canonical event example (after LLM refinement):
- Exactly one canonical event per utterance. When refinement is on, you get two `transcription` events per utterance (first emit + canonical). When refinement is off, you get one (`speech_final: true`). Never zero, never three.
- Same `sentence_id` across both phases. Reconcile on that key.
- The canonical always arrives. Consumers waiting on `speech_final: true` never hang.
- `sentence_id` ordering is preserved per session, but canonicals are NOT guaranteed to arrive in `sentence_id` order when LLM is on — two utterances finalizing close in time may complete refinement out of order. Key on `sentence_id`, not arrival order.
- `words[]` corresponds to the original ASR output on both phases — the LLM does not realign tokens. Use `words[]` for word-level timing, `text` for display.
Tip: ship the first emit (`speech_final: false`) to NLU immediately for fast intent dispatch — don't wait for the canonical. If your NLU benefits from proper-noun accuracy (name-spelling slots, drug-name lookup), run a second-pass call on the canonical (`speech_final: true`) text and reconcile on `sentence_id`.
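The reconciliation rule (key on `sentence_id`; let the canonical overwrite the first emit) can be sketched as a small store, assuming parsed-dict events:

```python
class UtteranceStore:
    """Latest text per utterance, keyed on sentence_id.

    Canonicals may finish refinement out of sentence_id order, so the
    key is sentence_id, never arrival order.
    """
    def __init__(self):
        self.utterances = {}  # sentence_id -> (text, is_canonical)

    def on_transcription(self, event: dict) -> None:
        sid = event["sentence_id"]
        canonical = event.get("speech_final") is True
        prev = self.utterances.get(sid)
        # First emit paints fast; the canonical overwrites it. A stored
        # canonical is definitive and is never replaced.
        if prev is None or not prev[1]:
            self.utterances[sid] = (event["text"], canonical)

    def text(self, sid):
        entry = self.utterances.get(sid)
        return entry[0] if entry else None
```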
Legacy refined event
Earlier builds emitted a separate refined event ~400 ms after the final instead of a second transcription. The 60db /ws/stt proxy transparently handles both shapes — if you're still seeing refined events in the wire trace, upstream workers haven't been restarted onto the two-phase build yet. New client code should target the two-phase flow only; refined is accepted but deprecated.

language_changed — After config message changes language
mode_changed — After config message changes continuous_mode
session_stopped — After stop is processed
error — Processing error
test_response — Reply to test ping
Complete Example
Audio Requirements
| Property | Telephony (μ-law) | Browser (PCM) |
|---|---|---|
| Encoding | mulaw (8-bit) | linear (16-bit) |
| Sample Rate | 8000 Hz | 16000, 24000, 44100, 48000 Hz |
| Chunk Size | 480 bytes (60ms) | 960-1920 bytes (60-120ms) |
| Channels | Mono (1 channel) | Mono (1 channel) |
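The chunk sizes in the table follow from sample rate × duration × bytes per sample (mono); a quick check:

```python
def chunk_bytes(sample_rate_hz: int, duration_ms: int, bytes_per_sample: int) -> int:
    """Bytes in a mono audio chunk of the given duration."""
    return sample_rate_hz * duration_ms * bytes_per_sample // 1000

# Telephony: 8-bit mu-law at 8000 Hz -> 480 bytes per 60 ms frame
# Browser: 16-bit linear PCM at 16000 Hz -> 1920 bytes per 60 ms frame
```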
Supported Languages
39 languages total, backed by Parakeet-TDT (25 European), Vaani-FastConformer (13 Indic + Hinglish), and FC-Arabic (MSA). Fetch the full catalog from GET /stt/languages.
Not supported (explicit rejection, no silent aliasing): ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he. Arabic dialect tags (ar-eg, ar-lv, …) return dialect_not_supported — pass ar for best-effort MSA.
Common languages:
| Code | Language | Code | Language |
|---|---|---|---|
| en | English | hi | Hindi |
| bn | Bengali | es | Spanish |
| fr | French | de | German |
| gu | Gujarati | ta | Tamil |
| te | Telugu | kn | Kannada |
| ml | Malayalam | mr | Marathi |
| pa | Punjabi | ar | Arabic |
Pricing
- Rate: $0.00000833 per second
- Minimum: $0.01 per session
- Billing: Per second of audio processed
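The rate and minimum combine as a small helper (a sketch of the stated rules; actual invoices come from your billing dashboard):

```python
RATE_PER_SECOND = 0.00000833  # $ per second of audio processed
SESSION_MINIMUM = 0.01        # $ floor per session

def session_cost_usd(audio_seconds: float) -> float:
    """Per-second metering with a per-session minimum."""
    return max(audio_seconds * RATE_PER_SECOND, SESSION_MINIMUM)
```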
Related
- TTS WebSocket - Text-to-Speech endpoint
- WebSocket Quick Start - Get started guide
- WebSocket Playground - Test in browser