Speech-to-Text
Transcribe audio to text with auto language detection and optional speaker diarization
POST
Request
Headers
Authorization: Bearer token with your API key
Content-Type: multipart/form-data
Form Data
Audio file to transcribe.
- Supported formats: WAV, MP3, M4A, OGG, FLAC, WebM, MP4 (audio track)
- Max file size: 10 MB
- Max duration: 1 hour
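A client can reject obviously unsuitable files before uploading. The sketch below checks only the extension and the size limit listed above. The extension-to-format mapping and the reading of "10 MB" as 10 MiB are assumptions, and the 1-hour duration limit is not checked because that would require decoding the audio.

```python
from pathlib import Path

# Limits and formats taken from the list above. Mapping formats to file
# extensions is an assumption, as is treating "10 MB" as 10 MiB.
MAX_BYTES = 10 * 1024 * 1024
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm", ".mp4"}


def preflight_errors(filename: str, size_bytes: int) -> list[str]:
    """Return human-readable problems; an empty list means the file looks uploadable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems
```

The server remains the authority on what it accepts; a check like this only saves a round trip for the common failure cases.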
ISO 639-1 language code (e.g. en, hi, ar, fr). Omit this field or pass auto to enable language auto-detection across the 39 supported languages. When a valid code is specified, language identification is skipped entirely, which gives the lowest latency.
Enable speaker diarization. When true, each segment of the response includes a speakers array identifying distinct speakers (SPEAKER_00, SPEAKER_01, …). Adds ~50–150 ms of processing latency per request.
Free-form paragraph describing the session: domain, speakers, jargon, and proper nouns you want preserved. When supplied, the server runs a background LLM refinement pass and the response text is polished for proper nouns, filler removal, and punctuation. Omit to skip refinement.
Example: "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan. Discussing batting technique, stamina, running between wickets, off-side balls."
On the Free plan, context is silently stripped server-side and the response includes warning_codes: ["llm_refinement_not_in_plan"]. The transcript is produced without refinement. Upgrade your plan to enable refinement.
The REST POST /stt form takes context as a plain string. The WebSocket /ws/stt endpoint takes a structured {general, text, terms} object instead; see the WebSocket STT reference.
Custom vocabulary boost. CSV with an optional :weight per term, e.g. "Acme:5,XYZ Pharma:8,off-side". Default weight 1.5, max 10. Used to bias the recognizer toward words that sound similar but are spelled differently (brand names, jargon). Up to 30 entries are surfaced to the LLM hint; all entries participate in fuzzy / phonetic matching. Words replaced by the boost appear in the response with boosted: true and original.
Constrain language-ID candidates. CSV of ISO 639-1 codes, e.g. "en,hi". Narrower lists are faster. When omitted, the full supported set is used.
Diarization tuning. Lower bound on detected speaker count. Read only when diarize=true. Values <= 0 are clamped to null.
Diarization tuning. Upper bound on detected speaker count. Read only when diarize=true.
"none" | "word". Set to "word" to add start / end to each entry in words and segments[].words.
When true, adds a per-word confidence (0-1) to the response.
Devanagari / Latin script normalization for code-mixed audio.
LID flapping detector tuning: how aggressively to split on language change. The default is safe; expose only to advanced users.
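As a rough illustration of the form fields described above, the sketch below builds the vocabulary-boost CSV and collects the text fields whose names appear in this page (language, diarize, context, include_confidence). The helper names are hypothetical. Serializing booleans as the string "true" and rejecting terms that contain "," or ":" are assumptions, since no escaping rule is documented here.

```python
from typing import Optional

MAX_WEIGHT = 10.0  # documented maximum weight per term


def format_keywords(terms: dict[str, Optional[float]]) -> str:
    """Build the boost CSV, e.g. "Acme:5,XYZ Pharma:8,off-side".

    A weight of None leaves the term bare so the server default (1.5) applies.
    """
    parts = []
    for term, weight in terms.items():
        if "," in term or ":" in term:
            # Assumption: no escaping rule is documented, so reject rather than guess.
            raise ValueError(f"term contains a reserved character: {term!r}")
        if weight is None:
            parts.append(term)
            continue
        if weight > MAX_WEIGHT:
            raise ValueError(f"weight for {term!r} exceeds {MAX_WEIGHT}")
        parts.append(f"{term}:{weight:g}")
    return ",".join(parts)


def build_form_fields(language: Optional[str] = None,
                      diarize: bool = False,
                      context: Optional[str] = None,
                      include_confidence: bool = False) -> dict[str, str]:
    """Collect multipart text fields, omitting anything left at its default."""
    fields: dict[str, str] = {}
    if language and language != "auto":
        fields["language"] = language  # omitting the field enables auto-detection
    if diarize:
        fields["diarize"] = "true"  # assumption: booleans are sent as "true"
    if context:
        fields["context"] = context
    if include_confidence:
        fields["include_confidence"] = "true"
    return fields
```

The resulting dictionary can be passed as the text part of a multipart/form-data POST alongside the audio file. The name of the file field, the field name for the keyword CSV, and the base URL are not stated in this section, so they are left to the caller.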
Response
The endpoint passes through the response shape from the 60db STT backend. Key fields:
Unique request identifier
Full normalized transcript (digit, entity, and bidi normalization applied)
Detected or specified ISO 639-1 language code (e.g. "en"). null when no speech was detected.
Full English language name (e.g. "English")
How the language was resolved: "fast_path" (caller specified a single language), "lid_per_segment" (auto-detected per segment), "long_audio_chunked" (file > 90 s ran through the chunker; debug-only), or "mixed".
Audio signal-to-noise ratio in decibels. Useful as an audio-quality indicator: >= 15 good, 0–15 fair, < 0 poor. When the recording is too noisy, the audio is dropped before LID/ASR: the response has empty text and warning_codes includes low_snr_dropped (no charge).
Audio duration in seconds
Server processing time in milliseconds
Real-time factor (processing_ms / (duration_sec × 1000))
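The SNR bands and the real-time factor formula above can be sketched as two small helpers. The function names are hypothetical, and the values are passed in directly because the JSON keys for SNR and RTF are not shown in this section.

```python
def snr_quality(snr_db: float) -> str:
    """Bucket an SNR value using the documented bands: >= 15 good, 0-15 fair, < 0 poor."""
    if snr_db >= 15:
        return "good"
    if snr_db >= 0:
        return "fair"
    return "poor"


def real_time_factor(processing_ms: float, duration_sec: float) -> float:
    """processing_ms / (duration_sec * 1000); below 1.0 means faster than real time."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return processing_ms / (duration_sec * 1000)


# Example: 1.8 s of server processing for a 60 s clip.
rtf = real_time_factor(processing_ms=1800, duration_sec=60)
```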
Array of utterance-level segments. Each segment has {start, end, language, language_name, text, confidence, words[]}. When diarize=true, segments also include a speakers array. When the request ran through the long-audio chunker (language_source == "long_audio_chunked"), each segment also includes a debug-only chunk_idx integer (zero-based index of the chunk this segment came from).
Flat word-level list across all segments. Each word has {word, start, end, confidence?, boosted?, original?}. confidence is included when include_confidence=true. boosted: true and original are present when the keyword/context-terms boost replaced this word; the segment-level text is already rebuilt from boosted words upstream, so no client-side stitching is required.
Non-fatal warnings. Each item has {code, message, affected_segments}. Common codes include no_speech_detected, inline_code_switch_partial, low_snr_dropped, llm_refinement_not_in_plan.
Flat list of the code values from warnings, for quick checks.

| Code | Meaning |
|---|---|
| no_speech_detected | Audio processed but contained no speech (silence / music). Not an error. |
| low_snr_dropped | Audio was dropped before LID/ASR because SNR was below the floor. Response text is empty. No credits charged for this request. |
| llm_refinement_not_in_plan | The context field was provided but the active plan does not include LLM refinement. The field was ignored; transcription proceeded without refinement. |
| inline_code_switch_partial | Some words inside a segment were transcribed in a different language than the segment label. |
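One way a client might act on these codes is sketched below. It reads only the text and warning_codes fields described above. The function name, the status labels, and the user-facing messages are illustrative assumptions, not part of the API.

```python
def classify_result(response: dict) -> tuple[str, list[str]]:
    """Map a successful response to (status, notes) using warning_codes."""
    codes = set(response.get("warning_codes") or [])
    if "low_snr_dropped" in codes:
        # Audio was dropped before LID/ASR; text is empty and nothing is charged.
        return "too_noisy", ["Audio too noisy. Try recording in a quieter environment."]
    if "no_speech_detected" in codes:
        # Processed fine but contained silence or music; retrying will not help.
        return "no_speech", ["No speech was detected in this recording."]
    notes = []
    if "llm_refinement_not_in_plan" in codes:
        notes.append("context was ignored: the active plan has no LLM refinement")
    if "inline_code_switch_partial" in codes:
        notes.append("some words differ in language from their segment label")
    return "ok", notes
```

Checking the two empty-transcript codes first keeps a blank text value from being shown to users as if it were a real result.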
Internal language detection metadata: {mode, candidates[], segment_count, lid_calls}
Errors
| Status | error_code | When | Retry guidance |
|---|---|---|---|
| 401 | UNAUTHENTICATED | Auth missing or invalid | Don’t retry without credentials |
| 402 | INSUFFICIENT_CREDITS / ZERO_BALANCE | Workspace wallet has insufficient credits or a zero balance | Top up the wallet |
| 429 | STT_CONCURRENCY_LIMIT | Per-user concurrency cap reached (counted across REST + WS combined). details.limit carries the active cap. | Retry after an in-flight request finishes; do not auto-retry without backoff |
| 429 | STT_UPSTREAM_RATE_LIMIT | Upstream STT service is rate-limiting | Honor the Retry-After HTTP header (seconds) rather than the JSON body |
| 499 | STT_CLIENT_CANCELLED | Client closed the connection before the response was returned. The upstream call was aborted. | Intentional; no retry. No charge. |
| 503 | STT_UPSTREAM_UNAVAILABLE | Upstream STT service returned a 5xx | Retry with exponential backoff (1s → 2s → 4s …) |
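The retry guidance in the table can be condensed into a single decision function, sketched here under a few assumptions: the function name and the zero-based attempt counter are hypothetical, and the exponential fallback for the two 429 cases is a choice made for this example (the table asks only for "backoff" and for honoring Retry-After).

```python
from typing import Optional


def retry_delay(status: int, error_code: str, attempt: int,
                retry_after: Optional[str] = None) -> Optional[float]:
    """Return seconds to wait before retrying, or None if the request should not be retried.

    attempt is zero-based; retry_after is the raw Retry-After header value, if any.
    """
    backoff = float(2 ** attempt)  # 1s, 2s, 4s, ...
    if status == 503 and error_code == "STT_UPSTREAM_UNAVAILABLE":
        return backoff
    if status == 429 and error_code == "STT_UPSTREAM_RATE_LIMIT":
        try:
            return float(retry_after)  # header value in seconds takes priority
        except (TypeError, ValueError):
            return backoff  # assumption: fall back when the header is missing or unparsable
    if status == 429 and error_code == "STT_CONCURRENCY_LIMIT":
        return backoff  # assumption: wait for in-flight requests to finish
    return None  # 401, 402, 499 and anything unrecognized: do not retry
```

A caller would normally also cap the number of attempts so that a persistent outage does not retry forever.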
499 is non-standard but is used to signal “client closed connection before response was sent” (nginx convention). Treat it as expected when the user cancels: wire AbortController.abort() to your cancel button.
Notes
- Auto-detect: The most reliable way to enable auto-detect is to omit the language field entirely. Passing "auto" is also accepted and treated identically.
- Speaker labels: When diarize=true, raw speaker IDs look like SPEAKER_00, SPEAKER_01. Client UIs typically re-label these as “Speaker 1”, “Speaker 2” in order of first appearance for readability.
- Empty transcript: A successful response with text: "" and warning_codes: ["no_speech_detected"] means the upload was processed but contained no speech (silence, noise, or music only). This is not an error; do not retry.
- Low-SNR drop: A response with text: "" and warning_codes: ["low_snr_dropped"] means the audio was rejected upstream before LID/ASR because it was too noisy. No credits are charged. Surface “Audio too noisy. Try recording in a quieter environment” rather than treating it as a real result.
- Diarization surcharge: When diarize=true is passed (or speakers are detected in the response), the request incurs a +30% surcharge on top of the base STT rate to cover GPU diarization cost.
- Refunds: Empty transcripts caused by low_snr_dropped, client-aborted requests (499), and upstream errors (503) are not charged.
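The speaker re-labeling mentioned in the notes can be sketched as follows. It assumes each segment carries a numeric start and a speakers list of raw ID strings such as "SPEAKER_00". The exact element shape of speakers is not specified in this page, and the function name is hypothetical.

```python
def relabel_speakers(segments: list[dict]) -> dict[str, str]:
    """Map raw speaker IDs to "Speaker N", numbered in order of first appearance."""
    labels: dict[str, str] = {}
    # Sort by start time so numbering follows the audio, not the array order.
    for segment in sorted(segments, key=lambda s: s.get("start", 0.0)):
        for raw_id in segment.get("speakers") or []:
            if raw_id not in labels:
                labels[raw_id] = f"Speaker {len(labels) + 1}"
    return labels


segments = [
    {"start": 0.0, "speakers": ["SPEAKER_01"]},
    {"start": 2.5, "speakers": ["SPEAKER_00", "SPEAKER_01"]},
]
labels = relabel_speakers(segments)
```

Here SPEAKER_01 speaks first, so it becomes "Speaker 1" even though its raw ID is numerically higher.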