Transcribe audio to text with auto language detection and optional speaker diarization
Documentation Index
Fetch the complete documentation index at: https://docs.60db.ai/llms.txt
Use this file to discover all available pages before exploring further.
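A minimal sketch of discovering pages from the index, assuming the file follows the common llms.txt convention of markdown link lists; `fetch_llms_index` and `list_doc_pages` are illustrative helper names, not part of the API:

```python
import re
import urllib.request

def fetch_llms_index(url: str = "https://docs.60db.ai/llms.txt") -> str:
    """Download the documentation index (llms.txt is plain text/markdown)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def list_doc_pages(index_text: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from markdown links like '- [Title](https://...)'."""
    return re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)", index_text)
```

`list_doc_pages` is a pure function, so you can preview which pages exist before fetching any of them.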
**Request parameters**

- **Language.** Pass a single language code (e.g. `en`, `hi`, `ar`, `fr`) in the `language` field; when specified and valid, language identification is skipped entirely for lowest latency. Omit the field, or pass `auto`, to enable auto-detection across the 39 supported languages.
- **Diarization.** When `diarize=true`, each segment of the response includes a `speakers` array identifying distinct speakers (`SPEAKER_00`, `SPEAKER_01`, …). Adds ~50–150 ms of processing latency per request.
- **Context.** A free-text description of the recording, used for LLM refinement. Example: "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan. Discussing batting technique, stamina, running between wickets, off-side balls." If the active plan does not include LLM refinement, `context` is silently stripped server-side and the response includes `warning_codes: ["llm_refinement_not_in_plan"]`; the transcript is produced without refinement. Upgrade the plan to enable the gate. The `POST /stt` form takes `context` as a plain string; the WebSocket `/ws/stt` endpoint takes a structured `{general, text, terms}` object instead (see the WebSocket STT reference).
- **Keyword boost.** A comma-separated list with an optional `:weight` per term, e.g. `"Acme:5,XYZ Pharma:8,off-side"`. Default weight 1.5, maximum 10. Used to bias the recognizer toward acoustically similar but differently spelled words (brand names, jargon). Up to 30 entries are surfaced to the LLM hint; all entries participate in fuzzy / phonetic matching. Words replaced by the boost appear in the response with `boosted: true` and `original`.
- **Candidate languages.** A comma-separated list, e.g. `"en,hi"`. Narrower lists are faster; when omitted, the full supported set is used.
- **Speaker-count hints.** Apply only when `diarize=true`; values <= 0 are clamped to `null`.
- **Timestamps.** `"none" | "word"`. Set to `"word"` to add `start` / `end` to each entry in `words` and `segments[].words`.
- **Per-word confidence.** When `include_confidence=true`, a per-word `confidence` (0–1) is added to the response.

**Response fields**

- `language`: the detected or specified language code (e.g. `"en"`); `null` when no speech was detected.
- `language_name`: the human-readable language name (e.g. `"English"`).
- `language_source`: one of `"fast_path"` (caller specified a single language), `"lid_per_segment"` (auto-detected per segment), `"long_audio_chunked"` (file > 90 s ran through the chunker; debug-only), or `"mixed"`.
- Signal-to-noise ratio (dB): `>= 15` good, `0–15` fair, `< 0` poor. When the recording is too noisy, the audio is dropped before LID/ASR: the response has empty `text` and `warning_codes` includes `low_snr_dropped` (no charge).
- `segments`: array of `{start, end, language, language_name, text, confidence, words[]}`. When `diarize=true`, segments also include a `speakers` array. When the request ran through the long-audio chunker (`language_source == "long_audio_chunked"`), each segment also includes a debug-only `chunk_idx` integer (the zero-based index of the chunk this segment came from).
- `words`: entries of `{word, start, end, confidence?, boosted?, original?}`. `confidence` is included when `include_confidence=true`. `boosted: true` and `original` are present when the keyword/context-terms boost replaced this word; the segment-level `text` is already rebuilt from boosted words upstream, so no client-side stitching is required.
- `warnings`: array of `{code, message, affected_segments}`. Common codes include `no_speech_detected`, `inline_code_switch_partial`, `low_snr_dropped`, `llm_refinement_not_in_plan`.
- `warning_codes`: the `code` values from `warnings`, for quick checks.

| Code | Meaning |
|---|---|
| `no_speech_detected` | Audio processed but contained no speech (silence / music). Not an error. |
| `low_snr_dropped` | Audio was dropped before LID/ASR because SNR was below the floor. Response `text` is empty. No credits charged for this request. |
| `llm_refinement_not_in_plan` | The `context` field was provided but the active plan does not include LLM refinement. The field was ignored; transcription proceeded without refinement. |
| `inline_code_switch_partial` | Some words inside a segment were transcribed in a different language than the segment label. |
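The warning codes above can be mapped to user-facing outcomes client-side. A minimal sketch; `classify_result` is an illustrative helper, not part of the API, and the field names follow the response schema described above:

```python
def classify_result(response: dict) -> str:
    """Turn an STT response into display text based on warning_codes."""
    codes = set(response.get("warning_codes", []))
    if "low_snr_dropped" in codes:
        # Dropped before LID/ASR; no charge. Not a real transcription result.
        return "Audio too noisy — try recording in a quieter environment"
    if "no_speech_detected" in codes:
        # Silence or music only; not an error, do not retry.
        return "No speech detected."
    # llm_refinement_not_in_plan and inline_code_switch_partial are
    # informational: the transcript is still usable as-is.
    return response.get("text", "")
```

Note that both "empty" outcomes carry `text: ""`, so branching on `warning_codes` rather than on `text` is what distinguishes them.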
LID debug info has the shape `{mode, candidates[], segment_count, lid_calls}`.

**Errors**

| Status | error_code | When | Retry guidance |
|---|---|---|---|
| 401 | UNAUTHENTICATED | Auth missing or invalid | Don’t retry without credentials |
| 402 | INSUFFICIENT_CREDITS / ZERO_BALANCE | Workspace wallet balance is too low or zero | Top up |
| 429 | STT_CONCURRENCY_LIMIT | Per-user concurrency cap reached (counted across REST + WS combined). details.limit carries the active cap. | Retry after an in-flight request finishes; do not auto-retry without backoff |
| 429 | STT_UPSTREAM_RATE_LIMIT | Upstream STT service is rate-limiting | Honor the Retry-After HTTP header (seconds) rather than the JSON body |
| 499 | STT_CLIENT_CANCELLED | Client closed the connection before the response was returned. The upstream call was aborted. | Intentional; no retry. No charge. |
| 503 | STT_UPSTREAM_UNAVAILABLE | Upstream STT service returned a 5xx | Retry with exponential backoff (1s → 2s → 4s …) |
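The retry guidance in the table above can be encoded in one small policy function. A sketch; `retry_delay` is an illustrative helper name, and `retry_after` is assumed to be the already-parsed `Retry-After` header value in seconds:

```python
from typing import Optional

def retry_delay(status: int, error_code: str, attempt: int,
                retry_after: Optional[float] = None) -> Optional[float]:
    """Seconds to wait before retry attempt N (0-based), or None for 'do not retry'."""
    if status == 503:
        # Upstream 5xx: exponential backoff 1s -> 2s -> 4s ...
        return float(2 ** attempt)
    if status == 429 and error_code == "STT_UPSTREAM_RATE_LIMIT":
        # Honor the Retry-After HTTP header, not the JSON body.
        return retry_after
    if status == 429 and error_code == "STT_CONCURRENCY_LIMIT":
        # Wait for an in-flight request to finish; never hot-loop.
        return float(2 ** attempt)
    # 401 (fix credentials), 402 (top up), 499 (intentional cancel): no auto-retry.
    return None
```

Returning `None` rather than raising keeps the caller's retry loop trivial: `while (d := retry_delay(...)) is not None: sleep(d)`.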
**Notes**

- 499 is non-standard but is used to signal "client closed connection before response was sent" (nginx convention). Treat it as expected when the user cancels; wire `AbortController.abort()` to your cancel button.
- To enable auto-detection, omit the `language` field entirely. Passing `"auto"` is also accepted and treated identically.
- When `diarize=true`, raw speaker IDs look like `SPEAKER_00`, `SPEAKER_01`. Client UIs typically re-label these as "Speaker 1", "Speaker 2" in order of first appearance, for readability.
- `text: ""` with `warning_codes: ["no_speech_detected"]` means the upload was processed but contained no speech (silence, noise, or music only). This is not an error; do not retry.
- `text: ""` with `warning_codes: ["low_snr_dropped"]` means the audio was rejected upstream before LID/ASR because it was too noisy. No credits are charged. Surface "Audio too noisy — try recording in a quieter environment" rather than treating it as a real result.
- **Billing.** When `diarize=true` is passed (or speakers are detected in the response), the request incurs a +30% surcharge on top of the base STT rate to cover GPU diarization cost. Requests ending in `low_snr_dropped`, client-aborted requests (499), and upstream errors (503) are not charged.