Speech-to-Text (STT)
Commands
Transcribe Audio File
Without a --language flag, the language is auto-detected, which is equivalent to --language auto.
Specify Language
Pass an ISO 639-1 code with --language (list the available codes with 60db stt:languages). Supported codes include en, hi, bn, mr, pa, gu, or, as, ne, ta, te, kn, ml, sa, ar, and 25 European languages.
Auto-detect (explicit)
--language auto is treated as “omit” — the CLI strips it before sending so the server runs its language identification across all 39 supported languages. You can also simply omit the --language flag entirely for the same behavior.
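The stripping rule above can be sketched as a small helper. This is an illustration under stated assumptions, not the CLI's actual source: the function name `request_fields` and the form field name `language` are hypothetical, and lowercasing the code is a convenience added here.

```python
from typing import Dict, Optional


def request_fields(language: Optional[str] = None) -> Dict[str, str]:
    """Build the language part of a transcription request (hypothetical helper).

    Mirrors the documented rule: "auto" is treated exactly like omitting
    the flag, so no language field is sent and the server auto-detects.
    """
    fields: Dict[str, str] = {}
    if language is None:
        return fields
    code = language.strip().lower()
    if code and code != "auto":
        # Assumption: the request carries the code in a field named "language".
        fields["language"] = code
    return fields
```

With this shape, `request_fields("auto")` and `request_fields()` produce the same empty mapping, which is the equivalence the paragraph describes.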
Enable Speaker Diarization
When diarization is enabled, the response includes a speakers array with labels like SPEAKER_00, SPEAKER_01. Diarization adds ~50–150 ms of processing latency.
Add Context (optional refinement)
When --context is supplied, the server runs a background LLM refinement pass and the returned text is polished for proper nouns, filler removal, and punctuation. Omit --context to skip refinement.
--context takes a plain string, matching the REST /stt endpoint shape. The WebSocket /ws/stt endpoint accepts a structured {general, text, terms} object instead; see the WebSocket STT reference.
List Available Languages
Lists the supported language codes, including the auto entry, sourced from GET /stt/languages.
Options
- -f, --file <path>: Audio file path (required; max 25 MB; WAV / MP3 / M4A / OGG / FLAC / WebM)
- -l, --language <code>: ISO 639-1 language code (e.g. en, hi, ar). Omit or pass auto for auto-detection across the 39 supported languages.
- --diarize <boolean>: Enable speaker diarization (default: false)
- --context <text>: Free-form paragraph describing the session (domain, speakers, jargon). Enables server-side LLM refinement of the response text.
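A script that wraps the CLI could check the documented file limits before invoking it. The sketch below is one possible approach, not part of the tool: `build_transcribe_argv` is a hypothetical helper, the executable name `60db` and the literal value `true` for --diarize are inferred from this page, "25 MB" is read as 25 MiB, and the file extension is assumed to reflect the actual audio format.

```python
import os
from typing import List, Optional

MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm"}


def build_transcribe_argv(
    path: str,
    language: Optional[str] = None,
    diarize: bool = False,
    context: Optional[str] = None,
    size_bytes: Optional[int] = None,
) -> List[str]:
    """Validate inputs and build an argv list for stt:transcribe."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    # Read the size from disk unless the caller already knows it.
    size = os.path.getsize(path) if size_bytes is None else size_bytes
    if size > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB limit")

    argv = ["60db", "stt:transcribe", "--file", path]
    if language and language.lower() != "auto":
        # Omitting the flag and passing "auto" are documented as equivalent.
        argv += ["--language", language]
    if diarize:
        argv += ["--diarize", "true"]
    if context:
        argv += ["--context", context]
    return argv
```

The returned list can be handed to `subprocess.run(argv)`; building a list rather than a shell string avoids quoting problems when the context paragraph contains spaces or punctuation.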
Examples
STT — Transcribe Audio
Notes
- Do not pass unsupported language codes. The server explicitly rejects ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he (and Arabic dialect tags like ar-eg) with an unsupported_language error. Use auto or omit the flag for these.
- Auto-detect on REST vs WebSocket. The REST (stt:transcribe) form flow accepts --language auto as a convenience. The streaming WebSocket form requires languages: null, not the literal string "auto". If you write your own WebSocket client, don't forward the string "auto"; send null.
- A successful response with empty text and warning_codes: ["no_speech_detected"] means the audio contained no speech (silence / music / noise). This is not an error, so do not retry.
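The three notes above translate into simple client-side checks. The following is a minimal sketch with hypothetical helper names. The rejected-code list is copied from the note. Treating every `ar-` prefixed tag as rejected is an assumption generalized from the `ar-eg` example. Passing explicit codes through unchanged for the WebSocket field is also an assumption, so check the WebSocket STT reference for the exact shape.

```python
from typing import Any, Dict, Optional

# Codes the notes list as explicitly rejected by the server.
REJECTED_CODES = {"ur", "ja", "ko", "zh", "th", "vi", "id", "tl", "sw", "tr", "fa", "he"}


def is_rejected(language: str) -> bool:
    """True if the code would trigger unsupported_language."""
    code = language.strip().lower()
    # Plain "ar" is supported; dialect tags such as "ar-eg" are not.
    return code in REJECTED_CODES or code.startswith("ar-")


def websocket_languages(language: Optional[str]) -> Optional[str]:
    """Map a CLI-style language value to the WebSocket languages field."""
    if language is None or language.strip().lower() == "auto":
        return None  # serialized as JSON null, never the string "auto"
    # Assumption: explicit codes pass through as-is.
    return language.strip().lower()


def is_no_speech(response: Dict[str, Any]) -> bool:
    """True when a successful response reports no detected speech."""
    warnings = response.get("warning_codes") or []
    return not response.get("text") and "no_speech_detected" in warnings
```

A caller would typically skip the request when `is_rejected` returns true and fall back to auto-detection, and would treat `is_no_speech` as a normal empty result rather than scheduling a retry.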