Speech-to-Text (STT)

Commands

Transcribe Audio File

60db stt:transcribe --file audio.wav
By default the language is auto-detected; this is equivalent to passing --language auto.

Specify Language

60db stt:transcribe --file audio.wav --language hi
Pass an ISO 639-1 code from the supported set (see 60db stt:languages). Supported codes include en, hi, bn, mr, pa, gu, or, as, ne, ta, te, kn, ml, sa, ar, and 25 European languages.

Auto-detect (explicit)

60db stt:transcribe --file meeting.wav --language auto
--language auto is treated as “omit”: the CLI strips the flag before sending, so the server runs its language identification across all 39 supported languages. Omitting the --language flag entirely has the same effect.
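The strip-before-send behavior can be sketched as follows. This is a minimal illustration only; the helper name and the parameter-dict shape are assumptions, not the CLI's actual internals.

```python
def build_stt_params(file_path, language=None):
    """Assemble request parameters, dropping the language field when it is 'auto'.

    Passing language='auto' (or omitting it) leaves the field out entirely,
    so the server runs its own language identification.
    """
    params = {"file": file_path}
    if language is not None and language != "auto":
        params["language"] = language
    return params

# 'auto' and omission produce the same request
assert build_stt_params("meeting.wav", "auto") == build_stt_params("meeting.wav")
# An explicit code is forwarded as-is
assert build_stt_params("audio.wav", "hi") == {"file": "audio.wav", "language": "hi"}
```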

Enable Speaker Diarization

60db stt:transcribe --file meeting.wav --diarize true
Each response segment will include a speakers array with labels like SPEAKER_00, SPEAKER_01. Adds ~50–150 ms of processing latency.
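A diarized response can be post-processed like this. The exact segment shape shown below is an assumption inferred from the speakers array described above, not a documented schema.

```python
def lines_by_speaker(segments):
    """Group transcribed text under each speaker label (e.g. SPEAKER_00)."""
    grouped = {}
    for seg in segments:
        for spk in seg.get("speakers", []):
            grouped.setdefault(spk, []).append(seg["text"])
    return grouped

# Hypothetical two-speaker response
segments = [
    {"text": "Let's review the batting drills.", "speakers": ["SPEAKER_00"]},
    {"text": "Agreed, start with footwork.", "speakers": ["SPEAKER_01"]},
]
grouped = lines_by_speaker(segments)
assert set(grouped) == {"SPEAKER_00", "SPEAKER_01"}
```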

Add Context (optional refinement)

60db stt:transcribe --file visit.wav \
  --context "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. Discussing batting technique."
When --context is supplied, the server runs a background LLM refinement pass and the returned text is polished for proper nouns, filler removal, and punctuation. Omit --context to skip refinement.
--context takes a plain string — the REST /stt endpoint shape. The WebSocket /ws/stt endpoint accepts a structured {general, text, terms} object instead; see the WebSocket STT reference.
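The two context shapes side by side. How a plain string splits into the structured fields below is illustrative only; consult the WebSocket STT reference for the authoritative schema.

```python
# REST /stt: context is a plain string
rest_context = (
    "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. "
    "Discussing batting technique."
)

# WebSocket /ws/stt: context is a structured object.
# The per-field semantics here are assumptions, not documented behavior.
ws_context = {
    "general": "Cricket coaching session.",    # domain / setting
    "text": "Discussing batting technique.",   # expected content
    "terms": ["Arjun Mehta", "Ishaan Verma"],  # proper nouns / jargon
}

assert isinstance(rest_context, str)
assert set(ws_context) == {"general", "text", "terms"}
```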

List Available Languages

60db stt:languages
Returns the 39-language catalog plus the auto entry, sourced from GET /stt/languages.
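A client can preflight a language code against this catalog before calling stt:transcribe. The catalog shape below (a set of codes) is an assumption; the actual GET /stt/languages payload may differ.

```python
def is_supported(code, catalog):
    """True if `code` is in the /stt/languages catalog or is 'auto'."""
    return code == "auto" or code in catalog

# Hypothetical subset of the 39-language catalog
catalog = {"en", "hi", "bn", "ta", "ar"}
assert is_supported("hi", catalog)
assert is_supported("auto", catalog)
assert not is_supported("ja", catalog)  # explicitly rejected; see Notes
```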

Options

  • -f, --file <path> — Audio file path (required; max 25 MB; WAV / MP3 / M4A / OGG / FLAC / WebM)
  • -l, --language <code> — ISO 639-1 language code (e.g. en, hi, ar). Omit or pass auto for auto-detection across the 39 supported languages.
  • --diarize <boolean> — Enable speaker diarization (default: false)
  • --context <text> — Free-form paragraph describing the session (domain, speakers, jargon). Enables server-side LLM refinement of the response text.
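The documented file constraints can be checked client-side before upload. A minimal sketch; the helper name and error messages are illustrative, and binary megabytes (25 × 1024 × 1024 bytes) are assumed for the 25 MB limit.

```python
import os

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB limit on --file (binary MB assumed)

def validate_audio_file(path, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes (max {MAX_BYTES})")
    return problems

assert validate_audio_file("meeting.wav", 1_000_000) == []
assert validate_audio_file("clip.aiff", 1_000_000) == ["unsupported format: .aiff"]
```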

Examples

STT — Transcribe Audio

# Auto-detect language (simplest)
60db stt:transcribe --file meeting.wav

# Auto-detect, explicit form
60db stt:transcribe --file meeting.wav --language auto

# Force Hindi — skips language identification for lowest latency
60db stt:transcribe --file recording.wav --language hi

# Multi-speaker meeting with diarization + auto-detect
60db stt:transcribe --file board-call.wav --diarize true

# List supported languages
60db stt:languages

Notes

  • Do not pass unsupported language codes. The server explicitly rejects ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he (and Arabic dialect tags like ar-eg) with an unsupported_language error. Use auto or omit the flag for these.
  • Auto-detect on REST vs WebSocket. The REST (stt:transcribe) flow accepts --language auto as a convenience. The streaming WebSocket form requires languages: null, not the literal string "auto". If you write your own WebSocket client, do not forward the string "auto"; send null.
  • A successful response with empty text and warning_codes: ["no_speech_detected"] means the audio contained no speech (silence / music / noise). This is not an error — do not retry.
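For a hand-rolled WebSocket client, the auto-detect note above amounts to mapping the CLI's "auto" to a JSON null. A sketch of building the initial message; every field other than languages (including the "type" field) is an assumption.

```python
import json

def ws_stt_config(language=None):
    """Build the initial config message; 'auto' must become JSON null."""
    lang = None if language in (None, "auto") else language
    return json.dumps({"type": "config", "languages": lang})  # "type" field assumed

# "auto" is never forwarded literally
assert '"languages": null' in ws_stt_config("auto")
assert '"languages": "hi"' in ws_stt_config("hi")
```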