Speech-to-Text (STT)

Commands

Transcribe Audio File

60db stt:transcribe --file audio.wav

By default, the language is auto-detected, which is equivalent to passing --language auto.

Specify Language

60db stt:transcribe --file audio.wav --language hi

Pass an ISO 639-1 code from the supported set (see 60db stt:languages). Supported codes include en, hi, bn, mr, pa, gu, or, as, ne, ta, te, kn, ml, sa, ar, and 25 European languages.

Auto-detect (explicit)

60db stt:transcribe --file meeting.wav --language auto

--language auto is treated as if the flag were omitted: the CLI strips it before sending the request, so the server runs language identification across all 39 supported languages. Omitting the --language flag entirely has the same effect.
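The stripping behavior can be sketched in Python. This is a minimal illustration rather than the CLI's actual source: the function name build_form_fields and the field names "file" and "language" are assumptions made for the example.

```python
from typing import Dict, Optional


def build_form_fields(file_path: str, language: Optional[str] = None) -> Dict[str, str]:
    """Build request fields, dropping the language when auto-detect is requested.

    The field names ("file", "language") are hypothetical, chosen for illustration.
    """
    fields = {"file": file_path}
    # "auto" and a missing flag are equivalent: neither sends a language field,
    # which leaves language identification to the server.
    if language is not None and language != "auto":
        fields["language"] = language
    return fields


print(build_form_fields("meeting.wav", "auto"))  # no language field is sent
print(build_form_fields("recording.wav", "hi"))  # language field is set to "hi"
```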

Enable Speaker Diarization

60db stt:transcribe --file meeting.wav --diarize true

Each response segment includes a speakers array with labels such as SPEAKER_00 and SPEAKER_01. Diarization adds roughly 50–150 ms of processing latency.
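A diarized response might be post-processed along these lines. The sketch assumes each segment is a JSON object with a "text" field next to the documented speakers array. The "text" field name and the sample data are assumptions, so check them against a real response.

```python
from collections import defaultdict
from typing import Dict, List


def text_by_speaker(segments: List[dict]) -> Dict[str, List[str]]:
    """Group segment text under each speaker label, preserving segment order."""
    grouped = defaultdict(list)
    for segment in segments:
        # A segment without a speakers array is skipped.
        for label in segment.get("speakers", []):
            grouped[label].append(segment.get("text", ""))
    return dict(grouped)


# Hypothetical segments, shaped as assumed above.
sample_segments = [
    {"text": "Shall we begin?", "speakers": ["SPEAKER_00"]},
    {"text": "Yes, go ahead.", "speakers": ["SPEAKER_01"]},
    {"text": "First item.", "speakers": ["SPEAKER_00"]},
]

for speaker, lines in text_by_speaker(sample_segments).items():
    print(speaker, lines)
```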

Add Context (optional refinement)

60db stt:transcribe --file visit.wav \
  --context "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. Discussing batting technique."

When --context is supplied, the server runs a background LLM refinement pass and the returned text is polished for proper nouns, filler removal, and punctuation. Omit --context to skip refinement.

--context takes a plain string, matching the shape expected by the REST /stt endpoint. The WebSocket /ws/stt endpoint accepts a structured {general, text, terms} object instead; see the WebSocket STT reference.
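The difference between the two context shapes can be illustrated as follows. Only the key names general, text, and terms come from this page. The value types (two strings and a list of strings), the helper names, and the sample terms are assumptions, so treat the WebSocket STT reference as the authority on the real schema.

```python
import json
from typing import List, Optional


def rest_context(description: str) -> str:
    """REST /stt: the context is a single free-form string."""
    return description.strip()


def ws_context(general: str = "", text: str = "",
               terms: Optional[List[str]] = None) -> dict:
    """WebSocket /ws/stt: the context is a structured object.

    Value types here are assumed for illustration only.
    """
    return {"general": general, "text": text, "terms": list(terms or [])}


print(rest_context("Cricket coaching session. Discussing batting technique."))
print(json.dumps(ws_context(
    general="Cricket coaching session",
    text="Discussing batting technique.",
    terms=["Arjun Mehta", "Ishaan Verma"],
)))
```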

List Available Languages

60db stt:languages

Returns the 39-language catalog plus the auto entry, sourced from GET /stt/languages.

Options

-f, --file <path> — Audio file path (required; max 25 MB; WAV / MP3 / M4A / OGG / FLAC / WebM)
-l, --language <code> — ISO 639-1 language code (e.g. en, hi, ar). Omit or pass auto for auto-detection across the 39 supported languages.
--diarize <boolean> — Enable speaker diarization (default: false)
--context <text> — Free-form paragraph describing the session (domain, speakers, jargon). Enables server-side LLM refinement of the response text.
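A pre-flight check for the --file constraints might look like the sketch below. It makes two assumptions: that "25 MB" means 25 × 1024 × 1024 bytes, and that each listed format maps to the file extension shown. The server may judge format by content rather than by name, so this is only a first filter.

```python
import os
from typing import List

# Assumption: "25 MB" is read as 25 MiB; the limit's exact byte count is not documented here.
MAX_BYTES = 25 * 1024 * 1024
# Assumption: one conventional extension per listed format.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm"}


def check_audio_file(path: str) -> List[str]:
    """Return a list of problems found; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file not found")
    elif os.path.getsize(path) > MAX_BYTES:
        problems.append("file exceeds 25 MB")
    return problems


print(check_audio_file("audio.wav"))
```

Running the check before invoking 60db stt:transcribe avoids uploading a file that is likely to be refused.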

Examples

STT — Transcribe Audio

# Auto-detect language (simplest)
60db stt:transcribe --file meeting.wav

# Auto-detect, explicit form
60db stt:transcribe --file meeting.wav --language auto

# Force Hindi — skips language identification for lowest latency
60db stt:transcribe --file recording.wav --language hi

# Multi-speaker meeting with diarization + auto-detect
60db stt:transcribe --file board-call.wav --diarize true

# List supported languages
60db stt:languages

Notes

Do not pass unsupported language codes. The server explicitly rejects ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he (and Arabic dialect tags like ar-eg) with an unsupported_language error. Use auto or omit the flag for these.
Auto-detect on REST vs WebSocket. The REST flow (stt:transcribe) accepts --language auto as a convenience. The streaming WebSocket endpoint requires languages: null, not the literal string "auto". If you write your own WebSocket client, send null rather than forwarding the string "auto".
A successful response with empty text and warning_codes: ["no_speech_detected"] means the audio contained no speech (silence / music / noise). This is not an error, so do not retry.
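The three notes above can be turned into small client-side helpers, sketched here in Python. The helper names are hypothetical. Treating every "ar-" tag as rejected generalizes from the single ar-eg example, and sending a non-null languages value as a list of codes is an assumption about the WebSocket schema. Only the null case is documented on this page.

```python
import json
from typing import List, Optional

# Codes this page lists as rejected with unsupported_language.
REJECTED_CODES = {"ur", "ja", "ko", "zh", "th", "vi", "id", "tl", "sw", "tr", "fa", "he"}


def is_rejected(code: str) -> bool:
    """True for codes the server is documented to reject."""
    code = code.lower()
    # Assumption: all Arabic dialect tags (ar-*) behave like ar-eg; plain "ar" is supported.
    return code in REJECTED_CODES or code.startswith("ar-")


def ws_languages_value(code: Optional[str]) -> Optional[List[str]]:
    """Map a CLI-style language choice to the WebSocket languages field."""
    if code is None or code == "auto":
        return None  # serialized as JSON null, never as the string "auto"
    return [code]  # assumed shape for an explicit language


def is_no_speech(response: dict) -> bool:
    """True when a successful response reports that no speech was found."""
    return (response.get("text", "") == ""
            and "no_speech_detected" in response.get("warning_codes", []))


print(is_rejected("ja"), is_rejected("ar"))
print(json.dumps({"languages": ws_languages_value("auto")}))
print(is_no_speech({"text": "", "warning_codes": ["no_speech_detected"]}))
```

A caller would fall back to auto-detect when is_rejected returns True, and would skip any retry logic when is_no_speech returns True.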
