Speech-to-Text (STT)
Transcribe audio files to text with optional speaker diarization.
Available Tools
| Tool | Description |
|---|---|
| sixtydb_stt_transcribe | Transcribe audio to text |
| sixtydb_stt_logs | View transcription history |
| sixtydb_stt_get | Get specific transcription details |
Transcribe Audio
Convert audio to text with the sixtydb_stt_transcribe tool.
Auto-detect vs explicit language
- Omit language (or pass "auto") to enable auto-detection across all 39 supported languages. The MCP shim strips "auto" before forwarding so the server’s language identification runs.
- Pass a single ISO 639-1 code (e.g. "hi", "en", "ar") to skip language identification and run the fast path for that language.
- Do not pass unsupported codes (ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he) or Arabic dialect tags (ar-eg, ar-lv, …); they return an unsupported_language error. For non-MSA Arabic audio, pass "ar" for best-effort MSA transcription.
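The rules above can be sketched as a small client-side pre-check. This is only an illustration: the function name normalize_language is hypothetical, the unsupported set is copied from the list above, and the sketch does not verify that a code is one of the 39 supported languages (the server remains the authority on that).

```python
from typing import Optional

# Codes this page lists as unsupported (copied from the text above).
UNSUPPORTED_CODES = {"ur", "ja", "ko", "zh", "th", "vi", "id", "tl", "sw", "tr", "fa", "he"}


def normalize_language(language: Optional[str]) -> Optional[str]:
    """Return the value to send as `language`, or None to omit it (auto-detect).

    Hypothetical helper: it only mirrors the rules stated on this page.
    """
    if language is None:
        return None
    code = language.strip().lower()
    if code in ("", "auto"):
        # Omitting the field lets the server run language identification.
        return None
    if code.startswith("ar-"):
        # Dialect tags such as ar-eg are rejected; the page suggests plain "ar".
        raise ValueError(f"Arabic dialect tag {language!r} is unsupported; pass 'ar'")
    if code in UNSUPPORTED_CODES:
        raise ValueError(f"language {language!r} is unsupported")
    return code
```

Failing early on the client avoids a round trip that would end in an unsupported_language error.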
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_url | string | Required | URL to audio file (downloaded and forwarded as multipart to POST /stt; max 25 MB) |
| language | string | auto | ISO 639-1 code, or "auto" / omit for auto-detect |
| diarize | boolean | false | Enable pyannote speaker diarization; adds a speakers array to each segment with SPEAKER_00, SPEAKER_01, … labels |
| context | string | none | Free-form paragraph describing the session (domain, speakers, jargon). When supplied, the server runs a background LLM refinement pass; response text is polished for proper nouns, filler removal, and punctuation. Omit to skip refinement. |
| response_format | string | "markdown" | "markdown" or "json" |
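As a rough sketch of how these parameters fit together, the helper below assembles an arguments object for sixtydb_stt_transcribe using the names and defaults from the table. The function build_transcribe_args and the example URL are hypothetical, and how the arguments reach the tool depends on the MCP client in use.

```python
from typing import Any, Dict, Optional


def build_transcribe_args(
    audio_url: str,
    language: Optional[str] = None,
    diarize: bool = False,
    context: Optional[str] = None,
    response_format: str = "markdown",
) -> Dict[str, Any]:
    """Assemble tool arguments, leaving out optional fields that are not set."""
    if not audio_url:
        raise ValueError("audio_url is required")
    if response_format not in ("markdown", "json"):
        raise ValueError("response_format must be 'markdown' or 'json'")
    args: Dict[str, Any] = {
        "audio_url": audio_url,
        "diarize": diarize,
        "response_format": response_format,
    }
    if language and language != "auto":
        args["language"] = language  # explicit code: skips language identification
    if context:
        args["context"] = context  # supplying context triggers the refinement pass
    return args


# Hypothetical URL, for illustration only.
example = build_transcribe_args(
    "https://example.com/meeting.mp3", language="en", diarize=True, response_format="json"
)
```

Leaving language and context out of the object, rather than sending empty values, matches the table's "omit" wording for auto-detection and for skipping refinement.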
Context string example
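A minimal sketch of composing such a string, assuming only what the parameter description says (a free-form paragraph covering domain, speakers, and jargon). The helper make_context and the sample values are hypothetical; any plain paragraph would do.

```python
from typing import List


def make_context(domain: str, speakers: List[str], jargon: List[str]) -> str:
    """Join session details into one plain-text paragraph."""
    sentences = [f"This recording is {domain}."]
    if speakers:
        sentences.append("Speakers: " + ", ".join(speakers) + ".")
    if jargon:
        sentences.append("Terms that may appear: " + ", ".join(jargon) + ".")
    return " ".join(sentences)


# Sample values are invented for illustration.
context = make_context(
    "a cardiology case review",
    ["Dr. Rao", "Dr. Lindqvist"],
    ["troponin", "STEMI"],
)
```

The result is passed as the context argument unchanged.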
context is a plain string on the REST POST /stt tool (this page). The WebSocket /ws/stt endpoint takes a structured {general, text, terms} object instead; see the WebSocket STT reference.
Response shape (JSON)
A response with text: "" and warning_codes: ["no_speech_detected"] means the audio contained no speech. This is not an error, so do not retry.
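A caller might handle a parsed JSON response along these lines. The field names text, warning_codes, and speakers come from this page; the top-level segments key and both function names are assumptions made for the sketch.

```python
from typing import Any, Dict, List


def has_no_speech(result: Dict[str, Any]) -> bool:
    """True when the response signals silence rather than a failure."""
    warnings = result.get("warning_codes") or []
    return result.get("text", "") == "" and "no_speech_detected" in warnings


def speaker_labels(result: Dict[str, Any]) -> List[str]:
    """Collect distinct diarization labels; assumes a top-level 'segments' list."""
    labels = set()
    for segment in result.get("segments") or []:
        labels.update(segment.get("speakers") or [])
    return sorted(labels)
```

When has_no_speech returns True, the caller can report an empty transcript and move on instead of scheduling a retry.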
Usage in Claude
Supported Audio Formats
- WAV, MP3, M4A, OGG, FLAC, WebM, MP4 audio track
- Max file size: 25 MB
- Max duration: 1 hour
- Recommended: 16 kHz+ sample rate
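The limits above lend themselves to a quick pre-flight check before a URL is submitted. The sketch below is hypothetical: the extension set is an assumed mapping from the format names, "25 MB" is read as 25 × 1024 × 1024 bytes (the page does not say which convention applies), and duration is not checked because that requires decoding the file.

```python
import os
from typing import List

MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB
# Assumed file extensions for the formats listed above.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm", ".mp4"}


def preflight_problems(filename: str, size_bytes: int) -> List[str]:
    """Return human-readable problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unrecognized audio extension: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, over the {MAX_BYTES}-byte limit")
    return problems
```

A file that passes this check can still be rejected by the server, for example if it runs longer than one hour.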
Supported Languages
39 languages total. Fetch the live catalog from the sixtydb_stt_languages tool (or the REST GET /stt/languages endpoint). The catalog includes 25 European languages, 13 Indic languages with English code-switching, and Arabic MSA. See the Get STT Languages reference for the full list.
Related
- STT Models: exposed STT model catalog (currently 60db-stt-v01)
- WebSocket STT: real-time streaming variant (note: the WS form uses languages: null for auto-detect, not "auto")
- TTS: Text-to-speech