Speech-to-Text (STT)

Transcribe audio files to text with optional speaker diarization.

Available Tools

| Tool | Description |
| --- | --- |
| `sixtydb_stt_transcribe` | Transcribe audio to text |
| `sixtydb_stt_logs` | View transcription history |
| `sixtydb_stt_get` | Get specific transcription details |

Transcribe Audio

Convert audio to text via the sixtydb_stt_transcribe tool:
```json
{
  "audio_url": "https://example.com/meeting.mp3",
  "language": "auto",
  "diarize": true
}
```

Auto-detect vs explicit language

  • Omit language (or pass "auto") to enable auto-detection across all 39 supported languages. The MCP shim strips "auto" before forwarding so the server’s language identification runs.
  • Pass a single ISO 639-1 code (e.g. "hi", "en", "ar") to skip language identification and run the fast path for that language.
  • Do not pass unsupported codes (ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he) or Arabic dialect tags (ar-eg, ar-lv, …) — they return an unsupported_language error. For non-MSA Arabic audio pass "ar" for best-effort MSA transcription.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `audio_url` | string | Required | URL to the audio file; downloaded and forwarded as multipart to `POST /stt` (max 25 MB) |
| `language` | string | `"auto"` | ISO 639-1 code, or `"auto"` / omit for auto-detect |
| `diarize` | boolean | `false` | Enable pyannote speaker diarization; adds a `speakers` array to each segment with `SPEAKER_00`, `SPEAKER_01`, … labels |
| `context` | string | — | Free-form paragraph describing the session (domain, speakers, jargon). When supplied, the server runs a background LLM refinement pass and the response text is polished for proper nouns, filler removal, and punctuation. Omit to skip refinement. |
| `response_format` | string | `"markdown"` | `"markdown"` or `"json"` |

Context string example

Cricket coaching session. Players: Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan. Discussing batting technique, stamina, running between wickets, off-side balls.
`context` is a plain string on the REST `POST /stt` tool (this page). The WebSocket `/ws/stt` endpoint takes a structured `{general, text, terms}` object instead; see the WebSocket STT reference.
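A client that already holds the WebSocket-style structured context can flatten it into the plain string this endpoint expects. A minimal sketch, assuming the `{general, text, terms}` field names from the WS reference; the flattening strategy itself is an illustration, not a documented mapping:

```python
# Illustrative only: flatten a WS-style structured context into the
# plain string accepted by the REST POST /stt endpoint.
def flatten_context(ctx: dict) -> str:
    parts = []
    if ctx.get("general"):
        parts.append(ctx["general"])
    if ctx.get("text"):
        parts.append(ctx["text"])
    if ctx.get("terms"):
        parts.append("Terms: " + ", ".join(ctx["terms"]))
    return " ".join(parts)
```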

Response shape (JSON)

```json
{
  "request_id": "req_...",
  "text": "Hello, thanks for joining...",
  "language": "en",
  "language_name": "English",
  "duration_sec": 5.2,
  "segments": [
    {
      "start": 0.0,
      "end": 3.1,
      "text": "Hello, thanks for joining the call today.",
      "confidence": 0.92,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.94 }
      ],
      "speakers": [
        { "speaker": "SPEAKER_00", "start": 0.0, "end": 3.1 }
      ]
    }
  ],
  "warning_codes": []
}
```
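A response in this shape can be turned into a speaker-attributed transcript. A minimal sketch assuming the JSON fields shown above; the helper name and the fallback label are illustrative:

```python
# Sketch: render a diarized response as "SPEAKER_XX: text" lines,
# falling back to a placeholder label when diarization was off.
def speaker_transcript(response: dict) -> str:
    lines = []
    for seg in response.get("segments", []):
        speakers = seg.get("speakers") or []
        label = speakers[0]["speaker"] if speakers else "UNKNOWN"
        lines.append(f"{label}: {seg['text']}")
    return "\n".join(lines)
```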
Empty-speech signal: A successful response with text: "" and warning_codes: ["no_speech_detected"] means the audio contained no speech. This is not an error — do not retry.
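That guidance can be encoded as a small guard before any retry logic; the helper is hypothetical:

```python
# Sketch: detect the documented empty-speech signal. A True result is
# a terminal success (silence or noise only), not an error to retry.
def is_empty_speech(response: dict) -> bool:
    return (response.get("text") == ""
            and "no_speech_detected" in response.get("warning_codes", []))
```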

Usage in Claude

You: Transcribe this meeting audio
You: Transcribe this with speaker identification
You: Transcribe only in Hindi (use language "hi")

Supported Audio Formats

  • WAV, MP3, M4A, OGG, FLAC, WebM, MP4 audio track
  • Max file size: 25 MB
  • Max duration: 1 hour
  • Recommended: 16 kHz+ sample rate
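These limits can be checked client-side before uploading. An illustrative pre-flight sketch against the documented constraints; the helper, its signature, and the extension-based format check are assumptions:

```python
import os

# Documented limits: 25 MB max file size, 1 hour max duration,
# and the container formats listed above.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm", ".mp4"}
MAX_BYTES = 25 * 1024 * 1024
MAX_DURATION_SEC = 3600

def check_audio(filename: str, size_bytes: int, duration_sec: float) -> list[str]:
    """Return a list of constraint violations; empty means OK to upload."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 25 MB")
    if duration_sec > MAX_DURATION_SEC:
        problems.append("duration exceeds 1 hour")
    return problems
```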

Supported Languages

39 languages total — fetch the live catalog from the sixtydb_stt_languages tool (or the REST GET /stt/languages endpoint). Includes 25 European languages, 13 Indic languages with English code-switching, and Arabic MSA. See the Get STT Languages reference for the full list.
Related

  • STT Models — exposed STT model catalog (currently 60db-stt-v01)
  • WebSocket STT — real-time streaming variant (note: the WS form uses languages: null for auto-detect, not "auto")
  • TTS — Text-to-speech