Speech-to-Text (STT)

Transcribe audio files to text with optional speaker diarization.

Available Tools

| Tool | Description |
| --- | --- |
| `sixtydb_stt_transcribe` | Transcribe audio to text |
| `sixtydb_stt_logs` | View transcription history |
| `sixtydb_stt_get` | Get specific transcription details |

Transcribe Audio

Convert audio to text via the sixtydb_stt_transcribe tool:
```json
{
  "audio_url": "https://example.com/meeting.mp3",
  "language": "auto",
  "diarize": true
}
```

Auto-detect vs explicit language

  • Omit language (or pass "auto") to enable auto-detection across all 39 supported languages. The MCP shim strips "auto" before forwarding so the server’s language identification runs.
  • Pass a single ISO 639-1 code (e.g. "hi", "en", "ar") to skip language identification and run the fast path for that language.
  • Do not pass unsupported codes (ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he) or Arabic dialect tags (ar-eg, ar-lv, …) — they return an unsupported_language error. For non-MSA Arabic audio pass "ar" for best-effort MSA transcription.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `audio_url` | string | Required | URL to the audio file; downloaded and forwarded as multipart to `POST /stt` (max 25 MB) |
| `language` | string | `"auto"` | ISO 639-1 code, or `"auto"` / omit for auto-detect |
| `diarize` | boolean | `false` | Enable pyannote speaker diarization; adds a `speakers` array to each segment with `SPEAKER_00`, `SPEAKER_01`, … labels |
| `context` | string | — | Free-form paragraph describing the session (domain, speakers, jargon). When supplied, the server runs a background LLM refinement pass and the response text is polished for proper nouns, filler removal, and punctuation. Omit to skip refinement. |
| `response_format` | string | `"markdown"` | `"markdown"` or `"json"` |

Context string example

Cricket coaching session. Players: Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan. Discussing batting technique, stamina, running between wickets, off-side balls.
`context` is a plain string on the REST `POST /stt` tool (this page). The WebSocket `/ws/stt` endpoint takes a structured `{general, text, terms}` object instead; see the WebSocket STT reference.
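A client that already holds the WebSocket-style structured context can flatten it into the plain string this endpoint expects. A minimal sketch, assuming the `{general, text, terms}` field names from the WS reference; the flattening strategy itself is an illustration, not a documented mapping:

```python
# Illustrative only: flatten a WS-style structured context into the
# plain string accepted by the REST POST /stt endpoint.
def flatten_context(ctx: dict) -> str:
    parts = []
    if ctx.get("general"):
        parts.append(ctx["general"])
    if ctx.get("text"):
        parts.append(ctx["text"])
    if ctx.get("terms"):
        parts.append("Terms: " + ", ".join(ctx["terms"]))
    return " ".join(parts)
```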

Response shape (JSON)

```json
{
  "request_id": "req_...",
  "text": "Hello, thanks for joining...",
  "language": "en",
  "language_name": "English",
  "duration_sec": 5.2,
  "segments": [
    {
      "start": 0.0,
      "end": 3.1,
      "text": "Hello, thanks for joining the call today.",
      "confidence": 0.92,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.94 }
      ],
      "speakers": [
        { "speaker": "SPEAKER_00", "start": 0.0, "end": 3.1 }
      ]
    }
  ],
  "warning_codes": []
}
```
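A response in this shape can be turned into a speaker-attributed transcript. A minimal sketch assuming the JSON fields shown above; the helper name and the fallback label are illustrative:

```python
# Sketch: render a diarized response as "SPEAKER_XX: text" lines,
# falling back to a placeholder label when diarization was off.
def speaker_transcript(response: dict) -> str:
    lines = []
    for seg in response.get("segments", []):
        speakers = seg.get("speakers") or []
        label = speakers[0]["speaker"] if speakers else "UNKNOWN"
        lines.append(f"{label}: {seg['text']}")
    return "\n".join(lines)
```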
Empty-speech signal: A successful response with text: "" and warning_codes: ["no_speech_detected"] means the audio contained no speech. This is not an error — do not retry.
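That guidance can be encoded as a small guard before any retry logic; the helper is hypothetical:

```python
# Sketch: detect the documented empty-speech signal. A True result is
# a terminal success (silence or noise only), not an error to retry.
def is_empty_speech(response: dict) -> bool:
    return (response.get("text") == ""
            and "no_speech_detected" in response.get("warning_codes", []))
```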

Usage in Claude

You: Transcribe this meeting audio
You: Transcribe this with speaker identification
You: Transcribe only in Hindi (use language "hi")

Supported Audio Formats

  • WAV, MP3, M4A, OGG, FLAC, WebM, MP4 audio track
  • Max file size: 25 MB
  • Max duration: 1 hour
  • Recommended: 16 kHz+ sample rate
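These limits can be checked client-side before uploading. An illustrative pre-flight sketch against the documented constraints; the helper, its signature, and the extension-based format check are assumptions:

```python
import os

# Documented limits: 25 MB max file size, 1 hour max duration,
# and the container formats listed above.
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a", ".ogg", ".flac", ".webm", ".mp4"}
MAX_BYTES = 25 * 1024 * 1024
MAX_DURATION_SEC = 3600

def check_audio(filename: str, size_bytes: int, duration_sec: float) -> list[str]:
    """Return a list of constraint violations; empty means OK to upload."""
    problems = []
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 25 MB")
    if duration_sec > MAX_DURATION_SEC:
        problems.append("duration exceeds 1 hour")
    return problems
```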

Supported Languages

39 languages total — fetch the live catalog from the sixtydb_stt_languages tool (or the REST GET /stt/languages endpoint). Includes 25 European languages, 13 Indic languages with English code-switching, and Arabic MSA. See the Get STT Languages reference for the full list.
Related

  • STT Models — exposed STT model catalog (currently 60db-stt-v01)
  • WebSocket STT — real-time streaming variant (note: the WS form uses languages: null for auto-detect, not "auto")
  • TTS — Text-to-speech