POST /stt
curl -X POST https://api.60db.ai/stt \
  -H "Authorization: Bearer your-api-key" \
  -F "[email protected]" \
  -F "language=auto" \
  -F "diarize=true" \
  -F "context=Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. Discussing batting technique."
{
  "request_id": "req_6822626d028743f5942526e0c08fa60c",
  "language": "en",
  "language_name": "English",
  "languages": null,
  "language_source": "fast_path",
  "duration_sec": 5.2,
  "processing_ms": 185,
  "rtf": 0.036,
  "text": "Hello, this is a test of the speech to text API. It works great!",
  "segments": [
    {
      "start": 0.0,
      "end": 3.1,
      "language": "en",
      "language_name": "English",
      "text": "Hello, this is a test of the speech to text API.",
      "confidence": 0.92,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.94 },
        { "word": "this",  "start": 0.35, "end": 0.52, "confidence": 0.93 }
      ],
      "speakers": [
        { "speaker": "SPEAKER_00", "start": 0.0, "end": 3.1 }
      ]
    },
    {
      "start": 3.1,
      "end": 5.2,
      "language": "en",
      "language_name": "English",
      "text": "It works great!",
      "confidence": 0.89,
      "words": [],
      "speakers": [
        { "speaker": "SPEAKER_01", "start": 3.1, "end": 5.2 }
      ]
    }
  ],
  "words": [],
  "warnings": [],
  "warning_codes": [],
  "language_detection": {
    "mode": "fast_path",
    "candidates": ["en"],
    "segment_count": 2,
    "lid_calls": 0
  }
}


Request

Headers

Authorization
string
required
Bearer token with your API key
Content-Type
string
required
multipart/form-data

Form Data

file
file
required
Audio file to transcribe.
  • Supported formats: WAV, MP3, M4A, OGG, FLAC, WebM, MP4 (audio track)
  • Max file size: 10 MB
  • Max duration: 1 hour
language
string
ISO 639-1 language code (e.g. en, hi, ar, fr). Omit this field or pass auto to enable language auto-detection across the 39 supported languages. When a valid code is specified, language identification is skipped entirely for the lowest latency.
diarize
boolean
default:"false"
Enable speaker diarization. When true, each segment of the response includes a speakers array identifying distinct speakers (SPEAKER_00, SPEAKER_01, …). Adds ~50–150 ms of processing latency per request.
context
string
Free-form paragraph describing the session: domain, speakers, jargon, and proper nouns you want preserved. When supplied, the server runs a background LLM refinement pass that polishes the response text for proper nouns, filler removal, and punctuation. Omit to skip refinement. Example: "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan. Discussing batting technique, stamina, running between wickets, off-side balls."
On the Free plan, context is silently stripped server-side and the response includes warning_codes: ["llm_refinement_not_in_plan"]. The transcript is produced without refinement; upgrade your plan to enable it.
The REST POST /stt form takes context as a plain string. The WebSocket /ws/stt endpoint takes a structured {general, text, terms} object instead — see the WebSocket STT reference.
keywords
string
Custom vocabulary boost. CSV with an optional :weight per term, e.g. "Acme:5,XYZ Pharma:8,off-side". Default weight 1.5, max 10. Biases the recognizer toward acoustically similar words that are spelled differently (brand names, jargon). Up to 30 entries are surfaced to the LLM hint; all entries participate in fuzzy / phonetic matching. Words replaced by the boost appear in the response with boosted: true and original.
languages
string
Constrain language-ID candidates. CSV of ISO 639-1 codes, e.g. "en,hi". Narrower lists are faster. When omitted, the full supported set is used.
min_speakers
integer
Diarization tuning. Lower bound on detected speaker count. Read only when diarize=true. Values <= 0 are clamped to null.
max_speakers
integer
Diarization tuning. Upper bound on detected speaker count. Read only when diarize=true.
return_timestamps
string
default:"none"
"none" | "word". Set to "word" to add start / end to each entry in words and segments[].words.
include_confidence
boolean
default:"false"
When true, adds a per-word confidence (0-1) to the response.
script_correction
boolean
default:"false"
Devanagari / Latin script normalization for code-mixed audio.
min_split_sec
number
Tunes the LID flapping detector: how aggressively to split segments on language change. The default is safe for most audio; intended for advanced use only.
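As a client-side sketch of the form fields above, the following helper assembles the multipart form data (the file part is attached separately). build_stt_form is a hypothetical function, not part of any SDK; the field names and string encodings follow the descriptions in this section.

```python
def build_stt_form(language=None, diarize=False, context=None,
                   keywords=None, languages=None,
                   return_timestamps="none", include_confidence=False):
    """Assemble /stt form fields as strings; omitted fields use server defaults."""
    form = {}
    if language:                       # omit entirely to auto-detect
        form["language"] = language
    if diarize:
        form["diarize"] = "true"
    if context:
        form["context"] = context
    if keywords:                       # dict of term -> weight (None = default 1.5)
        form["keywords"] = ",".join(
            term if weight is None else f"{term}:{weight}"
            for term, weight in keywords.items()
        )
    if languages:                      # e.g. ["en", "hi"] to constrain LID
        form["languages"] = ",".join(languages)
    if return_timestamps != "none":
        form["return_timestamps"] = return_timestamps
    if include_confidence:
        form["include_confidence"] = "true"
    return form
```

The resulting dict can be passed as the data argument of any multipart HTTP client alongside the audio file part.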

Response

The endpoint passes through the response shape from the 60db STT backend. Key fields:
request_id
string
Unique request identifier
text
string
Full normalized transcript (digit, entity, and bidi normalization applied)
language
string
Detected or specified ISO 639-1 language code (e.g. "en"). null when no speech was detected.
language_name
string
Full English language name (e.g. "English")
language_source
string
How the language was resolved: "fast_path" (caller specified a single language), "lid_per_segment" (auto-detected per segment), "long_audio_chunked" (file > 90 s ran through the chunker — debug-only), or "mixed".
snr_db
number
Audio signal-to-noise ratio in decibels. Useful as an audio-quality indicator: >= 15 good, 0–15 fair, < 0 poor. When the recording is too noisy, the audio is dropped before LID/ASR — the response has empty text and warning_codes includes low_snr_dropped (no charge).
duration_sec
number
Audio duration in seconds
processing_ms
number
Server processing time in milliseconds
rtf
number
Real-time factor (processing_ms / (duration_sec × 1000))
segments
array
Array of utterance-level segments. Each segment has {start, end, language, language_name, text, confidence, words[]}. When diarize=true, segments also include a speakers array. When the request ran through the long-audio chunker (language_source == "long_audio_chunked"), each segment also includes a debug-only chunk_idx integer (zero-based index of the chunk this segment came from).
words
array
Flat word-level list across all segments. Each word has {word, start, end, confidence?, boosted?, original?}. confidence is included when include_confidence=true. boosted: true and original are present when the keyword/context-terms boost replaced this word — the segment-level text is already rebuilt from boosted words upstream so no client-side stitching is required.
warnings
array
Non-fatal warnings. Each item has {code, message, affected_segments}. Common codes include no_speech_detected, inline_code_switch_partial, low_snr_dropped, llm_refinement_not_in_plan.
warning_codes
array
Flat list of the code values from warnings, for quick checks.
Code reference
  • no_speech_detected: Audio processed but contained no speech (silence / music). Not an error.
  • low_snr_dropped: Audio was dropped before LID/ASR because SNR was below the floor. Response text is empty. No credits are charged for this request.
  • llm_refinement_not_in_plan: The context field was provided but the active plan does not include LLM refinement. The field was ignored; transcription proceeded without refinement.
  • inline_code_switch_partial: Some words inside a segment were transcribed in a different language than the segment label.
language_detection
object
Internal language detection metadata: {mode, candidates[], segment_count, lid_calls}
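The response fields above can be interpreted with a small amount of client logic. This is a minimal sketch, assuming a parsed JSON body as a dict; classify_stt_result and realtime_factor are hypothetical helpers, and the rtf computation mirrors the formula given for the rtf field.

```python
def classify_stt_result(resp: dict) -> str:
    """Bucket a successful /stt response for UI handling using warning_codes."""
    codes = resp.get("warning_codes") or []
    if "low_snr_dropped" in codes:
        return "too_noisy"       # dropped before LID/ASR; no charge, suggest re-recording
    if "no_speech_detected" in codes:
        return "no_speech"       # silence or music only; not an error, do not retry
    return "ok"

def realtime_factor(resp: dict) -> float:
    """rtf = processing_ms / (duration_sec * 1000); lower is faster."""
    return resp["processing_ms"] / (resp["duration_sec"] * 1000)
```

For the example response above (processing_ms 185, duration_sec 5.2), this reproduces rtf ≈ 0.036.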

Errors

Status · error_code · When · Retry guidance
  • 401 UNAUTHENTICATED: Auth missing or invalid. Do not retry without credentials.
  • 402 INSUFFICIENT_CREDITS / ZERO_BALANCE: Workspace wallet is empty. Top up.
  • 429 STT_CONCURRENCY_LIMIT: Per-user concurrency cap reached (counted across REST and WS combined); details.limit carries the active cap. Retry after an in-flight request finishes; do not auto-retry without backoff.
  • 429 STT_UPSTREAM_RATE_LIMIT: Upstream STT service is rate-limiting. Honor the Retry-After HTTP header (seconds) rather than the JSON body.
  • 499 STT_CLIENT_CANCELLED: Client closed the connection before the response was returned; the upstream call was aborted. Intentional; do not retry. No charge.
  • 503 STT_UPSTREAM_UNAVAILABLE: Upstream STT service returned a 5xx. Retry with exponential backoff (1s → 2s → 4s …).
429 STT_CONCURRENCY_LIMIT
{
  "success": false,
  "error_code": "STT_CONCURRENCY_LIMIT",
  "message": "Too many concurrent STT requests for this user",
  "details": {
    "limit": 8,
    "retry_hint": "Wait for an in-flight request to complete"
  }
}
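One way to avoid tripping STT_CONCURRENCY_LIMIT is a client-side guard. This is a sketch only: the cap of 8 matches the example body above, but in real code you would read details.limit from an actual 429 response rather than hard-coding it.

```python
import threading

# Shared gate across all threads issuing /stt calls; sized to the
# assumed per-user cap (8, per the example 429 body).
stt_slots = threading.BoundedSemaphore(8)

def transcribe_guarded(send_request):
    """Run one /stt call while holding a concurrency slot."""
    with stt_slots:              # blocks until a slot frees up
        return send_request()
```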
429 STT_UPSTREAM_RATE_LIMIT
HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "success": false,
  "error_code": "STT_UPSTREAM_RATE_LIMIT",
  "message": "Upstream STT rate-limited",
  "details": { "retry_after": "30" }
}
499 STT_CLIENT_CANCELLED
{
  "success": false,
  "error_code": "STT_CLIENT_CANCELLED",
  "message": "Request cancelled by client"
}
503 STT_UPSTREAM_UNAVAILABLE
{
  "success": false,
  "error_code": "STT_UPSTREAM_UNAVAILABLE",
  "message": "Upstream STT service unavailable"
}
499 is non-standard but is used to signal “client closed connection before response was sent” (nginx convention). Treat it as expected when the user cancels; wire AbortController.abort() to your cancel button.
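The retry guidance in the table above can be centralized in one decision function. A minimal sketch, assuming the status codes and error_code values documented here; stt_retry_policy is a hypothetical helper, and a delay of None means "wait for an in-flight request to finish" rather than sleeping.

```python
def stt_retry_policy(status, error_code, retry_after=None, attempt=0):
    """Return (should_retry, delay_seconds_or_None) for an /stt error."""
    if status == 429 and error_code == "STT_UPSTREAM_RATE_LIMIT":
        # Honor the Retry-After header (seconds), not the JSON body.
        return True, float(retry_after or 30)
    if status == 429 and error_code == "STT_CONCURRENCY_LIMIT":
        # No timer: retry once an in-flight request completes.
        return True, None
    if status == 503:
        # Exponential backoff: 1s, 2s, 4s, ... per attempt number.
        return True, float(2 ** attempt)
    # 401, 402, 499: fix credentials / top up / user cancelled; no auto-retry.
    return False, None
```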

Notes

  • Auto-detect: The most reliable way to enable auto-detect is to omit the language field entirely. Passing "auto" is also accepted and treated identically.
  • Speaker labels: When diarize=true, raw speaker IDs look like SPEAKER_00, SPEAKER_01. Client UIs typically re-label these as “Speaker 1”, “Speaker 2” in order of first appearance for readability.
  • Empty transcript: A successful response with text: "" and warning_codes: ["no_speech_detected"] means the upload was processed but contained no speech (silence, noise, or music only). This is not an error — do not retry.
  • Low-SNR drop: A response with text: "" and warning_codes: ["low_snr_dropped"] means the audio was rejected upstream before LID/ASR because it was too noisy. No credits are charged. Surface “Audio too noisy — try recording in a quieter environment” rather than treating as a real result.
  • Diarization surcharge: When diarize=true is passed (or speakers are detected in the response), the request incurs a +30% surcharge on top of the base STT rate to cover GPU diarization cost.
  • Refunds: Empty transcripts caused by low_snr_dropped, client-aborted requests (499), and upstream errors (503) are not charged.
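The speaker re-labeling convention described in the notes can be done in a few lines. This is a sketch over the segments shape documented above; relabel_speakers is a hypothetical client helper that assigns "Speaker N" in order of first appearance.

```python
def relabel_speakers(segments):
    """Map raw diarization IDs (SPEAKER_00, ...) to 'Speaker N' labels."""
    names = {}       # raw ID -> display label, in order of first appearance
    labelled = []
    for seg in segments:
        for turn in seg.get("speakers", []):
            sid = turn["speaker"]
            if sid not in names:
                names[sid] = f"Speaker {len(names) + 1}"
            labelled.append({**turn, "label": names[sid]})
    return labelled
```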