Speech to Text

Request

Headers

Authorization

string

required

Bearer token with your API key

Content-Type

string

required

multipart/form-data

Form Data

file

required

Audio file to transcribe.

Supported formats: WAV, MP3, M4A, OGG, FLAC, WebM, MP4 (audio track)
Max file size: 10 MB
Max duration: 1 hour

language

string

ISO 639-1 language code (e.g. en, hi, ar, fr). Omit this field or pass auto to enable language auto-detection across the 39 supported languages. When specified and valid, skips language identification entirely for lowest latency.

diarize

boolean

default:"false"

Enable speaker diarization. When true, each segment of the response includes a speakers array identifying distinct speakers (SPEAKER_00, SPEAKER_01, …). Adds ~50–150 ms of processing latency per request.

context

string

Free-form paragraph describing the session: domain, speakers, jargon, and proper nouns you want preserved. When supplied, the server runs a background LLM refinement pass that polishes the response text (proper nouns, filler removal, punctuation). Omit to skip refinement. Example:

"Cricket coaching session. Players: Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan. Discussing batting technique, stamina, running between wickets, off-side balls."

On the Free plan, context is silently stripped server-side and the response includes warning_codes: ["llm_refinement_not_in_plan"]. The transcript is produced without refinement. Upgrade your plan to enable refinement.

The REST POST /stt form takes context as a plain string. The WebSocket /ws/stt endpoint takes a structured {general, text, terms} object instead — see the WebSocket STT reference.

keywords

string

Custom vocabulary boost. Comma-separated list with an optional :weight per term, e.g. "Acme:5,XYZ Pharma:8,off-side". The default weight is 1.5 and the maximum is 10. Use it to bias the recognizer toward terms that sound like other words but are spelled differently (brand names, jargon). Up to 30 entries are surfaced to the LLM hint; all entries participate in fuzzy / phonetic matching. Words replaced by the boost appear in the response with boosted: true and an original field.
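As an illustration of the keywords format, the sketch below serializes a mapping of terms to optional weights. The helper name build_keywords is hypothetical, and two behaviors are assumptions rather than documented rules: weights above 10 are capped client-side, and terms containing "," or ":" are rejected because no escaping rule is described.

```python
def build_keywords(terms):
    """Serialize {term: weight or None} into the CSV form used by `keywords`.

    A weight of None leaves the term unweighted, so the server default applies.
    """
    max_weight = 10  # documented maximum weight
    parts = []
    for term, weight in terms.items():
        if "," in term or ":" in term:
            # Assumption: the format has no escaping, so refuse ambiguous terms.
            raise ValueError(f"term contains a reserved character: {term!r}")
        if weight is None:
            parts.append(term)
        else:
            # Assumption: cap client-side instead of letting the server decide.
            capped = min(weight, max_weight)
            parts.append(f"{term}:{capped:g}")
    return ",".join(parts)


print(build_keywords({"Acme": 5, "XYZ Pharma": 8, "off-side": None}))
```

The printed string can be passed as the keywords form field.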

languages

string

Constrain language-ID candidates. CSV of ISO 639-1 codes, e.g. "en,hi". Narrower lists are faster. When omitted, the full supported set is used.

min_speakers

integer

Diarization tuning. Lower bound on detected speaker count. Read only when diarize=true. Values <= 0 are treated as null (unset).

max_speakers

integer

Diarization tuning. Upper bound on detected speaker count. Read only when diarize=true.

return_timestamps

string

default:"none"

"none" | "word". Set to "word" to add start / end to each entry in words and segments[].words.

include_confidence

boolean

default:"false"

When true, adds a per-word confidence (0-1) to the response.

script_correction

boolean

default:"false"

When true, applies Devanagari / Latin script normalization for code-mixed audio.

min_split_sec

number

Tuning for the LID flapping detector: how aggressively to split on language change. The default is safe; intended for advanced users only.
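The optional fields above interact in a few ways (language can be omitted for auto-detection, and the speaker bounds are read only when diarize=true). A minimal sketch of assembling the non-file form fields follows. The function build_form_fields is hypothetical, and sending booleans as the lowercase strings "true" / "false" is an assumption based on the curl example.

```python
def build_form_fields(language=None, diarize=False, min_speakers=None,
                      max_speakers=None, return_timestamps="none",
                      include_confidence=False):
    """Return a dict of form fields (excluding `file`) for POST /stt."""
    fields = {}
    # Omitting `language` (or passing "auto") enables auto-detection,
    # so only send it when a concrete code was requested.
    if language and language != "auto":
        fields["language"] = language
    if diarize:
        fields["diarize"] = "true"
        # The speaker bounds are read only when diarize=true.
        if min_speakers is not None and min_speakers > 0:
            fields["min_speakers"] = str(min_speakers)
        if max_speakers is not None:
            fields["max_speakers"] = str(max_speakers)
    if return_timestamps not in ("none", "word"):
        raise ValueError("return_timestamps must be 'none' or 'word'")
    if return_timestamps != "none":
        fields["return_timestamps"] = return_timestamps
    if include_confidence:
        fields["include_confidence"] = "true"
    return fields


print(build_form_fields(language="auto", diarize=True, max_speakers=3))
```

Leaving defaults out of the request keeps the payload small and lets the server apply its own defaults.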

Response

The endpoint passes through the response shape from the 60db STT backend. Key fields:

request_id

string

Unique request identifier

text

string

Full normalized transcript (digit, entity, and bidi normalization applied)

language

string

Detected or specified ISO 639-1 language code (e.g. "en"). null when no speech was detected.

language_name

string

Full English language name (e.g. "English")

language_source

string

How the language was resolved: "fast_path" (caller specified a single language), "lid_per_segment" (auto-detected per segment), "long_audio_chunked" (file > 90 s ran through the chunker — debug-only), or "mixed".

snr_db

number

Audio signal-to-noise ratio in decibels. Useful as an audio-quality indicator: >= 15 good, 0–15 fair, < 0 poor. When the recording is too noisy, the audio is dropped before LID/ASR — the response has empty text and warning_codes includes low_snr_dropped (no charge).
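The quality bands for snr_db can be expressed as a small helper. This is only a sketch: the function name and the returned labels are illustrative, and the boundaries follow the thresholds stated above (15 counts as good, 0 counts as fair).

```python
def snr_quality(snr_db):
    """Map snr_db to the bands given above: >= 15 good, 0 to 15 fair, < 0 poor."""
    if snr_db >= 15:
        return "good"
    if snr_db >= 0:
        return "fair"
    return "poor"


for value in (22.5, 7.0, -3.0):
    print(value, snr_quality(value))
```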

duration_sec

number

Audio duration in seconds

processing_ms

number

Server processing time in milliseconds

rtf

number

Real-time factor (processing_ms / (duration_sec × 1000))
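As a quick check of the formula, the sketch below recomputes rtf from processing_ms and duration_sec using the values from the sample response on this page (185 ms for 5.2 s of audio). The function name is hypothetical.

```python
def real_time_factor(processing_ms, duration_sec):
    """Compute rtf = processing_ms / (duration_sec * 1000)."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return processing_ms / (duration_sec * 1000)


print(round(real_time_factor(185, 5.2), 3))  # 0.036
```

Values below 1 mean the audio was processed faster than real time.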

segments

array

Array of utterance-level segments. Each segment has {start, end, language, language_name, text, confidence, words[]}. When diarize=true, segments also include a speakers array. When the request ran through the long-audio chunker (language_source == "long_audio_chunked"), each segment also includes a debug-only chunk_idx integer (zero-based index of the chunk this segment came from).

words

array

Flat word-level list across all segments. Each word has {word, start, end, confidence?, boosted?, original?}. confidence is included when include_confidence=true. boosted: true and original are present when the keyword/context-terms boost replaced this word — the segment-level text is already rebuilt from boosted words upstream so no client-side stitching is required.
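To see which words the boost changed, a client can scan the flat words list for boosted entries. The sketch below assumes that original holds the text the recognizer produced before replacement, which the field name suggests but this page does not state explicitly. The sample data is invented for illustration.

```python
def boosted_replacements(response):
    """Return (original, word) pairs for every word flagged boosted: true."""
    pairs = []
    for entry in response.get("words") or []:
        if entry.get("boosted"):
            # Assumption: `original` is the pre-boost text for this word.
            pairs.append((entry.get("original"), entry["word"]))
    return pairs


sample = {
    "words": [
        {"word": "Acme", "start": 0.0, "end": 0.4, "boosted": True, "original": "acne"},
        {"word": "ships", "start": 0.5, "end": 0.9},
    ]
}
print(boosted_replacements(sample))  # [('acne', 'Acme')]
```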

warnings

array

Non-fatal warnings. Each item has {code, message, affected_segments}. Common codes include no_speech_detected, inline_code_switch_partial, low_snr_dropped, llm_refinement_not_in_plan.

warning_codes

array

Flat list of the code values from warnings, for quick checks.

Code	Meaning
`no_speech_detected`	Audio processed but contained no speech (silence / music). Not an error.
`low_snr_dropped`	Audio was dropped before LID/ASR because SNR was below the floor. Response `text` is empty. No credits charged for this request.
`llm_refinement_not_in_plan`	The `context` field was provided but the active plan does not include LLM refinement. The field was ignored; transcription proceeded without refinement.
`inline_code_switch_partial`	Some words inside a segment were transcribed in a different language than the segment label.
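One way a client might branch on warning_codes is sketched below. The function name and the returned labels ("too_noisy", "no_speech", "unrefined", "ok") are hypothetical; only the code strings come from the table above.

```python
def classify_result(response):
    """Pick a coarse client-side outcome from the response's warning_codes."""
    codes = set(response.get("warning_codes") or [])
    if "low_snr_dropped" in codes:
        return "too_noisy"   # audio dropped before LID/ASR, text is empty
    if "no_speech_detected" in codes:
        return "no_speech"   # processed, but silence or music only
    if "llm_refinement_not_in_plan" in codes:
        return "unrefined"   # transcript is valid, context was ignored
    return "ok"


print(classify_result({"text": "", "warning_codes": ["low_snr_dropped"]}))
```

Checking the flat warning_codes list avoids walking the full warnings array when only the code matters.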

language_detection

object

Internal language detection metadata: {mode, candidates[], segment_count, lid_calls}

curl -X POST https://api.60db.ai/stt \
  -H "Authorization: Bearer your-api-key" \
  -F "file=@audio.wav" \
  -F "language=en" \
  -F "diarize=true" \
  -F "context=Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. Discussing batting technique."

{
  "request_id": "req_6822626d028743f5942526e0c08fa60c",
  "language": "en",
  "language_name": "English",
  "languages": null,
  "language_source": "fast_path",
  "duration_sec": 5.2,
  "processing_ms": 185,
  "rtf": 0.036,
  "text": "Hello, this is a test of the speech to text API. It works great!",
  "segments": [
    {
      "start": 0.0,
      "end": 3.1,
      "language": "en",
      "language_name": "English",
      "text": "Hello, this is a test of the speech to text API.",
      "confidence": 0.92,
      "words": [
        { "word": "Hello", "start": 0.0, "end": 0.32, "confidence": 0.94 },
        { "word": "this",  "start": 0.35, "end": 0.52, "confidence": 0.93 }
      ],
      "speakers": [
        { "speaker": "SPEAKER_00", "start": 0.0, "end": 3.1 }
      ]
    },
    {
      "start": 3.1,
      "end": 5.2,
      "language": "en",
      "language_name": "English",
      "text": "It works great!",
      "confidence": 0.89,
      "words": [],
      "speakers": [
        { "speaker": "SPEAKER_01", "start": 3.1, "end": 5.2 }
      ]
    }
  ],
  "words": [],
  "warnings": [],
  "warning_codes": [],
  "language_detection": {
    "mode": "fast_path",
    "candidates": ["en"],
    "segment_count": 2,
    "lid_calls": 0
  }
}

Errors

Status	`error_code`	When	Retry guidance
401	`UNAUTHENTICATED`	Auth missing or invalid	Don’t retry without credentials
402	`INSUFFICIENT_CREDITS` / `ZERO_BALANCE`	Workspace wallet has insufficient credits or a zero balance	Top up the workspace wallet
429	`STT_CONCURRENCY_LIMIT`	Per-user concurrency cap reached (counted across REST + WS combined). `details.limit` carries the active cap.	Retry after an in-flight request finishes; do not auto-retry without backoff
429	`STT_UPSTREAM_RATE_LIMIT`	Upstream STT service is rate-limiting	Honor the `Retry-After` HTTP header (seconds) rather than the JSON body
499	`STT_CLIENT_CANCELLED`	Client closed the connection before the response was returned. The upstream call was aborted.	Intentional; no retry. No charge.
503	`STT_UPSTREAM_UNAVAILABLE`	Upstream STT service returned a 5xx	Retry with exponential backoff (1s → 2s → 4s …)

429 STT_CONCURRENCY_LIMIT

{
  "success": false,
  "error_code": "STT_CONCURRENCY_LIMIT",
  "message": "Too many concurrent STT requests for this user",
  "details": {
    "limit": 8,
    "retry_hint": "Wait for an in-flight request to complete"
  }
}

429 STT_UPSTREAM_RATE_LIMIT

HTTP/1.1 429 Too Many Requests
Retry-After: 30
Content-Type: application/json

{
  "success": false,
  "error_code": "STT_UPSTREAM_RATE_LIMIT",
  "message": "Upstream STT rate-limited",
  "details": { "retry_after": "30" }
}

499 STT_CLIENT_CANCELLED

{
  "success": false,
  "error_code": "STT_CLIENT_CANCELLED",
  "message": "Request cancelled by client"
}

503 STT_UPSTREAM_UNAVAILABLE

{
  "success": false,
  "error_code": "STT_UPSTREAM_UNAVAILABLE",
  "message": "Upstream STT service unavailable"
}

499 is non-standard but is used to signal “client closed connection before response was sent” (nginx convention). Treat as expected when the user cancels — wire AbortController.abort() to your cancel button.
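The retry guidance in the table above can be collected into one decision function. This is a sketch under assumptions: the function name, the 1-second base delay for the concurrency case, and the fallback used when Retry-After is missing or unparseable are not specified on this page.

```python
def retry_delay(status, error_code, headers=None, attempt=0):
    """Return seconds to wait before retrying, or None when no retry is advised.

    `attempt` is the zero-based count of retries already made.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in (headers or {}).items()}
    backoff = 1.0 * (2 ** attempt)  # 1s, 2s, 4s, ...

    if status == 429 and error_code == "STT_UPSTREAM_RATE_LIMIT":
        # Prefer the Retry-After header (seconds) over the JSON body.
        try:
            return float(lowered.get("retry-after"))
        except (TypeError, ValueError):
            return backoff  # assumption: fall back to exponential backoff
    if status == 429 and error_code == "STT_CONCURRENCY_LIMIT":
        return backoff  # never retry immediately; an in-flight request must finish
    if status == 503:
        return backoff
    # 401, 402 and 499 need user action (or none), not an automatic retry.
    return None


print(retry_delay(429, "STT_UPSTREAM_RATE_LIMIT", {"Retry-After": "30"}))  # 30.0
print(retry_delay(503, "STT_UPSTREAM_UNAVAILABLE", attempt=2))  # 4.0
```

A caller would typically also cap the number of attempts, which is left out here for brevity.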

Notes

Auto-detect: The most reliable way to enable auto-detect is to omit the language field entirely. Passing "auto" is also accepted and treated identically.
Speaker labels: When diarize=true, raw speaker IDs look like SPEAKER_00, SPEAKER_01. Client UIs typically re-label these as “Speaker 1”, “Speaker 2” in order of first appearance for readability.
Empty transcript: A successful response with text: "" and warning_codes: ["no_speech_detected"] means the upload was processed but contained no speech (silence, noise, or music only). This is not an error — do not retry.
Low-SNR drop: A response with text: "" and warning_codes: ["low_snr_dropped"] means the audio was rejected upstream before LID/ASR because it was too noisy. No credits are charged. Surface “Audio too noisy — try recording in a quieter environment” rather than treating as a real result.
Diarization surcharge: When diarize=true is passed (or speakers are detected in the response), the request incurs a +30% surcharge on top of the base STT rate to cover GPU diarization cost.
Refunds: Empty transcripts caused by low_snr_dropped, client-aborted requests (499), and upstream errors (503) are not charged.
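The speaker re-labeling mentioned in the notes above can be sketched as follows. The helper name is hypothetical, and the code assumes that segments (and the entries inside each speakers array) arrive in chronological order, so that iteration order matches order of first appearance.

```python
def relabel_speakers(segments):
    """Map raw IDs (SPEAKER_00, ...) to "Speaker N" in order of first appearance."""
    labels = {}
    for segment in segments:
        for turn in segment.get("speakers", []):
            raw = turn["speaker"]
            if raw not in labels:
                labels[raw] = f"Speaker {len(labels) + 1}"
    return labels


segments = [
    {"speakers": [{"speaker": "SPEAKER_01", "start": 0.0, "end": 3.1}]},
    {"speakers": [{"speaker": "SPEAKER_00", "start": 3.1, "end": 5.2}]},
    {"speakers": [{"speaker": "SPEAKER_01", "start": 5.2, "end": 6.0}]},
]
print(relabel_speakers(segments))
# {'SPEAKER_01': 'Speaker 1', 'SPEAKER_00': 'Speaker 2'}
```

Segments without a speakers array (for example when diarize=false) are skipped, so the function returns an empty mapping in that case.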
