Every product surface shares this page. Dated entries flag the surfaces they touched (REST, WebSocket, SDK, CLI, MCP, Proxy) so you can skim by what you integrate with.
2026-04-30
STT/TTS reliability + new features

Added

Longer request timeout (REST POST /stt).
  • Server-side request timeout raised from 120 s → 600 s; clients should match. The 25 MB / ~1-hour file-size cap remains.
Custom vocabulary boost.
  • New keywords form field on POST /stt. CSV with optional per-term weights, e.g. Acme:5,XYZ Pharma:8.
  • The same boost runs on the WebSocket path via the existing context.terms field — no client change required to benefit; existing terms payloads now produce boosted / original markers automatically.
  • Words replaced by the boost appear in the response with boosted: true and original: "<pre-boost word>".
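
The keywords CSV described above can be assembled from a mapping of terms to optional weights. The sketch below is a minimal illustration, not the SDK's API: `build_keywords` is a hypothetical helper, and rejecting terms that contain `,` or `:` is an assumption, since no escaping syntax is documented here.

```python
def build_keywords(terms):
    """Serialize {term: weight-or-None} into the CSV format shown for `keywords`."""
    parts = []
    for term, weight in terms.items():
        term = term.strip()
        if not term:
            continue  # ignore empty terms
        if "," in term or ":" in term:
            # Assumption: no escaping syntax is documented, so reject ambiguous terms.
            raise ValueError(f"term may not contain ',' or ':': {term!r}")
        # A weight of None means "boost with the server's default weight".
        parts.append(term if weight is None else f"{term}:{weight}")
    return ",".join(parts)


print(build_keywords({"Acme": 5, "XYZ Pharma": 8}))  # Acme:5,XYZ Pharma:8
```

The resulting string would be sent as the `keywords` form field on POST /stt.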
Word-level timestamps and confidence.
  • return_timestamps=word adds per-word start / end to the REST response.
  • include_confidence=true adds per-word confidence (0-1).
Diarization controls.
  • min_speakers / max_speakers form fields on POST /stt.
Script correction.
  • script_correction=true for code-mixed Devanagari / Latin audio.
Audio quality indicator.
  • snr_db is now a top-level field on every STT response and a per-message field on WS transcription events. Surface as a “good / fair / poor” badge (>= 15 dB good, 0 to < 15 dB fair, < 0 dB poor).
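
The badge thresholds for snr_db can be expressed as a small mapping function. This is a sketch only: `snr_badge` is a hypothetical helper name, and returning "unknown" when the field is missing is an assumption rather than documented behavior.

```python
def snr_badge(snr_db):
    """Map an snr_db value (in dB) to a UI badge label."""
    if snr_db is None:
        return "unknown"  # assumption: field absent, e.g. on older responses
    if snr_db >= 15:
        return "good"
    if snr_db >= 0:
        return "fair"
    return "poor"


for value in (22.5, 7.0, -3.2):
    print(value, snr_badge(value))
```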

Limits

Per-user concurrency cap. Counted across REST + WS combined.

| Service | Default |
| --- | --- |
| STT | 8 |
| TTS | 5 |

Excess returns:
  • REST: 429 with error_code: STT_CONCURRENCY_LIMIT / TTS_CONCURRENCY_LIMIT and details.limit.
  • WebSocket: error frame followed by close code 1008. The cap releases when an in-flight request completes; there is no server-side queueing.
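
Because the server does not queue excess requests, a client can avoid most concurrency errors by capping its own in-flight calls. The sketch below uses the standard library's `threading.BoundedSemaphore`; `ConcurrencyGate` is a hypothetical name, and it only limits a single process, whereas the server counts every REST and WS connection for the user.

```python
import threading
from contextlib import contextmanager

# Default per-user caps from the table above.
DEFAULT_CAPS = {"stt": 8, "tts": 5}


class ConcurrencyGate:
    """Client-side limiter that mirrors the server's per-user caps."""

    def __init__(self, caps=None):
        caps = caps or DEFAULT_CAPS
        self._slots = {name: threading.BoundedSemaphore(n) for name, n in caps.items()}

    @contextmanager
    def slot(self, service, timeout=None):
        sem = self._slots[service]
        # Block until a slot frees up, or give up after `timeout` seconds.
        if not sem.acquire(timeout=timeout):
            raise TimeoutError(f"no free {service} slot within {timeout}s")
        try:
            yield
        finally:
            sem.release()  # frees the slot as soon as the request completes
```

Wrapping each request in `with gate.slot("stt"):` keeps the process under the cap; a 429 or close code 1008 can still occur if other processes share the same user.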

New error codes

| Status | error_code | When | Retry guidance |
| --- | --- | --- | --- |
| 429 | STT_CONCURRENCY_LIMIT / TTS_CONCURRENCY_LIMIT | Per-user concurrency cap | Retry after an in-flight request finishes |
| 429 | STT_UPSTREAM_RATE_LIMIT | Upstream rate-limit pass-through | Honor the Retry-After HTTP header |
| 499 | STT_CLIENT_CANCELLED | Client closed the connection mid-flight | Intentional; no retry, no charge |
| 503 | STT_UPSTREAM_UNAVAILABLE | Upstream STT 5xx | Exponential backoff |

499 is non-standard but used (nginx convention) to signal the client closed the connection before the response was sent.
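
The retry guidance above can be collapsed into one decision function. This is a minimal sketch under stated assumptions: `retry_decision` is a hypothetical helper, the backoff base and cap are illustrative values, and only the delta-seconds form of Retry-After is parsed (an HTTP-date value falls back to the base delay).

```python
def retry_decision(status, error_code, retry_after=None, attempt=0, base=1.0, cap=60.0):
    """Return (should_retry, delay_seconds).

    A delay of None with should_retry=True means: wait until one of your own
    in-flight requests finishes rather than sleeping for a fixed time.
    """
    if status == 429 and error_code in ("STT_CONCURRENCY_LIMIT", "TTS_CONCURRENCY_LIMIT"):
        return True, None
    if status == 429 and error_code == "STT_UPSTREAM_RATE_LIMIT":
        try:
            delay = float(retry_after)  # Retry-After given in seconds
        except (TypeError, ValueError):
            delay = base  # header missing or not in delta-seconds form
        return True, delay
    if status == 499:
        return False, None  # client cancelled on purpose; nothing to retry
    if status == 503 and error_code == "STT_UPSTREAM_UNAVAILABLE":
        return True, min(cap, base * 2 ** attempt)  # exponential backoff
    return False, None


print(retry_decision(503, "STT_UPSTREAM_UNAVAILABLE", attempt=3))  # (True, 8.0)
```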

New warning codes

| Code | Meaning |
| --- | --- |
| low_snr_dropped | Audio dropped before LID/ASR; SNR below floor. Response text is empty. No credits charged. |
| llm_refinement_not_in_plan | The context field was provided but the active plan does not include LLM refinement. The field was ignored; transcription proceeded without refinement. |

The same low_snr_dropped value appears as a processing_mode on WS transcription events: treat as “skip this utterance, no useful text”.
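
A client can translate these warnings into UI actions in one place. The sketch below assumes the REST response carries a `warning_codes` list and that WS events carry `processing_mode`; `interpret_warnings` and the keys of its return value are hypothetical names chosen for illustration.

```python
KNOWN_WARNINGS = {"low_snr_dropped", "llm_refinement_not_in_plan"}


def interpret_warnings(payload):
    """Turn warning codes on a REST response or WS event into UI decisions."""
    codes = set(payload.get("warning_codes") or [])
    low_snr = (
        "low_snr_dropped" in codes
        or payload.get("processing_mode") == "low_snr_dropped"  # WS variant
    )
    return {
        "skip_utterance": low_snr,  # empty text, nothing to render
        "show_upgrade_prompt": "llm_refinement_not_in_plan" in codes,
        "unknown_codes": sorted(codes - KNOWN_WARNINGS),  # log, do not fail
    }


print(interpret_warnings({"text": "", "warning_codes": ["low_snr_dropped"]}))
```

Unknown codes are returned rather than raised so that a newly introduced warning does not break older clients.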

Billing changes

  • WS double-bill fix. Sessions that previously double-billed when the client sent stop are now billed correctly. billing_summary.total_duration_seconds will show as ~50 % of historical values for affected sessions.
  • billing_summary.client_estimated_seconds (new, diagnostic) — rough estimate of audio the client sent. Useful only for debugging duration drift; never display as a billed number.
  • Low-SNR refunds. REST sessions where upstream dropped audio for low SNR no longer charge the user. WS already handled this.
  • Cancellation propagation. Client disconnects mid-request now correctly abort the upstream call. No charge for cancelled requests (499 STT_CLIENT_CANCELLED).
  • Diarization surcharge documented. When diarize=true, requests incur a documented +30 % surcharge on top of the base STT rate to cover GPU diarization (pyannote).
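
As a worked example of the surcharge arithmetic, the sketch below applies +30 % to a base cost when diarize=true. The per-minute rate and the billing unit are hypothetical placeholders, not published prices; `stt_cost` is an illustrative helper.

```python
from decimal import Decimal

DIARIZATION_SURCHARGE = Decimal("0.30")  # +30 % on top of the base STT rate


def stt_cost(duration_seconds, rate_per_minute, diarize=False):
    """Estimate cost for one request; the rate and unit are assumptions."""
    minutes = Decimal(str(duration_seconds)) / Decimal(60)
    cost = minutes * Decimal(str(rate_per_minute))
    if diarize:
        cost *= Decimal(1) + DIARIZATION_SURCHARGE
    return cost


# 10 minutes at a hypothetical 0.01 per minute:
print(stt_cost(600, "0.01"))                # 0.10
print(stt_cost(600, "0.01", diarize=True))  # 0.1300
```

`Decimal` is used instead of floats so that currency amounts do not pick up binary rounding error.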

Plan-tier gate

context (LLM refinement) is now a paid feature. On the Free plan the field is silently stripped server-side and the response includes warning_codes: ["llm_refinement_not_in_plan"]. Surface an upgrade prompt or hide / disable the input on Free plans for better UX.

Surface summary

  • REST POST /stt accepts new form fields (keywords, languages, min_speakers, max_speakers, return_timestamps, include_confidence, script_correction, min_split_sec); returns new fields (snr_db, language_source: "long_audio_chunked", per-word boosted/original/confidence, per-segment chunk_idx).
  • REST POST /tts, /tts-synthesize, /tts-stream — may now return 429 TTS_CONCURRENCY_LIMIT.
  • WebSocket /ws/stt — adds low_snr_dropped to processing_mode; per-word boosted/original; message-level snr_db; concurrency-limit error frame + close 1008; corrected billing_summary with new client_estimated_seconds field.
  • WebSocket /ws/tts — concurrency-limit error frame (legacy {error: {message, code, details}} shape) + close 1008.
  • Web UI — axios timeout raised to 600 s, error-code-aware toasts, warning_codes surfaced, WS 1008 distinguished from auth/network failures. The 25 MB upload cap is unchanged.
2026-04-14
LLM context refinement for STT

Added — Context-gated LLM refinement

Speech-to-Text now accepts an optional context hint that opens a server-side LLM refinement gate. When supplied, the response transcript is polished for proper-noun accuracy, filler removal, punctuation, and script consistency in mixed-language audio. The shape differs per transport:

| Transport | context shape | Enables |
| --- | --- | --- |
| REST POST /stt (and /v1/transcribe) | plain string, free-form paragraph | LLM refinement of response text |
| WebSocket /v1/stream (and /ws/stt) | structured object {general, text, terms} | Two-phase canonical final flow |

REST — context: string

Free-form paragraph describing the session (domain, speakers, jargon). Serialized as a multipart form field.
cURL
curl -X POST https://api.60db.ai/stt \
  -H "Authorization: Bearer $API_KEY" \
  -F "[email protected]" \
  -F "context=Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. Discussing batting technique, stamina, running between wickets."
JavaScript
await client.speechToText(file, {
  language: "auto",
  diarize: true,
  context: "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma.",
});
Python
client.speech_to_text(
    audio_file,
    language='auto',
    diarize=True,
    context='Cricket coaching session. Players: Arjun Mehta, Ishaan Verma.',
)
CLI
60db stt:transcribe --file meeting.wav \
  --context "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma."
MCP
{
  "audio_url": "https://example.com/meeting.wav",
  "context": "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma."
}

WebSocket — structured {general, text, terms}

Start message
{
  "type": "start",
  "languages": ["en", "hi"],
  "context": {
    "general": [
      { "key": "domain", "value": "Cricket coaching" },
      {
        "key": "players",
        "value": "Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan"
      }
    ],
    "text": "Coach reviewing a batting practice session.",
    "terms": ["Arjun Mehta", "Ishaan Verma", "off-side", "stamina", "wickets"]
  },
  "config": {
    "encoding": "linear",
    "sample_rate": 48000,
    "continuous_mode": true
  }
}
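
The same start message can be built programmatically. The sketch below mirrors the field names from the JSON above; `build_start_message` is a hypothetical helper, and omitting `context` entirely when no hints are supplied is an assumption about what the server accepts.

```python
import json


def build_start_message(languages, general=None, text=None, terms=None, sample_rate=48000):
    """Build the WebSocket `start` frame as a JSON string."""
    context = {}
    if general:
        # `general` is sent as a list of {key, value} pairs.
        context["general"] = [{"key": k, "value": v} for k, v in general.items()]
    if text:
        context["text"] = text
    if terms:
        context["terms"] = list(terms)

    message = {
        "type": "start",
        "languages": list(languages),
        "config": {
            "encoding": "linear",
            "sample_rate": sample_rate,
            "continuous_mode": True,
        },
    }
    if context:  # assumption: leave the key out when there is nothing to send
        message["context"] = context
    return json.dumps(message, ensure_ascii=False)


print(build_start_message(
    ["en", "hi"],
    general={"domain": "Cricket coaching"},
    text="Coach reviewing a batting practice session.",
    terms=["Arjun Mehta", "off-side"],
))
```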

Changed — WebSocket two-phase canonical flow

When context is supplied, each utterance now produces two transcription events sharing a sentence_id:
  1. First emit (is_final: true, speech_final: false): fast dict-corrected text. Use for low-latency UI paint and voicebot barge-in / NLU.
  2. Canonical (is_final: true, speech_final: true): the definitive answer. Either LLM-refined (when llm_applied: true) or the original text re-emitted (when the LLM was skipped or failed).
New canonical-only fields:

| Field | Type | Description |
| --- | --- | --- |
| llm_applied | boolean | true if the LLM ran, false if skipped / failed. |
| llm_latency_ms | number | Round-trip to the LLM endpoint (SLA monitoring). |
| llm_reason | string | Diagnostic when llm_applied: false (gate_closed, error:TimeoutException, etc.). |

Simple consumers can gate exclusively on speech_final: true and ignore the rest: there is one canonical event per utterance regardless of whether refinement is on. See the WebSocket STT Reference for the full table and reconciliation patterns.
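
One way to reconcile the two events is to key the display by sentence_id and let the canonical event replace the first emit. The sketch below is an illustration, not the reference client: `TranscriptView` is a hypothetical class, and the transcript field name `text` is an assumption about the event payload.

```python
class TranscriptView:
    """Keeps one entry per sentence_id; canonical text replaces the first emit."""

    def __init__(self):
        self.sentences = {}  # sentence_id -> {"text", "canonical", "llm_applied"}

    def on_event(self, event):
        if not event.get("is_final"):
            return None  # interim result; not tracked in this sketch
        sid = event["sentence_id"]
        canonical = bool(event.get("speech_final"))
        current = self.sentences.get(sid)
        if current and current["canonical"] and not canonical:
            return current  # a late first emit never overwrites the canonical text
        self.sentences[sid] = {
            "text": event.get("text", ""),
            "canonical": canonical,
            "llm_applied": bool(event.get("llm_applied", False)),
        }
        return self.sentences[sid]

    def committed_text(self):
        """Text safe to persist: canonical sentences only, in arrival order."""
        return " ".join(s["text"] for s in self.sentences.values() if s["canonical"])
```

A UI would paint every entry immediately and restyle it once `canonical` turns true; a consumer that only needs final text can read `committed_text()`.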

Added — Word-preservation guardrails (proxy + UI)

To defend against over-aggressive LLM refinement and fast-speech hallucination rejection:
  • Refined word-retention guardrail. If the canonical text drops more than ~60% of the first-emit’s token count, the proxy rolls back to the first-emit text and marks llm_applied: false, llm_reason: "dropped_too_many_words". Tuned so legitimate polish (filler removal, 10–40% compression) passes through untouched.
  • Hallucination-rejected fallback. When the upstream server’s word-rate guard emits an empty final (processing_mode: "hallucination_rejected", common on fast-paced Indic/English mixed audio), the proxy upgrades it into a tentative canonical using the cached first-emit / last-interim text plus tentative: true and tentative_reason: "hallucination_rejected" so words never silently disappear.
Both guardrails run in the /ws/stt proxy, so voicebots and web clients inherit them automatically.
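
The two guardrails can be sketched as pure functions. This is not the proxy's actual implementation: the function names are hypothetical, tokens are counted by whitespace splitting, and the event field `text` is an assumption.

```python
def apply_retention_guardrail(first_emit, canonical, max_drop=0.60):
    """Roll back to the first emit if refinement dropped too many tokens."""
    first_count = len(first_emit.split())
    kept = len(canonical.split())
    if first_count and (first_count - kept) / first_count > max_drop:
        return {
            "text": first_emit,
            "llm_applied": False,
            "llm_reason": "dropped_too_many_words",
        }
    return {"text": canonical, "llm_applied": True, "llm_reason": None}


def upgrade_rejected_final(event, cached_text):
    """Turn an empty hallucination-rejected final into a tentative canonical."""
    rejected = event.get("processing_mode") == "hallucination_rejected"
    if rejected and not event.get("text") and cached_text:
        return {
            **event,
            "text": cached_text,  # cached first-emit or last-interim text
            "tentative": True,
            "tentative_reason": "hallucination_rejected",
        }
    return event


first = "so um the the batting uh was you know really good"
print(apply_retention_guardrail(first, "The batting was really good."))
print(apply_retention_guardrail(first, "Batting good"))
```

In the first call 5 of 11 tokens survive (about 55 % dropped), so the refined text passes; in the second only 2 survive (about 82 % dropped), so the first-emit text is restored.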

Added — Backward-compat refined event handling

Pre-migration upstream builds emit LLM refinement as a separate refined event rather than a second transcription. The proxy and web client transparently accept both shapes, so existing integrations keep working through the rollout.
If you see refined events in the wire trace, upstream workers haven’t been restarted onto the two-phase build yet — it’s still fully functional; the refined event is deprecated but accepted.
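
A client that wants a single code path can normalize the legacy shape before processing. The sketch below is heavily assumption-based: it supposes events carry a `type` field with values `refined` and `transcription`, and `legacy_refined` is a hypothetical local marker that is not part of the wire protocol.

```python
def normalize_event(event):
    """Map a legacy `refined` event onto the canonical transcription shape."""
    if event.get("type") != "refined":
        return event  # already in the two-phase shape, pass through untouched
    normalized = dict(event)  # copy so the original frame stays intact for logging
    normalized.update(
        type="transcription",
        is_final=True,
        speech_final=True,    # treat the refined text as the canonical answer
        llm_applied=True,
        legacy_refined=True,  # local marker only, useful for rollout metrics
    )
    return normalized


print(normalize_event({"type": "refined", "sentence_id": "s1", "text": "Arjun Mehta played well."}))
```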

Surface summary

  • REST: POST /stt and POST /v1/transcribe accept a context: string form field.
  • WebSocket: /v1/stream and /ws/stt accept context: {general, text, terms} on start; two-phase canonical flow; llm_applied / llm_latency_ms / llm_reason on canonical.
  • JavaScript SDK: client.speechToText(audio, { context }).
  • Python SDK: client.speech_to_text(audio_file, context=...).
  • CLI: 60db stt:transcribe --context "...".
  • MCP Server: sixtydb_stt_transcribe accepts context.
  • Proxy — word-retention guardrail, hallucination fallback, first-emit caching, legacy refined shim.
  • Web UI — context input in Speech-to-Text page and realtime demo; segments rendered with tentative marker when flagged.

Environment knobs (server-side)

| Variable | Default | Purpose |
| --- | --- | --- |
| STT_LLM_ENABLED | true | Master kill-switch for refinement. |
| STT_LLM_MODEL | 60db-tiny | OpenAI-compatible model identifier. |
| STT_LLM_TIMEOUT_SEC | 10.0 | Per-call timeout; on timeout, canonical falls back to original. |
| STT_LLM_MIN_WORDS | 4 | Skip refinement for tiny utterances ("Yeah", "Okay"). |
| STT_WS_HALLUCINATION_WPS | 8.0 | Word-rate ceiling; finals above it are flagged as hallucination_rejected. |
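
To illustrate how these knobs interact, the sketch below loads them with their documented defaults and applies the two simplest checks. It is not the server's code: the class and function names are hypothetical, the accepted boolean spellings are an assumption, and words are counted by whitespace splitting.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    # Assumption: common truthy spellings; the server's exact parsing is not documented here.
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class SttLlmConfig:
    enabled: bool = True
    model: str = "60db-tiny"
    timeout_sec: float = 10.0
    min_words: int = 4
    hallucination_wps: float = 8.0

    @classmethod
    def from_env(cls, env=None):
        env = os.environ if env is None else env
        return cls(
            enabled=_as_bool(env.get("STT_LLM_ENABLED", "true")),
            model=env.get("STT_LLM_MODEL", "60db-tiny"),
            timeout_sec=float(env.get("STT_LLM_TIMEOUT_SEC", "10.0")),
            min_words=int(env.get("STT_LLM_MIN_WORDS", "4")),
            hallucination_wps=float(env.get("STT_WS_HALLUCINATION_WPS", "8.0")),
        )


def should_refine(cfg, text):
    """Refine only when enabled and the utterance has enough words."""
    return cfg.enabled and len(text.split()) >= cfg.min_words


def exceeds_word_rate(cfg, text, duration_sec):
    """True when words per second is above the hallucination ceiling."""
    return duration_sec > 0 and len(text.split()) / duration_sec > cfg.hallucination_wps


cfg = SttLlmConfig.from_env({})  # empty mapping, so every default applies
print(cfg)
```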