Every product surface shares this page. Dated entries flag the surfaces they touched (REST, WebSocket, SDK, CLI, MCP, Proxy) so you can skim by what you integrate with.
2026-04-14
LLM context refinement for STT

Added — Context-gated LLM refinement

Speech-to-Text now accepts an optional context hint that opens a server-side LLM refinement gate. When supplied, the response transcript is polished for proper-noun accuracy, filler removal, punctuation, and script consistency in mixed-language audio. The shape differs per transport:
| Transport | context shape | Enables |
| --- | --- | --- |
| REST POST /stt (and /v1/transcribe) | plain string — free-form paragraph | LLM refinement of response text |
| WebSocket /v1/stream (and /ws/stt) | structured object {general, text, terms} | Two-phase canonical final flow |

REST — context: string

Free-form paragraph describing the session (domain, speakers, jargon). Serialized as a multipart form field.
cURL
curl -X POST https://api.60db.ai/stt \
  -H "Authorization: Bearer $API_KEY" \
  -F "[email protected]" \
  -F "context=Cricket coaching session. Players: Arjun Mehta, Ishaan Verma. Discussing batting technique, stamina, running between wickets."
JavaScript
await client.speechToText(file, {
  language: "auto",
  diarize: true,
  context: "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma.",
});
Python
client.speech_to_text(
    audio_file,
    language='auto',
    diarize=True,
    context='Cricket coaching session. Players: Arjun Mehta, Ishaan Verma.',
)
CLI
60db stt:transcribe --file meeting.wav \
  --context "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma."
MCP
{
  "audio_url": "https://example.com/meeting.wav",
  "context": "Cricket coaching session. Players: Arjun Mehta, Ishaan Verma."
}

WebSocket — structured {general, text, terms}

Start message
{
  "type": "start",
  "languages": ["en", "hi"],
  "context": {
    "general": [
      { "key": "domain", "value": "Cricket coaching" },
      {
        "key": "players",
        "value": "Arjun Mehta, Ishaan Verma, Aryan Khan, Rohan"
      }
    ],
    "text": "Coach reviewing a batting practice session.",
    "terms": ["Arjun Mehta", "Ishaan Verma", "off-side", "stamina", "wickets"]
  },
  "config": {
    "encoding": "linear",
    "sample_rate": 48000,
    "continuous_mode": true
  }
}
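The start frame above can be assembled programmatically. A minimal sketch, assuming you build the frame yourself before sending it as the first text message on the socket — build_start_message is an illustrative helper, not part of the SDK:

```python
import json

def build_start_message(languages, general, text, terms,
                        sample_rate=48000, continuous=True):
    """Assemble a `start` frame with a structured context block.

    `general` is a plain dict of key/value hints; it is flattened into
    the [{"key": ..., "value": ...}] list shape shown in the changelog.
    """
    return {
        "type": "start",
        "languages": languages,
        "context": {
            "general": [{"key": k, "value": v} for k, v in general.items()],
            "text": text,
            "terms": terms,
        },
        "config": {
            "encoding": "linear",
            "sample_rate": sample_rate,
            "continuous_mode": continuous,
        },
    }

start = build_start_message(
    ["en", "hi"],
    {"domain": "Cricket coaching"},
    "Coach reviewing a batting practice session.",
    ["Arjun Mehta", "Ishaan Verma", "off-side"],
)
payload = json.dumps(start)  # send as the first frame after connecting
```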

Changed — WebSocket two-phase canonical flow

When context is supplied, each utterance now produces two transcription events sharing a sentence_id:
  1. First emit — is_final: true, speech_final: false — fast dict-corrected text. Use for low-latency UI paint and voicebot barge-in / NLU.
  2. Canonical — is_final: true, speech_final: true — the definitive answer. Either LLM-refined (when llm_applied: true) or the original text re-emitted (when the LLM was skipped or failed).
New canonical-only fields:
| Field | Type | Description |
| --- | --- | --- |
| llm_applied | boolean | true if the LLM ran, false if skipped / failed. |
| llm_latency_ms | number | Round-trip to the LLM endpoint (SLA monitoring). |
| llm_reason | string | Diagnostic when llm_applied: false (gate_closed, error:TimeoutException, etc.). |
Simple consumers can gate exclusively on speech_final: true and ignore the rest — one canonical event per utterance regardless of whether refinement is on. See the WebSocket STT Reference for the full table and reconciliation patterns.
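One way to consume the two-phase flow is to map each event to a rendering action keyed by sentence_id. A sketch under the field names documented above — classify_event and the action tuples are illustrative, not SDK API:

```python
def classify_event(event):
    """Map a transcription event to a rendering action.

    Returns ("paint_fast", sentence_id, text) for the first emit,
    ("paint_canonical", sentence_id, text) for the canonical event,
    or None for interims. Repainting by sentence_id lets the canonical
    text replace the fast text in place.
    """
    sid = event.get("sentence_id")
    if not event.get("is_final"):
        return None  # interim — ignore, or paint as a draft
    if event.get("speech_final"):
        return ("paint_canonical", sid, event["text"])
    return ("paint_fast", sid, event["text"])

# Two events sharing a sentence_id, as in the two-phase flow:
first = {"sentence_id": "s1", "is_final": True, "speech_final": False,
         "text": "umm the players arjun mehta"}
canonical = {"sentence_id": "s1", "is_final": True, "speech_final": True,
             "text": "The players: Arjun Mehta.", "llm_applied": True}
```

A voicebot that only needs the definitive text can drop the "paint_fast" branch and act solely on "paint_canonical".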

Added — Word-preservation guardrails (proxy + UI)

To defend against over-aggressive LLM refinement and fast-speech hallucination rejection:
  • Refined word-retention guardrail. If the canonical text drops more than ~60% of the first-emit’s token count, the proxy rolls back to the first-emit text and marks llm_applied: false, llm_reason: "dropped_too_many_words". Tuned so legitimate polish (filler removal, 10–40% compression) passes through untouched.
  • Hallucination-rejected fallback. When the upstream server’s word-rate guard emits an empty final (processing_mode: "hallucination_rejected", common on fast-paced Indic/English mixed audio), the proxy upgrades it into a tentative canonical using the cached first-emit / last-interim text plus tentative: true and tentative_reason: "hallucination_rejected" so words never silently disappear.
Both guardrails run in the /ws/stt proxy, so voicebots and web clients inherit them automatically.
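The retention check is simple to reason about in isolation. A minimal sketch of the rollback logic, assuming token counts are compared by whitespace split and using the documented ~60% threshold — the function name and return shape are illustrative, not the proxy's actual code:

```python
def apply_retention_guardrail(first_emit: str, refined: str,
                              max_drop: float = 0.6):
    """Roll back to the first-emit text if refinement drops too many words.

    Returns (text, llm_applied, llm_reason). Legitimate polish
    (10-40% compression) stays under the threshold and passes through.
    """
    first_tokens = len(first_emit.split())
    refined_tokens = len(refined.split())
    if first_tokens and (first_tokens - refined_tokens) / first_tokens > max_drop:
        return first_emit, False, "dropped_too_many_words"
    return refined, True, None
```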

Added — Backward-compat refined event handling

Pre-migration upstream builds emit LLM refinement as a separate refined event rather than a second transcription. The proxy and web client transparently accept both shapes, so existing integrations keep working through the rollout.
If you see refined events in the wire trace, upstream workers haven’t been restarted onto the two-phase build yet — it’s still fully functional; the refined event is deprecated but accepted.
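A compatibility shim for this dual-shape period can normalize both forms into one event type. A sketch, assuming the legacy refined frame carries the same sentence_id and text fields and that the first emit is cached by sentence_id — normalize_event and the cache layout are illustrative, not the proxy's implementation:

```python
def normalize_event(event, first_emit_cache):
    """Upgrade a legacy `refined` event into a canonical transcription.

    New-shape events pass through unchanged; legacy events are merged
    onto the cached first emit so downstream code sees a single shape.
    """
    if event.get("type") != "refined":
        return event
    base = dict(first_emit_cache.get(event.get("sentence_id"), {}))
    base.update(event)
    base["type"] = "transcription"
    base["is_final"] = True
    base["speech_final"] = True
    return base
```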

Surface summary

  • REST — POST /stt, POST /v1/transcribe — context: string form field.
  • WebSocket — /v1/stream, /ws/stt — context: {general, text, terms} on start; two-phase canonical flow; llm_applied / llm_latency_ms / llm_reason on canonical.
  • JavaScript SDK — client.speechToText(audio, { context }).
  • Python SDK — client.speech_to_text(audio_file, context=...).
  • CLI — 60db stt:transcribe --context "...".
  • MCP Server — sixtydb_stt_transcribe accepts context.
  • Proxy — word-retention guardrail, hallucination fallback, first-emit caching, legacy refined shim.
  • Web UI — context input in Speech-to-Text page and realtime demo; segments rendered with tentative marker when flagged.

Environment knobs (server-side)

| Variable | Default | Purpose |
| --- | --- | --- |
| STT_LLM_ENABLED | true | Master kill-switch for refinement. |
| STT_LLM_MODEL | 60db-tiny | OpenAI-compatible model identifier. |
| STT_LLM_TIMEOUT_SEC | 10.0 | Per-call timeout — on timeout, the canonical falls back to the original text. |
| STT_LLM_MIN_WORDS | 4 | Skip refinement for tiny utterances ("Yeah", "Okay"). |
| STT_WS_HALLUCINATION_WPS | 8.0 | Word-rate ceiling; finals above it are flagged as hallucination_rejected. |
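Two of these knobs form a simple admission gate: the master switch and the minimum-word floor. A server-side sketch of that gate, assuming the documented variable names and defaults — should_refine is illustrative, not the shipped implementation:

```python
import os

def should_refine(text: str) -> bool:
    """Decide whether an utterance is eligible for LLM refinement.

    Mirrors STT_LLM_ENABLED (master kill-switch) and STT_LLM_MIN_WORDS
    (skip tiny utterances such as "Yeah" or "Okay").
    """
    if os.getenv("STT_LLM_ENABLED", "true").lower() != "true":
        return False
    min_words = int(os.getenv("STT_LLM_MIN_WORDS", "4"))
    return len(text.split()) >= min_words
```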