Added — Context-gated LLM refinement
Speech-to-Text now accepts an optional context hint that opens a server-side LLM refinement gate. When supplied, the response transcript is polished for proper-noun accuracy, filler removal, punctuation, and script consistency in mixed-language audio. The shape differs per transport:

| Transport | `context` shape | Enables |
|---|---|---|
| REST `POST /stt` (and `/v1/transcribe`) | plain string — free-form paragraph | LLM refinement of response text |
| WebSocket `/v1/stream` (and `/ws/stt`) | structured object `{general, text, terms}` | Two-phase canonical final flow |
REST — `context: string`

Free-form paragraph describing the session (domain, speakers, jargon). Serialized as a multipart form field.
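A minimal stdlib sketch of the REST call. The base URL, the `audio` file-field name, and the hand-rolled multipart encoding are illustrative assumptions (check the API reference for exact field names); only the plain-string `context` form field is taken from this changelog:

```python
import urllib.request
import uuid

def build_stt_request(base_url: str, context: str, audio_bytes: bytes) -> urllib.request.Request:
    """Build a POST /stt request with a free-form `context` form field."""
    boundary = uuid.uuid4().hex
    parts = []
    # `context` — plain string form field (free-form paragraph about the session).
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="context"\r\n\r\n{context}\r\n'.encode()
    )
    # `audio` — the file part (field name is an assumption, not from the changelog).
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="audio"; filename="audio.wav"\r\n'
        f"Content-Type: audio/wav\r\n\r\n".encode() + audio_bytes + b"\r\n"
    )
    body = b"".join(parts) + f"--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/stt",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

req = build_stt_request(
    "https://api.example.com",  # illustrative host
    "Cardiology telehealth call; drug names: metoprolol, apixaban.",
    b"\x00\x01",  # stand-in for real audio bytes
)
```

Sending it is then a plain `urllib.request.urlopen(req)` (or the equivalent `requests.post` with `files=` and `data=`).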
WebSocket — structured `{general, text, terms}`
Start message
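The structured context rides on the stream's start message. A hedged sketch of building one — only the `{general, text, terms}` shape is documented here; the `type: "start"` envelope field and the sample values are illustrative:

```python
import json

start_message = {
    "type": "start",  # assumed envelope field; check the WebSocket STT Reference
    "context": {
        # Free-form session description (domain, speakers).
        "general": "Customer-support call for a telecom provider, Hindi/English mix.",
        # Expected or preceding text, if any.
        "text": "Agent greets the caller and verifies the account number.",
        # Domain jargon / proper nouns to bias recognition toward.
        "terms": ["JioFiber", "eSIM", "OTT recharge"],
    },
}

# Serialize for the wire; send as the first message after connecting.
wire = json.dumps(start_message)
```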
Changed — WebSocket two-phase canonical flow
When `context` is supplied, each utterance now produces two transcription events sharing a `sentence_id`:

- First emit — `is_final: true, speech_final: false` — fast dict-corrected text. Use it for low-latency UI paint and voicebot barge-in / NLU.
- Canonical — `is_final: true, speech_final: true` — the definitive answer: either LLM-refined (when `llm_applied: true`) or the original text re-emitted (when the LLM was skipped or failed).
| Field | Type | Description |
|---|---|---|
| `llm_applied` | boolean | `true` if the LLM ran, `false` if skipped or failed. |
| `llm_latency_ms` | number | Round-trip time to the LLM endpoint (for SLA monitoring). |
| `llm_reason` | string | Diagnostic when `llm_applied: false` (`gate_closed`, `error:TimeoutException`, etc.). |
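One way to consume the two-phase flow client-side is to key both events on `sentence_id` and let the canonical overwrite the first emit. A minimal sketch — the field names come from the tables above, but the handler structure is illustrative, not an SDK API:

```python
class TwoPhaseReconciler:
    """Tracks first-emit vs canonical text per sentence_id."""

    def __init__(self):
        self.first_emits = {}  # sentence_id -> fast dict-corrected text
        self.finals = {}       # sentence_id -> canonical text

    def on_transcription(self, event: dict):
        sid = event["sentence_id"]
        if event.get("speech_final"):
            # Canonical: definitive answer, overwrites the first-emit paint.
            self.finals[sid] = event["text"]
        elif event.get("is_final"):
            # First emit: paint immediately; canonical will replace it.
            self.first_emits[sid] = event["text"]

    def display_text(self, sid: str) -> str:
        return self.finals.get(sid, self.first_emits.get(sid, ""))

r = TwoPhaseReconciler()
# Fast first emit (illustrative payloads):
r.on_transcription({"sentence_id": "s1", "is_final": True,
                    "speech_final": False, "text": "buk a tikt to dehli"})
# Canonical, LLM-refined:
r.on_transcription({"sentence_id": "s1", "is_final": True, "speech_final": True,
                    "llm_applied": True, "text": "Book a ticket to Delhi."})
```

Because the canonical event is emitted even when refinement is skipped, `display_text` always converges on exactly one definitive string per utterance.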
Clients that only need the definitive text can simply wait for `speech_final: true` and ignore the rest — one canonical event per utterance regardless of whether refinement is on. See the WebSocket STT Reference for the full table and reconciliation patterns.

Added — Word-preservation guardrails (proxy + UI)
To defend against over-aggressive LLM refinement and fast-speech hallucination rejection:

- Refined word-retention guardrail. If the canonical text drops more than ~60% of the first emit's token count, the proxy rolls back to the first-emit text and marks `llm_applied: false, llm_reason: "dropped_too_many_words"`. Tuned so legitimate polish (filler removal, 10–40% compression) passes through untouched.
- Hallucination-rejected fallback. When the upstream server's word-rate guard emits an empty final (`processing_mode: "hallucination_rejected"`, common on fast-paced Indic/English mixed audio), the proxy upgrades it into a tentative canonical using the cached first-emit / last-interim text plus `tentative: true` and `tentative_reason: "hallucination_rejected"`, so words never silently disappear.
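The retention rule can be sketched as a simple token-ratio check. The ~60% threshold comes from this changelog; the helper name, whitespace tokenization, and return shape are illustrative, not the proxy's internals:

```python
RETENTION_THRESHOLD = 0.40  # keep refinement only if >= ~40% of tokens survive

def apply_retention_guardrail(first_emit: str, refined: str) -> dict:
    """Roll back to the first-emit text if refinement dropped too many words."""
    first_tokens = len(first_emit.split())
    refined_tokens = len(refined.split())
    if first_tokens and refined_tokens / first_tokens < RETENTION_THRESHOLD:
        # Dropped more than ~60% of the words: reject the refinement.
        return {"text": first_emit, "llm_applied": False,
                "llm_reason": "dropped_too_many_words"}
    return {"text": refined, "llm_applied": True}

# Legitimate polish (filler removal, ~33% compression) passes through:
ok = apply_retention_guardrail("so um I want to uh book a cab",
                               "I want to book a cab")
# Over-aggressive refinement is rolled back to the first emit:
rolled = apply_retention_guardrail("please cancel my order number four five six",
                                   "Cancelled.")
```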
Both guardrails are implemented in the `/ws/stt` proxy, so voicebots and web clients inherit them automatically.

Added — Backward-compat refined event handling
Pre-migration upstream builds emit LLM refinement as a separate `refined` event rather than a second transcription. The proxy and web client transparently accept both shapes, so existing integrations keep working through the rollout. If you see `refined` events in the wire trace, upstream workers haven't been restarted onto the two-phase build yet — it's still fully functional; the `refined` event is deprecated but accepted.

Surface summary
- REST — `POST /stt`, `POST /v1/transcribe` — `context: string` form field.
- WebSocket — `/v1/stream`, `/ws/stt` — `context: {general, text, terms}` on start; two-phase canonical flow; `llm_applied` / `llm_latency_ms` / `llm_reason` on canonical.
- JavaScript SDK — `client.speechToText(audio, { context })`.
- Python SDK — `client.speech_to_text(audio_file, context=...)`.
- CLI — `60db stt:transcribe --context "..."`.
- MCP Server — `sixtydb_stt_transcribe` accepts `context`.
- Proxy — word-retention guardrail, hallucination fallback, first-emit caching, legacy `refined` shim.
- Web UI — context input in the Speech-to-Text page and realtime demo; segments rendered with a `tentative` marker when flagged.
Environment knobs (server-side)
| Variable | Default | Purpose |
|---|---|---|
| `STT_LLM_ENABLED` | `true` | Master kill-switch for refinement. |
| `STT_LLM_MODEL` | `60db-tiny` | OpenAI-compatible model identifier. |
| `STT_LLM_TIMEOUT_SEC` | `10.0` | Per-call timeout — on timeout, the canonical falls back to the original text. |
| `STT_LLM_MIN_WORDS` | `4` | Skip refinement for tiny utterances ("Yeah", "Okay"). |
| `STT_WS_HALLUCINATION_WPS` | `8.0` | Word-rate ceiling; finals above it are flagged `hallucination_rejected`. |