TTS WebSocket
Real-time text-to-speech synthesis over a full-duplex WebSocket stream: send text incrementally, receive audio chunks as they are synthesized.
Endpoint
ws://api.60db.ai/ws/tts
Authentication
Authenticate via query parameters: pass an apiKey, or a token (optionally scoped with a workspace_id).
Examples:
ws://api.60db.ai/ws/tts?apiKey=sk_live_your_api_key
ws://api.60db.ai/ws/tts?token=eyJ...&workspace_id=24
The WebSocket connection checks workspace wallet balance before starting a session. If the workspace has insufficient credits, the connection is closed with a 1008 status code and an INSUFFICIENT_CREDITS error.
Protocol Overview
Client Server
| |
|─── create_context ──────────────────▶ |
|◀── context_created ───────────────── |
| |
|─── send_text ───────────────────────▶ |
|─── send_text ───────────────────────▶ |
|─── flush_context ───────────────────▶ |
|◀── audio_chunk #1 ────────────────── |
|◀── audio_chunk #2 ────────────────── |
|◀── audio_chunk #N ────────────────── |
|◀── flush_completed ───────────────── |
| |
|─── close_context ───────────────────▶ |
|◀── context_closed ────────────────── |
| (connection closes) |
Connection Sequence
1. Connect
const ws = new WebSocket('ws://api.60db.ai/ws/tts?apiKey=sk_live_your_key');
2. Receive Authentication Message
{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}
3. Receive Connection Established
{
  "connection_established": {
    "service": "tts",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
Fields:
| Field | Description |
|---|---|
| service | Always "tts" for this endpoint |
| user_id | ID of the authenticated user |
| credit_balance | Current workspace credit balance |
| workspace | Name of the active workspace |
Client → Server Messages
1. create_context
Must be the first message. Initializes the TTS session with voice and audio settings.
{
  "create_context": {
    "context_id": "my-session-123",
    "voice_id": "7911a3e8",
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "sample_rate_hertz": 16000
    },
    "speed": 1,
    "stability": 50,
    "similarity": 75
  }
}
Parameters:
| Parameter | Required | Description |
|---|---|---|
| context_id | Yes | Client-chosen session identifier, echoed in all responses |
| voice_id | Yes | Voice to synthesize with |
| audio_config | No | Output encoding and sample rate (defaults to LINEAR16 at 16000 Hz) |
| speed | No | Playback speed, 0.5–2.0 (default 1) |
| stability | No | Voice stability, 0–100 (default 50) |
| similarity | No | Voice similarity, 0–100 (default 75) |
Supported encoding + sample rate combinations:
Not all combinations are valid. The table below shows which pairs are supported. Unsupported combinations silently fall back to LINEAR16 at 16000 Hz.
| audio_encoding | Supported sample_rate_hertz | Output format |
|---|---|---|
| LINEAR16 | 8000, 16000 (default), 24000, 48000 | Raw PCM, 16-bit signed little-endian, mono |
| PCM | 8000, 16000 (default), 24000, 48000 | Same as LINEAR16 |
| MULAW | 8000 | G.711 μ-law encoded, mono |
| ULAW | 8000 | Same as MULAW |
| OGG_OPUS | 24000 | Ogg Opus compressed audio |
Note: MULAW/ULAW only works at 8000 Hz. Using other sample rates with MULAW falls back to LINEAR16 @ 16kHz. Similarly, OGG_OPUS only works at 24000 Hz.
Limits:
| Parameter | Min | Max | Default | Behavior when out of range |
|---|---|---|---|---|
| speed | 0.5 | 2.0 | 1 | Silently clamped |
| stability | 0 | 100 | 50 | Silently clamped |
| similarity | 0 | 100 | 75 | Silently clamped |
| text (per send_text) | 1 char | — | — | Empty text is ignored |
| text buffer (accumulated) | — | 50,000 chars | — | Error returned if exceeded |
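Because out-of-range values are clamped silently rather than rejected, a client that wants its logs to match what was actually synthesized may want to clamp before sending. A minimal client-side sketch (the helper name is illustrative, not part of the API):

```python
def clamp_tts_params(speed=1.0, stability=50, similarity=75):
    """Mirror the server's silent clamping client-side so that the
    values you record match the values actually used for synthesis."""
    return {
        "speed": min(max(speed, 0.5), 2.0),
        "stability": min(max(stability, 0), 100),
        "similarity": min(max(similarity, 0), 100),
    }
```

For example, `clamp_tts_params(speed=3.0)` yields a speed of 2.0, matching the server's behavior.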
2. send_text
Append text to the internal buffer. Text is accumulated until a flush_context or close_context is received.
{
  "send_text": {
    "context_id": "my-session-123",
    "text": "Hello, how are you doing today?"
  }
}
Fields:
| Field | Description |
|---|---|
| context_id | Must match a previously created context |
| text | Text to append to the buffer; empty text is ignored |
You can send multiple send_text messages to build up text incrementally (e.g., from an LLM token stream):
{"send_text": {"context_id": "ctx-1", "text": "Hello, "}}
{"send_text": {"context_id": "ctx-1", "text": "how are you "}}
{"send_text": {"context_id": "ctx-1", "text": "doing today?"}}
Text chunking behavior:
Long text is automatically split into sentence-based chunks for reliable synthesis. The model works best with 3–30 second utterances. The server handles:
- Sentence boundary detection for natural chunk splits
- Newline characters (\n) treated as hard paragraph boundaries
- Per-paragraph chunking of mixed-language text (e.g., English + Hindi) to prevent early EOS
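The server's exact chunking algorithm is not exposed, but the behavior above can be approximated client-side, e.g., to estimate how many chunks a text will produce. A rough sketch (illustrative only; `max_chars` is an assumed knob, not an API parameter):

```python
import re

def chunk_text(text, max_chars=280):
    """Approximate sentence-based chunking: newlines are hard paragraph
    boundaries; within a paragraph, sentences are split on terminal
    punctuation and greedily packed up to max_chars per chunk."""
    chunks = []
    for paragraph in text.split("\n"):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        current = ""
        for sentence in sentences:
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

A newline always starts a new chunk, so `chunk_text("Hello there. How are you?\nNew paragraph.")` yields two chunks, one per paragraph.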
3. flush_context
Triggers synthesis of all accumulated text. The server responds with audio_chunk messages followed by flush_completed (only on success — if synthesis fails, an error message is sent instead with no flush_completed).
{
  "flush_context": {
    "context_id": "my-session-123"
  }
}
4. close_context
Flushes any remaining text, sends final audio, and closes the WebSocket connection.
{
  "close_context": {
    "context_id": "my-session-123"
  }
}
Server → Client Messages
context_created
Confirms the session was initialized successfully.
{
  "context_created": {
    "context_id": "my-session-123"
  }
}
audio_chunk
Contains a chunk of synthesized audio. Multiple chunks are sent per flush. Each chunk is streamed as soon as it’s decoded for minimum latency.
{
  "audio_chunk": {
    "context_id": "my-session-123",
    "audioContent": "SGVsbG8gd29ybGQ..."
  }
}
Fields:
| Field | Description |
|---|---|
| context_id | The context that produced this audio |
| audioContent | Base64-encoded audio bytes |
The audio encoding and chunk format depend on audio_config:
| Encoding | Chunk format | Notes |
|---|---|---|
| LINEAR16 / PCM | Raw PCM, 16-bit signed LE, mono | Chunks can be concatenated directly |
| MULAW / ULAW | G.711 μ-law, 8-bit, mono | Chunks can be concatenated directly |
| OGG_OPUS | Independent Ogg Opus files | Each chunk is a self-contained OGG file; chunks cannot be naively concatenated. Decode each independently, or use LINEAR16 for a concatenatable stream |
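Concatenated LINEAR16 chunks are raw, headerless PCM, so saving them to a playable file means prepending a container header. A minimal sketch that wraps the joined chunks in a standard 44-byte RIFF/WAV header (generic WAV layout, not an API feature; the function name is illustrative):

```python
import struct

def pcm_chunks_to_wav(chunks, sample_rate=16000):
    """Wrap decoded LINEAR16 audio_chunk payloads in a minimal WAV header
    (mono, 16-bit PCM) so the result plays in any audio player."""
    pcm = b"".join(chunks)
    byte_rate = sample_rate * 2          # mono * 2 bytes per sample
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",
        b"fmt ", 16, 1, 1,               # fmt chunk: PCM format, 1 channel
        sample_rate, byte_rate, 2, 16,   # block align 2, 16 bits per sample
        b"data", len(pcm),
    )
    return header + pcm
```

Decode each `audioContent` field with base64 first, collect the resulting byte strings, and pass them in arrival order.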
flush_completed
Signals that all audio for the flushed text has been sent. Only sent on successful synthesis — if synthesis fails, an error is sent instead.
{
  "flush_completed": {
    "context_id": "my-session-123"
  }
}
context_closed
Confirms the session is closed. The WebSocket connection closes after this message.
{
  "context_closed": {
    "context_id": "my-session-123"
  }
}
error
Sent if synthesis fails or a protocol violation occurs.
{
  "error": {
    "context_id": "my-session-123",
    "message": "voice_id required"
  }
}
Common errors:
| Message | Cause |
|---|---|
| voice_id required | create_context sent without voice_id |
| text_buffer exceeded 50000 character limit | Too much text accumulated without flushing |
| Unsupported audio_encoding: X | Invalid encoding value |
| Unsupported sample_rate_hertz: X | Invalid sample rate |
Complete Example
Real-time Playback (Browser)
For low-latency playback as chunks arrive (instead of waiting for all chunks), use the Web Audio API with scheduled AudioBufferSourceNode:
let audioCtx;
let nextPlayTime = 0;

function onAudioChunk(base64Audio, sampleRate) {
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);

  // Decode PCM int16 LE → Float32
  const int16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;

  if (!audioCtx) audioCtx = new AudioContext({ sampleRate });
  const buf = audioCtx.createBuffer(1, float32.length, sampleRate);
  buf.getChannelData(0).set(float32);

  const source = audioCtx.createBufferSource();
  source.buffer = buf;
  source.connect(audioCtx.destination);

  const now = audioCtx.currentTime;
  // First chunk: 150ms pre-buffer to absorb network jitter
  // Late chunks: schedule 20ms ahead for minimal gap
  if (nextPlayTime <= now) {
    nextPlayTime = now + (nextPlayTime === 0 ? 0.15 : 0.02);
  }
  source.start(nextPlayTime);
  nextPlayTime += float32.length / sampleRate;
}
LLM Integration Pattern
When streaming tokens from an LLM into TTS:
import json
import base64
import websockets

async def llm_to_tts(llm_stream, voice_id):
    API_KEY = "sk_live_your_key"
    url = f"ws://api.60db.ai/ws/tts?apiKey={API_KEY}"
    async with websockets.connect(url) as ws:
        # Drain the handshake: "Authenticating..." then connection_established
        await ws.recv()
        await ws.recv()

        # Create context
        await ws.send(json.dumps({
            "create_context": {
                "context_id": "llm-session",
                "voice_id": voice_id,
                "audio_config": {"audio_encoding": "MULAW", "sample_rate_hertz": 8000}
            }
        }))
        await ws.recv()  # context_created

        # Stream LLM tokens as text chunks
        async for token in llm_stream:
            await ws.send(json.dumps({
                "send_text": {"context_id": "llm-session", "text": token}
            }))

        # Flush + close when the LLM is done
        await ws.send(json.dumps({
            "flush_context": {"context_id": "llm-session"}
        }))

        audio = b""
        while True:
            msg = json.loads(await ws.recv())
            if "audio_chunk" in msg:
                audio += base64.b64decode(msg["audio_chunk"]["audioContent"])
            elif "flush_completed" in msg:
                break
            elif "error" in msg:
                raise RuntimeError(msg["error"]["message"])

        await ws.send(json.dumps({
            "close_context": {"context_id": "llm-session"}
        }))
        await ws.recv()  # context_closed
        return audio
| Encoding | Format | Chunk behavior | Best for |
|---|---|---|---|
| LINEAR16 | Raw PCM, 16-bit signed LE, mono | Concatenatable | General purpose, highest quality |
| MULAW | G.711 μ-law, 8 kHz, mono | Concatenatable | Telephony (Twilio, SIP) |
| OGG_OPUS | Ogg Opus compressed, 24 kHz | Not concatenatable; each chunk is a standalone OGG file | Web playback, bandwidth-constrained |
For telephony integration (Twilio, etc.), use MULAW at 8000 Hz:
"audio_config": {
  "audio_encoding": "MULAW",
  "sample_rate_hertz": 8000
}
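On the consuming side (e.g., converting a recorded telephony stream back to linear PCM for analysis), MULAW chunks can be expanded with the standard G.711 μ-law decode. A self-contained sketch (this decoder is generic G.711, not something this API provides):

```python
def ulaw_to_pcm16(data: bytes) -> list[int]:
    """Expand G.711 mu-law bytes to 16-bit signed PCM samples."""
    BIAS = 0x84  # 132, the standard mu-law bias
    samples = []
    for u in data:
        u = ~u & 0xFF                    # mu-law bytes are stored inverted
        t = (((u & 0x0F) << 3) + BIAS) << ((u & 0x70) >> 4)
        samples.append((BIAS - t) if (u & 0x80) else (t - BIAS))
    return samples
```

0xFF is μ-law silence and decodes to 0; the full-scale bytes 0x00 and 0x80 decode to -32124 and +32124 respectively.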
For web playback with low bandwidth, use OGG_OPUS at 24000 Hz:
"audio_config": {
  "audio_encoding": "OGG_OPUS",
  "sample_rate_hertz": 24000
}
Important: OGG_OPUS chunks are individually wrapped OGG files. To merge for download, decode each chunk independently (e.g., via AudioContext.decodeAudioData()) and concatenate the PCM output. Do not concatenate raw OGG bytes.
Supported Languages
The TTS model supports synthesis in multiple Indic languages and English. The language is auto-detected from the input text — no explicit language parameter is needed.
| Language | ID |
|---|---|
| English | en |
| Hindi | hi |
| Bengali | bn |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Marathi | mr |
| Punjabi | pa |
| Tamil | ta |
| Telugu | te |
| Assamese | as |
| Odia | or |
Mixed-language text is supported. Use newlines (\n) to separate paragraphs in different languages for best results.
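As a concrete sketch, a send_text payload carrying an English and a Hindi paragraph separated by a newline could be built like this (the helper name is illustrative):

```python
import json

def mixed_language_message(paragraphs, context_id):
    """Join paragraphs with newlines so each language gets its own
    hard paragraph boundary during server-side chunking."""
    return json.dumps({
        "send_text": {"context_id": context_id, "text": "\n".join(paragraphs)}
    }, ensure_ascii=False)
```

For example, `mixed_language_message(["Hello, welcome.", "नमस्ते।"], "ctx-1")` produces a single send_text message whose text field contains both paragraphs separated by \n.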
Default Voice
The default voice ID is:
fbb75ed2-975a-40c7-9e06-38e30524a9a1
To get more voices, use the Voices API.
Context Management
Reuse Context
Keep a context open for multiple syntheses:
// Create once
ws.send(JSON.stringify({
create_context: { context_id, voice_id, audio_config }
}));
// Send multiple texts
ws.send(JSON.stringify({ send_text: { context_id, text: "Hello" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));
ws.send(JSON.stringify({ send_text: { context_id, text: "World" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));
// Close when done
ws.send(JSON.stringify({ close_context: { context_id } }));
Multiple Contexts
You can create multiple contexts in one connection:
const context1 = 'ctx-1';
const context2 = 'ctx-2';
// Create both contexts
ws.send(JSON.stringify({
create_context: {
context_id: context1,
voice_id: voice1,
audio_config
}
}));
ws.send(JSON.stringify({
create_context: {
context_id: context2,
voice_id: voice2,
audio_config
}
}));
Pricing
- Rate: $0.00002 per character
- Minimum: $0.01 per context
- Billing: Per character synthesized
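Under these rates, the per-context charge is the character cost floored at the minimum. A sketch based on the numbers above (function name is illustrative):

```python
def estimate_context_cost(characters: int) -> float:
    """Estimated charge for one context: $0.00002 per character,
    with a $0.01 per-context minimum."""
    return max(0.01, characters * 0.00002)
```

For example, 1,000 characters costs about $0.02, while anything at or under 500 characters hits the $0.01 minimum.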
Error Codes
| Code | Description |
|---|---|
| 1008 | Authentication failed |
| 1008 | Insufficient credits |
| 1011 | Voice not found |
| 1011 | Invalid audio config |
| 1006 | Connection lost |
Testing
# Install wscat
npm install -g wscat
# Test TTS
wscat -c "ws://api.60db.ai/ws/tts?apiKey=sk_live_your_key"
Then send:
{"create_context":{"context_id":"test-123","voice_id":"fbb75ed2-975a-40c7-9e06-38e30524a9a1","audio_config":{"audio_encoding":"LINEAR16","sample_rate_hertz":16000}}}
{"send_text":{"context_id":"test-123","text":"Hello"}}
{"flush_context":{"context_id":"test-123"}}