TTS WebSocket
Real-time text-to-speech synthesis over a full-duplex WebSocket stream: send text incrementally, receive audio chunks as they are synthesized.
Endpoint
ws://api.60db.ai/ws/tts
Authentication
Authenticate via query parameters: pass an apiKey, or a token (optionally scoped with a workspace_id).
Examples:
ws://api.60db.ai/ws/tts?apiKey=sk_live_your_api_key
ws://api.60db.ai/ws/tts?token=eyJ...&workspace_id=24
The WebSocket connection checks workspace wallet balance before starting a session. If the workspace has insufficient credits, the connection is closed with a 1008 status code and an INSUFFICIENT_CREDITS error.
Protocol Overview
Client Server
| |
|─── create_context ──────────────────▶ |
|◀── context_created ───────────────── |
| |
|─── send_text ───────────────────────▶ |
|─── send_text ───────────────────────▶ |
|─── flush_context ───────────────────▶ |
|◀── audio_chunk #1 ────────────────── |
|◀── audio_chunk #2 ────────────────── |
|◀── audio_chunk #N ────────────────── |
|◀── flush_completed ───────────────── |
| |
|─── close_context ───────────────────▶ |
|◀── context_closed ────────────────── |
| (connection closes) |
Connection Sequence
1. Connect
const ws = new WebSocket('ws://api.60db.ai/ws/tts?apiKey=sk_live_your_key');
2. Receive Authentication Message
{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}
3. Receive Connection Established
{
  "connection_established": {
    "service": "tts",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
Fields:
| Field | Description |
|---|---|
| service | Always "tts" for this endpoint |
| user_id | ID of the authenticated user |
| credit_balance | Current workspace credit balance |
| workspace | Name of the active workspace |
Client → Server Messages
1. create_context
Must be the first message. Initializes the TTS session with voice and audio settings.
{
  "create_context": {
    "context_id": "my-session-123",
    "voice_id": "7911a3e8",
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "sample_rate_hertz": 16000
    },
    "speed": 1,
    "stability": 50,
    "similarity": 75
  }
}
Parameters:
| Parameter | Required | Description |
|---|---|---|
| context_id | Yes | Client-chosen session identifier, echoed in all responses |
| voice_id | Yes | Voice to synthesize with |
| audio_config | No | Output encoding and sample rate (defaults to LINEAR16 at 16000 Hz) |
| speed | No | Playback speed, 0.5–2.0 (default 1) |
| stability | No | Voice stability, 0–100 (default 50) |
| similarity | No | Voice similarity, 0–100 (default 75) |
Supported encoding + sample rate combinations:
Not all combinations are valid. The table below shows which pairs are supported. Unsupported combinations silently fall back to LINEAR16 at 16000 Hz.
| audio_encoding | Supported sample_rate_hertz | Output format |
|---|---|---|
| LINEAR16 | 8000, 16000 (default), 24000, 48000 | Raw PCM, 16-bit signed little-endian, mono |
| PCM | 8000, 16000 (default), 24000, 48000 | Same as LINEAR16 |
| MULAW | 8000 | G.711 μ-law encoded, mono |
| ULAW | 8000 | Same as MULAW |
| OGG_OPUS | 24000 | Ogg Opus compressed audio |
Note: MULAW/ULAW only works at 8000 Hz. Using other sample rates with MULAW falls back to LINEAR16 @ 16kHz. Similarly, OGG_OPUS only works at 24000 Hz.
Limits:
| Parameter | Min | Max | Default | Behavior when out of range |
|---|---|---|---|---|
| speed | 0.5 | 2.0 | 1 | Silently clamped |
| stability | 0 | 100 | 50 | Silently clamped |
| similarity | 0 | 100 | 75 | Silently clamped |
| text (per send_text) | 1 char | — | — | Empty text is ignored |
| text buffer (accumulated) | — | 50,000 chars | — | Error returned if exceeded |
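Because out-of-range values are clamped silently rather than rejected, a client that wants its logs to match what was actually synthesized may want to clamp before sending. A minimal client-side sketch (the helper name is illustrative, not part of the API):

```python
def clamp_tts_params(speed=1.0, stability=50, similarity=75):
    """Mirror the server's silent clamping client-side so that the
    values you record match the values actually used for synthesis."""
    return {
        "speed": min(max(speed, 0.5), 2.0),
        "stability": min(max(stability, 0), 100),
        "similarity": min(max(similarity, 0), 100),
    }
```

For example, `clamp_tts_params(speed=3.0)` yields a speed of 2.0, matching the server's behavior.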
2. send_text
Append text to the internal buffer. Text is accumulated until a flush_context or close_context is received.
{
  "send_text": {
    "context_id": "my-session-123",
    "text": "Hello, how are you doing today?"
  }
}
Fields:
| Field | Description |
|---|---|
| context_id | Must match a previously created context |
| text | Text to append to the buffer; empty text is ignored |
You can send multiple send_text messages to build up text incrementally (e.g., from an LLM token stream):
{"send_text": {"context_id": "ctx-1", "text": "Hello, "}}
{"send_text": {"context_id": "ctx-1", "text": "how are you "}}
{"send_text": {"context_id": "ctx-1", "text": "doing today?"}}
Text chunking behavior:
Long text is automatically split into sentence-based chunks for reliable synthesis. The model works best with 3–30 second utterances. The server handles:
- Sentence boundary detection for natural chunk splits
- Newline characters (\n) treated as hard paragraph boundaries
- Per-paragraph chunking of mixed-language text (e.g., English + Hindi) to prevent early EOS
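The server's exact chunking algorithm is not exposed, but the behavior above can be approximated client-side, e.g., to estimate how many chunks a text will produce. A rough sketch (illustrative only; `max_chars` is an assumed knob, not an API parameter):

```python
import re

def chunk_text(text, max_chars=280):
    """Approximate sentence-based chunking: newlines are hard paragraph
    boundaries; within a paragraph, sentences are split on terminal
    punctuation and greedily packed up to max_chars per chunk."""
    chunks = []
    for paragraph in text.split("\n"):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        current = ""
        for sentence in sentences:
            if current and len(current) + 1 + len(sentence) > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            chunks.append(current)
    return chunks
```

A newline always starts a new chunk, so `chunk_text("Hello there. How are you?\nNew paragraph.")` yields two chunks, one per paragraph.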
3. flush_context
Triggers synthesis of all accumulated text. The server responds with audio_chunk messages followed by flush_completed (only on success — if synthesis fails, an error message is sent instead with no flush_completed).
{
  "flush_context": {
    "context_id": "my-session-123"
  }
}
4. close_context
Flushes any remaining text, sends final audio, and closes the WebSocket connection.
{
  "close_context": {
    "context_id": "my-session-123"
  }
}
Server → Client Messages
context_created
Confirms the session was initialized successfully.
{
  "context_created": {
    "context_id": "my-session-123"
  }
}
audio_chunk
Contains a chunk of synthesized audio. Multiple chunks are sent per flush. Each chunk is streamed as soon as it’s decoded for minimum latency.
{
  "audio_chunk": {
    "context_id": "my-session-123",
    "audioContent": "SGVsbG8gd29ybGQ..."
  }
}
Fields:
| Field | Description |
|---|---|
| context_id | The context that produced this audio |
| audioContent | Base64-encoded audio bytes |
The audio encoding and chunk format depend on audio_config:
| Encoding | Chunk format | Notes |
|---|---|---|
| LINEAR16 / PCM | Raw PCM, 16-bit signed LE, mono | Chunks can be concatenated directly |
| MULAW / ULAW | G.711 μ-law, 8-bit, mono | Chunks can be concatenated directly |
| OGG_OPUS | Independent Ogg Opus files | Each chunk is a self-contained OGG file; chunks cannot be naively concatenated. Decode each independently, or use LINEAR16 for a concatenatable stream |
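Concatenated LINEAR16 chunks are raw, headerless PCM, so saving them to a playable file means prepending a container header. A minimal sketch that wraps the joined chunks in a standard 44-byte RIFF/WAV header (generic WAV layout, not an API feature; the function name is illustrative):

```python
import struct

def pcm_chunks_to_wav(chunks, sample_rate=16000):
    """Wrap decoded LINEAR16 audio_chunk payloads in a minimal WAV header
    (mono, 16-bit PCM) so the result plays in any audio player."""
    pcm = b"".join(chunks)
    byte_rate = sample_rate * 2          # mono * 2 bytes per sample
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",
        b"fmt ", 16, 1, 1,               # fmt chunk: PCM format, 1 channel
        sample_rate, byte_rate, 2, 16,   # block align 2, 16 bits per sample
        b"data", len(pcm),
    )
    return header + pcm
```

Decode each `audioContent` field with base64 first, collect the resulting byte strings, and pass them in arrival order.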
flush_completed
Signals that all audio for the flushed text has been sent. Only sent on successful synthesis — if synthesis fails, an error is sent instead.
{
  "flush_completed": {
    "context_id": "my-session-123"
  }
}
context_closed
Confirms the session is closed. The WebSocket connection closes after this message.
{
  "context_closed": {
    "context_id": "my-session-123"
  }
}
error
Sent if synthesis fails or a protocol violation occurs.
{
  "error": {
    "context_id": "my-session-123",
    "message": "voice_id required"
  }
}
Common errors:
| Message | Cause |
|---|---|
| voice_id required | create_context sent without voice_id |
| text_buffer exceeded 50000 character limit | Too much text accumulated without flushing |
| Unsupported audio_encoding: X | Invalid encoding value |
| Unsupported sample_rate_hertz: X | Invalid sample rate |
Complete Example
Real-time Playback (Browser)
For low-latency playback as chunks arrive (instead of waiting for all chunks), use the Web Audio API with scheduled AudioBufferSourceNode:
let audioCtx;
let nextPlayTime = 0;

function onAudioChunk(base64Audio, sampleRate) {
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);

  // Decode PCM int16 LE → Float32
  const int16 = new Int16Array(bytes.buffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;

  if (!audioCtx) audioCtx = new AudioContext({ sampleRate });
  const buf = audioCtx.createBuffer(1, float32.length, sampleRate);
  buf.getChannelData(0).set(float32);

  const source = audioCtx.createBufferSource();
  source.buffer = buf;
  source.connect(audioCtx.destination);

  const now = audioCtx.currentTime;
  // First chunk: 150ms pre-buffer to absorb network jitter
  // Late chunks: schedule 20ms ahead for minimal gap
  if (nextPlayTime <= now) {
    nextPlayTime = now + (nextPlayTime === 0 ? 0.15 : 0.02);
  }
  source.start(nextPlayTime);
  nextPlayTime += float32.length / sampleRate;
}
LLM Integration Pattern
When streaming tokens from an LLM into TTS:
import json
import base64
import websockets

async def llm_to_tts(llm_stream, voice_id):
    API_KEY = "sk_live_your_key"
    url = f"ws://api.60db.ai/ws/tts?apiKey={API_KEY}"
    async with websockets.connect(url) as ws:
        # Drain the handshake: "Authenticating..." then connection_established
        await ws.recv()
        await ws.recv()

        # Create context
        await ws.send(json.dumps({
            "create_context": {
                "context_id": "llm-session",
                "voice_id": voice_id,
                "audio_config": {"audio_encoding": "MULAW", "sample_rate_hertz": 8000}
            }
        }))
        await ws.recv()  # context_created

        # Stream LLM tokens as text chunks
        async for token in llm_stream:
            await ws.send(json.dumps({
                "send_text": {"context_id": "llm-session", "text": token}
            }))

        # Flush + close when the LLM is done
        await ws.send(json.dumps({
            "flush_context": {"context_id": "llm-session"}
        }))

        audio = b""
        while True:
            msg = json.loads(await ws.recv())
            if "audio_chunk" in msg:
                audio += base64.b64decode(msg["audio_chunk"]["audioContent"])
            elif "flush_completed" in msg:
                break
            elif "error" in msg:
                raise RuntimeError(msg["error"]["message"])

        await ws.send(json.dumps({
            "close_context": {"context_id": "llm-session"}
        }))
        await ws.recv()  # context_closed
        return audio
| Encoding | Format | Chunk behavior | Best for |
|---|---|---|---|
| LINEAR16 | Raw PCM, 16-bit signed LE, mono | Concatenatable | General purpose, highest quality |
| MULAW | G.711 μ-law, 8 kHz, mono | Concatenatable | Telephony (Twilio, SIP) |
| OGG_OPUS | Ogg Opus compressed, 24 kHz | Not concatenatable; each chunk is a standalone OGG file | Web playback, bandwidth-constrained |
For telephony integration (Twilio, etc.), use MULAW at 8000 Hz:
"audio_config": {
  "audio_encoding": "MULAW",
  "sample_rate_hertz": 8000
}
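On the consuming side (e.g., converting a recorded telephony stream back to linear PCM for analysis), MULAW chunks can be expanded with the standard G.711 μ-law decode. A self-contained sketch (this decoder is generic G.711, not something this API provides):

```python
def ulaw_to_pcm16(data: bytes) -> list[int]:
    """Expand G.711 mu-law bytes to 16-bit signed PCM samples."""
    BIAS = 0x84  # 132, the standard mu-law bias
    samples = []
    for u in data:
        u = ~u & 0xFF                    # mu-law bytes are stored inverted
        t = (((u & 0x0F) << 3) + BIAS) << ((u & 0x70) >> 4)
        samples.append((BIAS - t) if (u & 0x80) else (t - BIAS))
    return samples
```

0xFF is μ-law silence and decodes to 0; the full-scale bytes 0x00 and 0x80 decode to -32124 and +32124 respectively.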
For web playback with low bandwidth, use OGG_OPUS at 24000 Hz:
"audio_config": {
  "audio_encoding": "OGG_OPUS",
  "sample_rate_hertz": 24000
}
Important: OGG_OPUS chunks are individually wrapped OGG files. To merge for download, decode each chunk independently (e.g., via AudioContext.decodeAudioData()) and concatenate the PCM output. Do not concatenate raw OGG bytes.
Supported Languages
The TTS model supports synthesis in multiple Indic languages and English. The language is auto-detected from the input text — no explicit language parameter is needed.
| Language | ID |
|---|---|
| English | en |
| Hindi | hi |
| Bengali | bn |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Marathi | mr |
| Punjabi | pa |
| Tamil | ta |
| Telugu | te |
| Assamese | as |
| Odia | or |
Mixed-language text is supported. Use newlines (\n) to separate paragraphs in different languages for best results.
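As a concrete sketch, a send_text payload carrying an English and a Hindi paragraph separated by a newline could be built like this (the helper name is illustrative):

```python
import json

def mixed_language_message(paragraphs, context_id):
    """Join paragraphs with newlines so each language gets its own
    hard paragraph boundary during server-side chunking."""
    return json.dumps({
        "send_text": {"context_id": context_id, "text": "\n".join(paragraphs)}
    }, ensure_ascii=False)
```

For example, `mixed_language_message(["Hello, welcome.", "नमस्ते।"], "ctx-1")` produces a single send_text message whose text field contains both paragraphs separated by \n.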
Default Voice
The default voice ID is:
fbb75ed2-975a-40c7-9e06-38e30524a9a1
To get more voices, use the Voices API.
Context Management
Reuse Context
Keep a context open for multiple syntheses:
// Create once
ws.send(JSON.stringify({
create_context: { context_id, voice_id, audio_config }
}));
// Send multiple texts
ws.send(JSON.stringify({ send_text: { context_id, text: "Hello" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));
ws.send(JSON.stringify({ send_text: { context_id, text: "World" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));
// Close when done
ws.send(JSON.stringify({ close_context: { context_id } }));
Multiple Contexts
You can create multiple contexts in one connection:
const context1 = 'ctx-1';
const context2 = 'ctx-2';
// Create both contexts
ws.send(JSON.stringify({
create_context: {
context_id: context1,
voice_id: voice1,
audio_config
}
}));
ws.send(JSON.stringify({
create_context: {
context_id: context2,
voice_id: voice2,
audio_config
}
}));
Pricing
- Rate: $0.00002 per character
- Minimum: $0.01 per context
- Billing: Per character synthesized
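Under these rates, the per-context charge is the character cost floored at the minimum. A sketch based on the numbers above (function name is illustrative):

```python
def estimate_context_cost(characters: int) -> float:
    """Estimated charge for one context: $0.00002 per character,
    with a $0.01 per-context minimum."""
    return max(0.01, characters * 0.00002)
```

For example, 1,000 characters costs about $0.02, while anything at or under 500 characters hits the $0.01 minimum.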
Error Codes
| Code | Description |
|---|---|
| 1008 | Authentication failed |
| 1008 | Insufficient credits |
| 1011 | Voice not found |
| 1011 | Invalid audio config |
| 1006 | Connection lost |
Testing
# Install wscat
npm install -g wscat
# Test TTS
wscat -c "ws://api.60db.ai/ws/tts?apiKey=sk_live_your_key"
Then send:
{"create_context":{"context_id":"test-123","voice_id":"fbb75ed2-975a-40c7-9e06-38e30524a9a1","audio_config":{"audio_encoding":"LINEAR16","sample_rate_hertz":16000}}}
{"send_text":{"context_id":"test-123","text":"Hello"}}
{"flush_context":{"context_id":"test-123"}}