
TTS WebSocket

Real-time Text-to-Speech synthesis via full-duplex WebSocket streaming.

Endpoint

ws://api.60db.ai/ws/tts

Authentication

Authenticate with query parameters: pass an API key via apiKey, or a token via token (optionally with workspace_id). Examples:
ws://api.60db.ai/ws/tts?apiKey=sk_live_your_api_key
ws://api.60db.ai/ws/tts?token=eyJ...&workspace_id=24
The WebSocket connection checks workspace wallet balance before starting a session. If the workspace has insufficient credits, the connection is closed with a 1008 status code and an INSUFFICIENT_CREDITS error.

Protocol Overview

Client                                  Server
  |                                       |
  |─── create_context ──────────────────▶ |
  |◀── context_created ─────────────────  |
  |                                       |
  |─── send_text ───────────────────────▶ |
  |─── send_text ───────────────────────▶ |
  |─── flush_context ───────────────────▶ |
  |◀── audio_chunk #1 ──────────────────  |
  |◀── audio_chunk #2 ──────────────────  |
  |◀── audio_chunk #N ──────────────────  |
  |◀── flush_completed ─────────────────  |
  |                                       |
  |─── close_context ───────────────────▶ |
  |◀── context_closed ──────────────────  |
  |          (connection closes)          |

Connection Sequence

1. Connect

const ws = new WebSocket('ws://api.60db.ai/ws/tts?apiKey=sk_live_your_key');

2. Receive Authentication Message

{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}

3. Receive Connection Established

{
  "connection_established": {
    "service": "tts",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
Fields:

| Field | Type | Description |
| --- | --- | --- |
| service | string | Service name: "tts" |
| user_id | integer | Your user ID |
| credit_balance | number | Available credits |
| workspace | string | Workspace name |

Client → Server Messages

1. create_context

Must be the first message. Initializes the TTS session with voice and audio settings.
{
  "create_context": {
    "context_id": "my-session-123",
    "voice_id": "7911a3e8",
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "sample_rate_hertz": 16000
    },
    "speed": 1,
    "stability": 50,
    "similarity": 75
  }
}
Parameters: Not all encoding + sample rate combinations are valid. The table below shows the supported pairs; unsupported combinations silently fall back to LINEAR16 at 16000 Hz.

| audio_encoding | Supported sample_rate_hertz | Output format |
| --- | --- | --- |
| LINEAR16 | 8000, 16000 (default), 24000, 48000 | Raw PCM, 16-bit signed little-endian, mono |
| PCM | 8000, 16000 (default), 24000, 48000 | Same as LINEAR16 |
| MULAW | 8000 | G.711 μ-law encoded, mono |
| ULAW | 8000 | Same as MULAW |
| OGG_OPUS | 24000 | Ogg Opus compressed audio |

Note: MULAW/ULAW only work at 8000 Hz; using other sample rates with MULAW falls back to LINEAR16 at 16 kHz. Similarly, OGG_OPUS only works at 24000 Hz.
Limits:

| Parameter | Min | Max | Default | Behavior when out of range |
| --- | --- | --- | --- | --- |
| speed | 0.5 | 2.0 | 1 | Silently clamped |
| stability | 0 | 100 | 50 | Silently clamped |
| similarity | 0 | 100 | 75 | Silently clamped |
| text (per send_text) | 1 char | | | Empty text is ignored |
| text buffer (accumulated) | | 50,000 chars | | Error returned if exceeded |
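The clamping behavior above can be mirrored client-side so out-of-range values are visible before they are sent. A minimal sketch (the helper name is illustrative, not part of the API):

```javascript
// Clamp synthesis parameters to the documented ranges before building
// a create_context message. Helper name is illustrative, not part of the API.
function clampSynthesisParams({ speed = 1, stability = 50, similarity = 75 } = {}) {
  const clamp = (value, min, max) => Math.min(max, Math.max(min, value));
  return {
    speed: clamp(speed, 0.5, 2.0),       // docs: 0.5 to 2.0, default 1
    stability: clamp(stability, 0, 100), // docs: 0 to 100, default 50
    similarity: clamp(similarity, 0, 100)
  };
}
```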

2. send_text

Append text to the internal buffer. Text is accumulated until a flush_context or close_context is received.
{
  "send_text": {
    "context_id": "my-session-123",
    "text": "Hello, how are you doing today?"
  }
}
Fields:

| Field | Type | Description |
| --- | --- | --- |
| context_id | string | Session identifier |
| text | string | Text to append to the buffer |

You can send multiple send_text messages to build up text incrementally (e.g., from an LLM token stream):
{"send_text": {"context_id": "ctx-1", "text": "Hello, "}}
{"send_text": {"context_id": "ctx-1", "text": "how are you "}}
{"send_text": {"context_id": "ctx-1", "text": "doing today?"}}
Text chunking behavior: Long text is automatically split into sentence-based chunks for reliable synthesis; the model works best with 3–30 second utterances. The server handles:
  • Sentence boundary detection for natural chunk splits
  • Newline characters (\n) as hard paragraph boundaries
  • Mixed-language text (e.g., English + Hindi), chunked per paragraph to prevent early EOS
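One way to stay under the 50,000-character buffer limit when streaming tokens is to track the accumulated length client-side and flush proactively. A sketch (makeTextSender and flushThreshold are hypothetical, not part of the API):

```javascript
// Track how much text has been buffered since the last flush and send a
// flush_context automatically before the documented 50,000-char cap.
function makeTextSender(ws, contextId, flushThreshold = 50000) {
  let buffered = 0;
  return {
    send(text) {
      if (!text) return; // empty text is ignored by the server anyway
      if (buffered + text.length > flushThreshold) {
        ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
        buffered = 0;
      }
      ws.send(JSON.stringify({ send_text: { context_id: contextId, text } }));
      buffered += text.length;
    }
  };
}
```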

3. flush_context

Triggers synthesis of all accumulated text. The server responds with audio_chunk messages followed by flush_completed (only on success — if synthesis fails, an error message is sent instead with no flush_completed).
{
  "flush_context": {
    "context_id": "my-session-123"
  }
}
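In an event-driven client it can be convenient to wrap the flush round trip in a promise that resolves on flush_completed and rejects on error, matching the message shapes documented here. A sketch (waitForFlush is a hypothetical helper):

```javascript
// Send flush_context and resolve when flush_completed arrives for this
// context; reject if the server reports an error instead. Other messages
// (e.g. audio_chunk) are forwarded to the existing onmessage handler.
function waitForFlush(ws, contextId) {
  return new Promise((resolve, reject) => {
    const previous = ws.onmessage;
    ws.onmessage = (event) => {
      const msg = JSON.parse(event.data);
      if (msg.flush_completed && msg.flush_completed.context_id === contextId) {
        ws.onmessage = previous;
        resolve();
      } else if (msg.error) {
        ws.onmessage = previous;
        reject(new Error(msg.error.message));
      } else if (previous) {
        previous(event);
      }
    };
    ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
  });
}
```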

4. close_context

Flushes any remaining text, sends final audio, and closes the WebSocket connection.
{
  "close_context": {
    "context_id": "my-session-123"
  }
}

Server → Client Messages

context_created

Confirms the session was initialized successfully.
{
  "context_created": {
    "context_id": "my-session-123"
  }
}

audio_chunk

Contains a chunk of synthesized audio. Multiple chunks are sent per flush. Each chunk is streamed as soon as it’s decoded for minimum latency.
{
  "audio_chunk": {
    "context_id": "my-session-123",
    "audioContent": "SGVsbG8gd29ybGQ..."
  }
}
Fields:

| Field | Type | Description |
| --- | --- | --- |
| context_id | string | Session identifier |
| audioContent | string | Base64-encoded audio bytes |
The audio encoding and chunk format depend on audio_config:

| Encoding | Chunk format | Notes |
| --- | --- | --- |
| LINEAR16 / PCM | Raw PCM, 16-bit signed LE, mono | Chunks can be concatenated directly |
| MULAW / ULAW | G.711 μ-law, 8-bit, mono | Chunks can be concatenated directly |
| OGG_OPUS | Independent Ogg Opus files | Each chunk is a self-contained OGG file; chunks cannot be naively concatenated. Decode each independently, or use LINEAR16 for a concatenatable stream |
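For LINEAR16/PCM chunks, decoding in Node.js amounts to a base64 decode followed by little-endian 16-bit reads. A sketch (decodePcmChunk is a hypothetical helper):

```javascript
// Decode a LINEAR16 audio_chunk payload (base64) into Int16 PCM samples.
// Copying sample-by-sample avoids alignment issues with pooled Buffers.
function decodePcmChunk(audioContent) {
  const raw = Buffer.from(audioContent, 'base64'); // little-endian 16-bit PCM
  const samples = new Int16Array(raw.length >> 1);
  for (let i = 0; i < samples.length; i++) {
    samples[i] = raw.readInt16LE(i * 2);
  }
  return samples;
}
```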

flush_completed

Signals that all audio for the flushed text has been sent. Only sent on successful synthesis — if synthesis fails, an error is sent instead.
{
  "flush_completed": {
    "context_id": "my-session-123"
  }
}

context_closed

Confirms the session is closed. The WebSocket connection closes after this message.
{
  "context_closed": {
    "context_id": "my-session-123"
  }
}

error

Sent if synthesis fails or a protocol violation occurs.
{
  "error": {
    "context_id": "my-session-123",
    "message": "voice_id required"
  }
}
Common errors:

| Message | Cause |
| --- | --- |
| voice_id required | create_context sent without voice_id |
| text_buffer exceeded 50000 character limit | Too much text accumulated without flushing |
| Unsupported audio_encoding: X | Invalid encoding value |
| Unsupported sample_rate_hertz: X | Invalid sample rate |

Complete Example

    Real-time Playback (Browser)

    For low-latency playback as chunks arrive (instead of waiting for all chunks), use the Web Audio API with scheduled AudioBufferSourceNode:
    let audioCtx;
    let nextPlayTime = 0;
    
    function onAudioChunk(base64Audio, sampleRate) {
      const binary = atob(base64Audio);
      const bytes = new Uint8Array(binary.length);
      for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
    
      // Decode PCM int16 LE → Float32
      const int16 = new Int16Array(bytes.buffer);
      const float32 = new Float32Array(int16.length);
      for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;
    
      if (!audioCtx) audioCtx = new AudioContext({ sampleRate });
      const buf = audioCtx.createBuffer(1, float32.length, sampleRate);
      buf.getChannelData(0).set(float32);
      const source = audioCtx.createBufferSource();
      source.buffer = buf;
      source.connect(audioCtx.destination);
    
      const now = audioCtx.currentTime;
      // First chunk: 150ms pre-buffer to absorb network jitter
      // Late chunks: schedule 20ms ahead for minimal gap
      if (nextPlayTime <= now) {
        nextPlayTime = now + (nextPlayTime === 0 ? 0.15 : 0.02);
      }
      source.start(nextPlayTime);
      nextPlayTime += float32.length / sampleRate;
    }
    

    LLM Integration Pattern

    When streaming tokens from an LLM into TTS:
    import json
    import base64
    import websockets
    
    async def llm_to_tts(llm_stream, voice_id):
        API_KEY = "sk_live_your_key"
        url = f"ws://api.60db.ai/ws/tts?apiKey={API_KEY}"
    
        async with websockets.connect(url) as ws:
            # Handshake: the server sends a "connecting" message, then connection_established
            await ws.recv()  # connecting (authenticating)
            await ws.recv()  # connection_established
    
            # Create context
            await ws.send(json.dumps({
                "create_context": {
                    "context_id": "llm-session",
                    "voice_id": voice_id,
                    "audio_config": {"audio_encoding": "MULAW", "sample_rate_hertz": 8000}
                }
            }))
            await ws.recv()  # context_created
    
            # Stream LLM tokens as text chunks
            async for token in llm_stream:
                await ws.send(json.dumps({
                    "send_text": {"context_id": "llm-session", "text": token}
                }))
    
            # Flush + close when LLM is done
            await ws.send(json.dumps({
                "flush_context": {"context_id": "llm-session"}
            }))
    
            audio = b""
            while True:
                msg = json.loads(await ws.recv())
                if "audio_chunk" in msg:
                    audio += base64.b64decode(msg["audio_chunk"]["audioContent"])
                elif "flush_completed" in msg:
                    break
                elif "error" in msg:
                    raise RuntimeError(msg["error"]["message"])
    
            await ws.send(json.dumps({
                "close_context": {"context_id": "llm-session"}
            }))
            await ws.recv()  # context_closed
    
            return audio
    

    Audio Format Notes

    | Encoding | Format | Chunk behavior | Best for |
    | --- | --- | --- | --- |
    | LINEAR16 | Raw PCM, 16-bit signed LE, mono | Concatenatable | General purpose, highest quality |
    | MULAW | G.711 μ-law, 8 kHz, mono | Concatenatable | Telephony (Twilio, SIP) |
    | OGG_OPUS | Ogg Opus compressed, 24 kHz | NOT concatenatable; each chunk is a standalone OGG file | Web playback, bandwidth-constrained |
    For telephony integration (Twilio, etc.), use MULAW at 8000 Hz:
    "audio_config": {
      "audio_encoding": "MULAW",
      "sample_rate_hertz": 8000
    }
    
    For web playback with low bandwidth, use OGG_OPUS at 24000 Hz:
    "audio_config": {
      "audio_encoding": "OGG_OPUS",
      "sample_rate_hertz": 24000
    }
    
    Important: OGG_OPUS chunks are individually wrapped OGG files. To merge for download, decode each chunk independently (e.g., via AudioContext.decodeAudioData()) and concatenate the PCM output. Do not concatenate raw OGG bytes.
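Once each OGG_OPUS chunk has been decoded independently, the resulting PCM can be merged with a simple concatenation. A sketch of the merge step only (in a browser the decode step would use AudioContext.decodeAudioData; concatPcm is a hypothetical helper):

```javascript
// Concatenate per-chunk PCM (e.g. Float32Array channel data obtained by
// decoding each OGG_OPUS chunk independently) into one continuous buffer.
function concatPcm(chunks) {
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}
```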

    Supported Languages

    The TTS model supports synthesis in multiple Indic languages and English. The language is auto-detected from the input text — no explicit language parameter is needed.
    | Language | ID |
    | --- | --- |
    | English | en |
    | Hindi | hi |
    | Bengali | bn |
    | Gujarati | gu |
    | Kannada | kn |
    | Malayalam | ml |
    | Marathi | mr |
    | Punjabi | pa |
    | Tamil | ta |
    | Telugu | te |
    | Assamese | as |
    | Odia | or |
    Mixed-language text is supported. Use newlines (\n) to separate paragraphs in different languages for best results.

    Default Voice

    The default voice ID is:
    fbb75ed2-975a-40c7-9e06-38e30524a9a1
    
    To get more voices, use the Voices API.

    Context Management

    Reuse Context

    Keep a context open for multiple syntheses:
    // Create once
    ws.send(JSON.stringify({
      create_context: { context_id, voice_id, audio_config }
    }));
    
    // Send multiple texts
    ws.send(JSON.stringify({ send_text: { context_id, text: "Hello" } }));
    ws.send(JSON.stringify({ flush_context: { context_id } }));
    
    ws.send(JSON.stringify({ send_text: { context_id, text: "World" } }));
    ws.send(JSON.stringify({ flush_context: { context_id } }));
    
    // Close when done
    ws.send(JSON.stringify({ close_context: { context_id } }));
    

    Multiple Contexts

    You can create multiple contexts in one connection:
    const context1 = 'ctx-1';
    const context2 = 'ctx-2';
    
    // Create both contexts
    ws.send(JSON.stringify({
      create_context: {
        context_id: context1,
        voice_id: voice1,
        audio_config
      }
    }));
    
    ws.send(JSON.stringify({
      create_context: {
        context_id: context2,
        voice_id: voice2,
        audio_config
      }
    }));
    

    Pricing

    • Rate: $0.00002 per character
    • Minimum: $0.01 per context
    • Billing: Per character synthesized
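Under these rates, the cost of a context can be estimated client-side. A sketch (estimateCost is illustrative; actual billing happens server-side):

```javascript
// Estimate the charge for one context: $0.00002 per character,
// with a $0.01 minimum per context.
function estimateCost(charCount) {
  const RATE_PER_CHAR = 0.00002;
  const MINIMUM = 0.01;
  return Math.max(charCount * RATE_PER_CHAR, MINIMUM);
}
```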

    Error Codes

    | Code | Description |
    | --- | --- |
    | 1008 | Authentication failed |
    | 1008 | Insufficient credits |
    | 1011 | Voice not found |
    | 1011 | Invalid audio config |
    | 1006 | Connection lost |
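A client can branch on these close codes in its onclose handler. A sketch (the code mapping follows the table above; distinguishing the two 1008 cases by reason string is an assumption based on the INSUFFICIENT_CREDITS error mentioned under Authentication):

```javascript
// Map a WebSocket close event to an application-level category.
// Distinguishing the two 1008 cases by reason string is an assumption
// about the server's close reason, not documented behavior.
function classifyClose(code, reason = '') {
  if (code === 1008) {
    return reason.includes('INSUFFICIENT_CREDITS')
      ? 'insufficient_credits'
      : 'authentication_failed';
  }
  if (code === 1011) return 'server_error';    // voice not found / invalid config
  if (code === 1006) return 'connection_lost'; // abnormal closure
  return 'closed';
}
```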

    Testing

    # Install wscat
    npm install -g wscat
    
    # Test TTS
    wscat -c "ws://api.60db.ai/ws/tts?apiKey=sk_live_your_key"
    
    Then send:
    {"create_context":{"context_id":"test-123","voice_id":"fbb75ed2-975a-40c7-9e06-38e30524a9a1","audio_config":{"audio_encoding":"LINEAR16","sample_rate_hertz":16000}}}
    {"send_text":{"context_id":"test-123","text":"Hello"}}
    {"flush_context":{"context_id":"test-123"}}