TTS WebSocket API

Real-time Text-to-Speech synthesis via full-duplex WebSocket streaming.

🚀 Quick Start (Copy & Paste)

const WebSocket = require('ws');
const fs = require('fs');

// 1. Your API key
const API_KEY = 'sk_live_your_api_key';

// 2. Connect
const ws = new WebSocket(`wss://api.60db.ai/ws/tts?apiKey=${API_KEY}`);

// 3. Unique context ID
const contextId = 'my-session-' + Date.now();
const audioChunks = [];

// 4. Handle messages
ws.on('message', (data) => {
  const msg = JSON.parse(data);

  // Authenticated? Create context!
  if (msg.connection_established) {
    console.log('✅ Authenticated');
    ws.send(JSON.stringify({
      create_context: {
        context_id: contextId,
        voice_id: 'fbb75ed2-975a-40c7-9e06-38e30524a9a1',
        audio_config: { audio_encoding: 'LINEAR16', sample_rate_hertz: 16000 }
      }
    }));
  }

  // Context ready? Send text!
  if (msg.context_created) {
    console.log('✅ Context created');

    // Send your text
    ws.send(JSON.stringify({
      send_text: { context_id: contextId, text: 'Hello, world!' }
    }));

    // Flush to get audio
    ws.send(JSON.stringify({
      flush_context: { context_id: contextId }
    }));
  }

  // Got audio? Save it!
  if (msg.audio_chunk) {
    const audioData = Buffer.from(msg.audio_chunk.audioContent, 'base64');
    audioChunks.push(audioData);
    console.log('🔊 Audio chunk:', audioData.length, 'bytes');
  }

  // All audio received?
  if (msg.flush_completed) {
    console.log('✅ All audio received!');
    console.log('   Total:', audioChunks.length, 'chunks');

    // Close context
    ws.send(JSON.stringify({
      close_context: { context_id: contextId }
    }));
  }

  // Done! Save audio
  if (msg.context_closed) {
    console.log('✅ Complete!');

    // Save to file
    const completeAudio = Buffer.concat(audioChunks);
    fs.writeFileSync('output.pcm', completeAudio);
    console.log('💾 Saved: output.pcm');
    console.log('   Size:', completeAudio.length, 'bytes');

    ws.close();
  }
});
That’s it! You’ll see:
  • ✅ Authenticated
  • ✅ Context created
  • 🔊 Audio chunk: 1024 bytes (multiple times)
  • ✅ All audio received!
  • ✅ Complete!
  • 💾 Saved: output.pcm

📖 How It Works (5 Simple Steps)

  1. Connect with your API key
  2. Create a context with voice settings
  3. Send your text message
  4. Flush to trigger synthesis
  5. Close when done (receive audio file)

Endpoint

wss://api.60db.ai/ws/tts

Authentication

Authenticate by passing your API key as a query parameter:
wss://api.60db.ai/ws/tts?apiKey=sk_live_your_api_key
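If the API key could ever contain characters that need escaping, the connection URL can be built with URLSearchParams, which handles percent-encoding. A minimal sketch; `buildTtsUrl` is an illustrative helper, and the base URL is taken from the examples in this document:

```javascript
// Build the connection URL with the API key as a query parameter.
// URLSearchParams percent-encodes reserved characters automatically.
function buildTtsUrl(apiKey, base = 'wss://api.60db.ai/ws/tts') {
  const params = new URLSearchParams({ apiKey });
  return `${base}?${params.toString()}`;
}
```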

Protocol Overview

Client                                  Server
  |                                       |
  |─── create_context ──────────────────▶ |
  |◀── context_created ─────────────────  |
  |                                       |
  |─── send_text ───────────────────────▶ |
  |─── flush_context ───────────────────▶ |
  |◀── audio_chunk #1 ──────────────────  |
  |◀── audio_chunk #N ──────────────────  |
  |◀── flush_completed ─────────────────  |
  |                                       |
  |─── close_context ───────────────────▶ |
  |◀── context_closed ──────────────────  |

Connection Sequence

1. Connect

const ws = new WebSocket('wss://api.60db.ai/ws/tts?apiKey=sk_live_your_key');

2. Receive Authentication Message

{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}

3. Receive Connection Established

{
  "connection_established": {
    "service": "tts",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
Fields:

| Field | Type | Description |
|---|---|---|
| service | string | Service name: "tts" |
| user_id | integer | Your user ID |
| credit_balance | number | Available credits |
| workspace | string | Workspace name |

Client → Server Messages

1. create_context

Must be the first message. Initializes the TTS session with voice and audio settings.
{
  "create_context": {
    "context_id": "my-session-123",
    "voice_id": "7911a3e8",
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "sample_rate_hertz": 16000
    },
    "speed": 1,
    "stability": 50,
    "similarity": 75
  }
}
Parameters:

Supported encoding + sample rate combinations:

| audio_encoding | Supported sample_rate_hertz | Output format |
|---|---|---|
| LINEAR16 | 8000, 16000 (default), 24000, 48000 | Raw PCM, 16-bit signed little-endian, mono |
| PCM | 8000, 16000 (default), 24000, 48000 | Same as LINEAR16 |
| MULAW | 8000 | G.711 μ-law encoded, mono |
| ULAW | 8000 | Same as MULAW |
| OGG_OPUS | 24000 | Ogg Opus compressed audio |

Note: MULAW/ULAW only work at 8000 Hz. OGG_OPUS only works at 24000 Hz.

Limits:

| Parameter | Min | Max | Default |
|---|---|---|---|
| speed | 0.5 | 2.0 | 1 |
| stability | 0 | 100 | 50 |
| similarity | 0 | 100 | 75 |
| text (per send_text) | 1 char | — | — |
| text buffer (accumulated) | — | 50,000 chars | — |
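The documented limits can be checked client-side before sending create_context, so a bad value is caught before the server rejects it. A sketch only; the server remains the authority, and `withDefaults` is an illustrative helper:

```javascript
// Documented parameter limits (speed, stability, similarity).
const LIMITS = {
  speed:      { min: 0.5, max: 2.0, default: 1 },
  stability:  { min: 0,   max: 100, default: 50 },
  similarity: { min: 0,   max: 100, default: 75 },
};

// Fill in defaults and reject out-of-range values early.
function withDefaults(params = {}) {
  const out = {};
  for (const [name, { min, max, default: dflt }] of Object.entries(LIMITS)) {
    const value = params[name] ?? dflt;
    if (value < min || value > max) {
      throw new RangeError(`${name} must be between ${min} and ${max}`);
    }
    out[name] = value;
  }
  return out;
}
```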

2. send_text

Append text to the internal buffer. Text is accumulated until a flush_context or close_context is received.
{
  "send_text": {
    "context_id": "my-session-123",
    "text": "Hello, how are you doing today?"
  }
}
Fields:

| Field | Type | Description |
|---|---|---|
| context_id | string | Session identifier |
| text | string | Text to append to the buffer |

You can send multiple send_text messages to build up text incrementally (e.g., from an LLM token stream):
{"send_text": {"context_id": "ctx-1", "text": "Hello, "}}
{"send_text": {"context_id": "ctx-1", "text": "how are you "}}
{"send_text": {"context_id": "ctx-1", "text": "doing today?"}}
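The incremental pattern above can be wrapped in a small helper that forwards a stream of tokens and flushes once the stream ends. A sketch: `streamTokensToContext` is an illustrative name, and `ws` is assumed to be an open, authenticated connection whose context has already been created:

```javascript
// Forward each token as a send_text message, then flush once the
// token stream is exhausted so the server synthesizes the buffer.
async function streamTokensToContext(ws, contextId, tokens) {
  for await (const token of tokens) {
    ws.send(JSON.stringify({ send_text: { context_id: contextId, text: token } }));
  }
  ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
}
```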

3. flush_context

Triggers synthesis of all accumulated text. The server responds with audio_chunk messages followed by flush_completed.
{
  "flush_context": {
    "context_id": "my-session-123"
  }
}

4. close_context

Flushes any remaining text, sends final audio, and closes the WebSocket connection.
{
  "close_context": {
    "context_id": "my-session-123"
  }
}

Server → Client Messages

context_created

Confirms the session was initialized successfully.
{
  "context_created": {
    "context_id": "my-session-123"
  }
}

audio_chunk

Contains a chunk of synthesized audio. Multiple chunks are sent per flush.
{
  "audio_chunk": {
    "context_id": "my-session-123",
    "audioContent": "SGVsbG8gd29ybGQ..."
  }
}
Fields:

| Field | Type | Description |
|---|---|---|
| context_id | string | Session identifier |
| audioContent | string | Base64-encoded audio bytes |

The audio encoding and chunk format depend on audio_config:

| Encoding | Chunk format | Notes |
|---|---|---|
| LINEAR16 / PCM | Raw PCM, 16-bit signed LE, mono | Chunks can be concatenated directly |
| MULAW / ULAW | G.711 μ-law, 8-bit, mono | Chunks can be concatenated directly |
| OGG_OPUS | Independent Ogg Opus files | Each chunk is a self-contained OGG file; chunks cannot be naively concatenated |
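Because LINEAR16/PCM chunks concatenate directly, the combined buffer can be made playable in ordinary audio players by prepending a standard 44-byte WAV header. A sketch assuming mono 16-bit PCM; `pcmToWav` is an illustrative helper, not part of the API:

```javascript
// Wrap concatenated LINEAR16 PCM (16-bit signed LE, mono) in a
// minimal 44-byte RIFF/WAVE header.
function pcmToWav(pcm, sampleRate = 16000) {
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4);  // RIFF chunk size
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);              // fmt chunk size
  header.writeUInt16LE(1, 20);               // audio format: PCM
  header.writeUInt16LE(1, 22);               // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28);  // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);               // block align
  header.writeUInt16LE(16, 34);              // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);      // data chunk size
  return Buffer.concat([header, pcm]);
}
```

For example, `fs.writeFileSync('output.wav', pcmToWav(Buffer.concat(audioChunks)))` would replace the raw `output.pcm` write in the quick start.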

flush_completed

Signals that all audio for the flushed text has been sent.
{
  "flush_completed": {
    "context_id": "my-session-123"
  }
}

context_closed

Confirms the session is closed. The WebSocket connection closes after this message.
{
  "context_closed": {
    "context_id": "my-session-123"
  }
}

error

Sent if synthesis fails or a protocol violation occurs.
{
  "error": {
    "context_id": "my-session-123",
    "message": "voice_id required"
  }
}
Common errors:

| Message | Cause |
|---|---|
| voice_id required | create_context sent without voice_id |
| text_buffer exceeded 50000 character limit | Too much text accumulated without flushing |
| Unsupported audio_encoding: X | Invalid encoding value |
| Unsupported sample_rate_hertz: X | Invalid sample rate |
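In a client, the error message can be dispatched alongside the other server message types in one place. A sketch; `handleServerMessage` and the handler names are illustrative, not part of the API:

```javascript
// Dispatch one raw server message to the matching handler,
// including the error case documented above.
function handleServerMessage(raw, handlers = {}) {
  const msg = JSON.parse(raw);
  if (msg.error) {
    handlers.onError?.(msg.error.context_id, msg.error.message);
  } else if (msg.audio_chunk) {
    handlers.onAudio?.(msg.audio_chunk.context_id,
                       Buffer.from(msg.audio_chunk.audioContent, 'base64'));
  } else if (msg.flush_completed) {
    handlers.onFlushed?.(msg.flush_completed.context_id);
  }
}
```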

Complete Example

Audio Configuration

| Encoding | Sample Rates | Description |
|---|---|---|
| LINEAR16 | 8000, 16000, 24000, 48000 | PCM 16-bit signed |
| MULAW | 8000 | G.711 μ-law (telephony) |
| OGG_OPUS | 24000 | Compressed audio |

For telephony integration (Twilio, etc.), use MULAW at 8000 Hz.
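For inspecting MULAW chunks outside a telephony stack, the standard G.711 μ-law decode expands each byte to a 16-bit PCM sample. A self-contained sketch of that standard algorithm (not an API call):

```javascript
// G.711 mu-law decode: one encoded byte -> one 16-bit PCM sample.
function ulawToPcm16(uByte) {
  const u = ~uByte & 0xff;           // mu-law bytes are stored complemented
  let t = ((u & 0x0f) << 3) + 0x84;  // mantissa plus bias (132)
  t <<= (u & 0x70) >> 4;             // apply the 3-bit exponent
  return (u & 0x80) ? (0x84 - t) : (t - 0x84);
}
```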

Default Voice

The default voice ID is:
fbb75ed2-975a-40c7-9e06-38e30524a9a1

To get more voices, use the Voices API.

Context Management

Reuse Context

Keep a context open for multiple syntheses:

// Create once
ws.send(JSON.stringify({
  create_context: { context_id, voice_id, audio_config }
}));

// Send multiple texts
ws.send(JSON.stringify({ send_text: { context_id, text: "Hello" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));

ws.send(JSON.stringify({ send_text: { context_id, text: "World" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));

// Close when done
ws.send(JSON.stringify({ close_context: { context_id } }));

Multiple Contexts

You can create multiple contexts in one connection:

const context1 = 'ctx-1';
const context2 = 'ctx-2';

// Create both contexts
ws.send(JSON.stringify({
  create_context: { context_id: context1, voice_id: voice1, audio_config }
}));

ws.send(JSON.stringify({
  create_context: { context_id: context2, voice_id: voice2, audio_config }
}));
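With several contexts multiplexed on one connection, incoming audio_chunk messages need to be routed by their context_id. A minimal sketch; `buffers` and `routeAudioChunk` are illustrative names:

```javascript
// Per-context audio buffers, keyed by context_id.
const buffers = new Map();

// Append an audio_chunk message's decoded bytes to its context's buffer.
function routeAudioChunk(msg) {
  if (!msg.audio_chunk) return;
  const { context_id, audioContent } = msg.audio_chunk;
  if (!buffers.has(context_id)) buffers.set(context_id, []);
  buffers.get(context_id).push(Buffer.from(audioContent, 'base64'));
}
```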
    

Supported Languages

The TTS model supports synthesis in multiple Indic languages and English. The language is auto-detected from the input text.

| Language | ID |
|---|---|
| English | en |
| Hindi | hi |
| Bengali | bn |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Marathi | mr |
| Punjabi | pa |
| Tamil | ta |
| Telugu | te |

Pricing

  • Rate: $0.00002 per character
  • Minimum: $0.01 per context
  • Billing: Per character synthesized
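From the published rates, a per-context cost can be estimated from the character count. A sketch; `estimateCostUsd` is an illustrative helper:

```javascript
// Estimated cost: $0.00002 per character, $0.01 per-context minimum.
function estimateCostUsd(charCount) {
  return Math.max(charCount * 0.00002, 0.01);
}
```

For example, a 1,000-character context costs about $0.02, while a 100-character context is billed at the $0.01 minimum.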