# TTS WebSocket API
Real-time Text-to-Speech synthesis via WebSocket streaming with full-duplex bidirectional communication.
## 🚀 Quick Start (Copy & Paste)

```javascript
const WebSocket = require('ws');
const fs = require('fs');

// 1. Your API key
const API_KEY = 'sk_live_your_api_key';

// 2. Connect
const ws = new WebSocket(`wss://api.60db.ai/ws/tts?apiKey=${API_KEY}`);

// 3. Unique context ID
const contextId = 'my-session-' + Date.now();
const audioChunks = [];

// 4. Handle messages
ws.on('message', (data) => {
  const msg = JSON.parse(data);

  // Authenticated? Create context!
  if (msg.connection_established) {
    console.log('✅ Authenticated');
    ws.send(JSON.stringify({
      create_context: {
        context_id: contextId,
        voice_id: 'fbb75ed2-975a-40c7-9e06-38e30524a9a1',
        audio_config: { audio_encoding: 'LINEAR16', sample_rate_hertz: 16000 }
      }
    }));
  }

  // Context ready? Send text!
  if (msg.context_created) {
    console.log('✅ Context created');

    // Send your text
    ws.send(JSON.stringify({
      send_text: { context_id: contextId, text: 'Hello, world!' }
    }));

    // Flush to get audio
    ws.send(JSON.stringify({
      flush_context: { context_id: contextId }
    }));
  }

  // Got audio? Save it!
  if (msg.audio_chunk) {
    const audioData = Buffer.from(msg.audio_chunk.audioContent, 'base64');
    audioChunks.push(audioData);
    console.log('🔊 Audio chunk:', audioData.length, 'bytes');
  }

  // All audio received?
  if (msg.flush_completed) {
    console.log('✅ All audio received!');
    console.log('   Total:', audioChunks.length, 'chunks');

    // Close context
    ws.send(JSON.stringify({
      close_context: { context_id: contextId }
    }));
  }

  // Done! Save audio
  if (msg.context_closed) {
    console.log('✅ Complete!');

    // Save to file
    const completeAudio = Buffer.concat(audioChunks);
    fs.writeFileSync('output.pcm', completeAudio);
    console.log('💾 Saved: output.pcm');
    console.log('   Size:', completeAudio.length, 'bytes');

    ws.close();
  }
});
```
That’s it! You’ll see:
- ✅ Authenticated
- ✅ Context created
- 🔊 Audio chunk: 1024 bytes (multiple times)
- ✅ All audio received!
- ✅ Complete!
- 💾 Saved: output.pcm
## 📖 How It Works (5 Simple Steps)

1. Connect with your API key
2. Create a context with voice settings
3. Send your text message
4. Flush to trigger synthesis
5. Close when done (receive audio file)
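The five steps above boil down to four client→server message types. A small set of builder helpers for them is sketched below; the field names come from this document, while the default `audio_config` in `createContext` is our assumption, not a documented server default:

```javascript
// Builders for the four client→server messages in this protocol.
// Each returns a JSON string ready to pass to ws.send(...).

function createContext(contextId, voiceId,
    // NOTE: this default audio_config is an assumption for convenience.
    audioConfig = { audio_encoding: 'LINEAR16', sample_rate_hertz: 16000 }) {
  return JSON.stringify({
    create_context: { context_id: contextId, voice_id: voiceId, audio_config: audioConfig }
  });
}

function sendText(contextId, text) {
  return JSON.stringify({ send_text: { context_id: contextId, text } });
}

function flushContext(contextId) {
  return JSON.stringify({ flush_context: { context_id: contextId } });
}

function closeContext(contextId) {
  return JSON.stringify({ close_context: { context_id: contextId } });
}
```

Using named builders keeps the message shapes in one place instead of scattering `JSON.stringify` calls through the handler.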
## Endpoint

```
wss://api.60db.ai/ws/tts
```

## Authentication

Pass your API key as the `apiKey` query parameter:

```
wss://api.60db.ai/ws/tts?apiKey=sk_live_your_api_key
```
## Protocol Overview

```
Client                                   Server
  |                                        |
  |─── create_context ───────────────────▶ |
  |◀── context_created ────────────────────|
  |                                        |
  |─── send_text ────────────────────────▶ |
  |─── flush_context ────────────────────▶ |
  |◀── audio_chunk #1 ─────────────────────|
  |◀── audio_chunk #N ─────────────────────|
  |◀── flush_completed ────────────────────|
  |                                        |
  |─── close_context ────────────────────▶ |
  |◀── context_closed ─────────────────────|
```
## Connection Sequence

### 1. Connect

```javascript
const ws = new WebSocket('wss://api.60db.ai/ws/tts?apiKey=sk_live_your_key');
```

### 2. Receive Authentication Message

```json
{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}
```

### 3. Receive Connection Established

```json
{
  "connection_established": {
    "service": "tts",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
```
Fields:

| Field | Description |
|---|---|
| `service` | Always `"tts"` for this endpoint |
| `user_id` | Numeric ID of the authenticated user |
| `credit_balance` | Remaining credit on the account |
| `workspace` | Workspace associated with the API key |
## Client → Server Messages

### 1. create_context

Must be the first message. Initializes the TTS session with voice and audio settings.

```json
{
  "create_context": {
    "context_id": "my-session-123",
    "voice_id": "7911a3e8",
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "sample_rate_hertz": 16000
    },
    "speed": 1,
    "stability": 50,
    "similarity": 75
  }
}
```
Parameters:

| Parameter | Required | Description |
|---|---|---|
| `context_id` | Yes | Client-chosen unique identifier for the session |
| `voice_id` | Yes | Voice to synthesize with (see the Voices API) |
| `audio_config` | No | Output `audio_encoding` and `sample_rate_hertz` (see below) |
| `speed` | No | Speaking speed, 0.5–2.0 (default 1) |
| `stability` | No | Voice stability, 0–100 (default 50) |
| `similarity` | No | Voice similarity, 0–100 (default 75) |
Supported encoding + sample rate combinations:

| `audio_encoding` | Supported `sample_rate_hertz` | Output format |
|---|---|---|
| LINEAR16 | 8000, 16000 (default), 24000, 48000 | Raw PCM, 16-bit signed little-endian, mono |
| PCM | 8000, 16000 (default), 24000, 48000 | Same as LINEAR16 |
| MULAW | 8000 | G.711 μ-law encoded, mono |
| ULAW | 8000 | Same as MULAW |
| OGG_OPUS | 24000 | Ogg Opus compressed audio |

Note: MULAW/ULAW only works at 8000 Hz. OGG_OPUS only works at 24000 Hz.
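Validating the combination client-side avoids a round trip that would end in an `Unsupported sample_rate_hertz` error. A minimal guard, with the values copied from the table above:

```javascript
// Valid encoding → sample-rate combinations, per the table above.
const SUPPORTED_RATES = {
  LINEAR16: [8000, 16000, 24000, 48000],
  PCM: [8000, 16000, 24000, 48000],
  MULAW: [8000],
  ULAW: [8000],
  OGG_OPUS: [24000],
};

// Returns true if the API accepts this encoding/rate pair.
function isSupported(encoding, rate) {
  const rates = SUPPORTED_RATES[encoding];
  return Array.isArray(rates) && rates.includes(rate);
}
```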
Limits:

| Parameter | Min | Max | Default |
|---|---|---|---|
| speed | 0.5 | 2.0 | 1 |
| stability | 0 | 100 | 50 |
| similarity | 0 | 100 | 75 |
| text (per send_text) | 1 char | — | — |
| text buffer (accumulated) | — | 50,000 chars | — |
### 2. send_text

Append text to the internal buffer. Text is accumulated until a flush_context or close_context is received.

```json
{
  "send_text": {
    "context_id": "my-session-123",
    "text": "Hello, how are you doing today?"
  }
}
```
Fields:

| Field | Description |
|---|---|
| `context_id` | Context to append the text to |
| `text` | Text to append to the synthesis buffer |
You can send multiple send_text messages to build up text incrementally (e.g., from an LLM token stream):

```
{"send_text": {"context_id": "ctx-1", "text": "Hello, "}}
{"send_text": {"context_id": "ctx-1", "text": "how are you "}}
{"send_text": {"context_id": "ctx-1", "text": "doing today?"}}
```
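Piping an LLM token stream into a context could look like the sketch below. The `streamTokens` helper is our own; `ws` only needs a `send(string)` method, so a stub object works for testing:

```javascript
// Stream tokens (e.g. from an LLM) into a context as send_text messages,
// then flush once the stream ends. Accepts any sync or async iterable.
async function streamTokens(ws, contextId, tokens) {
  for await (const token of tokens) {
    ws.send(JSON.stringify({ send_text: { context_id: contextId, text: token } }));
  }
  ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
}
```

In a real pipeline you would flush at sentence boundaries rather than only at the end, to start audio playback sooner.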
### 3. flush_context

Triggers synthesis of all accumulated text. The server responds with audio_chunk messages followed by flush_completed.

```json
{
  "flush_context": {
    "context_id": "my-session-123"
  }
}
```
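When interleaving sends and flushes, it helps to wait for the matching flush_completed before continuing. One way to sketch that as a promise (the `on`/`removeListener` method names are those of the `ws` package's EventEmitter-based WebSocket):

```javascript
// Send flush_context and resolve when the matching flush_completed arrives,
// or reject if the server reports an error for this context.
function flushAndWait(ws, contextId) {
  return new Promise((resolve, reject) => {
    const onMessage = (data) => {
      const msg = JSON.parse(data);
      if (msg.flush_completed && msg.flush_completed.context_id === contextId) {
        ws.removeListener('message', onMessage);
        resolve();
      } else if (msg.error && msg.error.context_id === contextId) {
        ws.removeListener('message', onMessage);
        reject(new Error(msg.error.message));
      }
    };
    ws.on('message', onMessage);
    ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
  });
}
```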
### 4. close_context

Flushes any remaining text, sends final audio, and closes the WebSocket connection.

```json
{
  "close_context": {
    "context_id": "my-session-123"
  }
}
```
## Server → Client Messages

### context_created

Confirms the session was initialized successfully.

```json
{
  "context_created": {
    "context_id": "my-session-123"
  }
}
```
### audio_chunk

Contains a chunk of synthesized audio. Multiple chunks are sent per flush.

```json
{
  "audio_chunk": {
    "context_id": "my-session-123",
    "audioContent": "SGVsbG8gd29ybGQ..."
  }
}
```
Fields:

| Field | Description |
|---|---|
| `context_id` | Context the audio belongs to |
| `audioContent` | Base64-encoded audio bytes |
The audio encoding and chunk format depend on `audio_config`:

| Encoding | Chunk format | Notes |
|---|---|---|
| LINEAR16 / PCM | Raw PCM, 16-bit signed LE, mono | Chunks can be concatenated directly |
| MULAW / ULAW | G.711 μ-law, 8-bit, mono | Chunks can be concatenated directly |
| OGG_OPUS | Independent Ogg Opus files | Each chunk is a self-contained OGG file; chunks cannot be naively concatenated |
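Concatenated LINEAR16 chunks are headerless, so most players cannot open the resulting `.pcm` file directly. A common fix is to prepend a standard 44-byte WAV header; the layout below is the generic RIFF/WAVE format (not something this API returns), assuming 16-bit mono as described above:

```javascript
// Wrap raw LINEAR16 PCM chunks in a minimal WAV header so ordinary
// players can open the file. Assumes 16-bit signed LE, mono.
function pcmToWav(chunks, sampleRateHertz) {
  const pcm = Buffer.concat(chunks);
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4);      // RIFF chunk size
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);                  // fmt subchunk size
  header.writeUInt16LE(1, 20);                   // audio format: PCM
  header.writeUInt16LE(1, 22);                   // channels: mono
  header.writeUInt32LE(sampleRateHertz, 24);     // sample rate
  header.writeUInt32LE(sampleRateHertz * 2, 28); // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);                   // block align
  header.writeUInt16LE(16, 34);                  // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);          // data subchunk size
  return Buffer.concat([header, pcm]);
}
```

In the Quick Start, writing `pcmToWav(audioChunks, 16000)` to `output.wav` instead of the raw buffer to `output.pcm` yields a file any player can open.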
### flush_completed

Signals that all audio for the flushed text has been sent.

```json
{
  "flush_completed": {
    "context_id": "my-session-123"
  }
}
```
### context_closed

Confirms the session is closed. The WebSocket connection closes after this message.

```json
{
  "context_closed": {
    "context_id": "my-session-123"
  }
}
```
### error

Sent if synthesis fails or a protocol violation occurs.

```json
{
  "error": {
    "context_id": "my-session-123",
    "message": "voice_id required"
  }
}
```
Common errors:

| Message | Cause |
|---|---|
| `voice_id required` | create_context sent without voice_id |
| `text_buffer exceeded 50000 character limit` | Too much text accumulated without flushing |
| `Unsupported audio_encoding: X` | Invalid encoding value |
| `Unsupported sample_rate_hertz: X` | Invalid sample rate |
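These errors call for different reactions: the buffer-limit error is recoverable by flushing, while the config errors require a corrected create_context. A rough classifier over the message strings above (the category names are our own, not part of the protocol):

```javascript
// Map a server error message to a client-side recovery strategy.
// Categories are this sketch's invention; messages are from the table above.
function classifyError(message) {
  if (/text_buffer exceeded/.test(message)) return 'flush_required'; // flush, then resume
  if (/^Unsupported /.test(message)) return 'bad_config';            // fix audio_config
  if (/required/.test(message)) return 'missing_field';              // fix create_context
  return 'unknown';
}
```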
## Complete Example
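As a sketch of the full flow, the whole protocol can be wrapped in a single promise. The connection factory is injected so the same code runs against the real endpoint (pass `() => new WebSocket(url)`) or a stub; message shapes follow this document, and the function name is our own:

```javascript
// One-shot synthesis: drive create → send → flush → close for a single
// text and resolve with the concatenated audio buffer.
function synthesizeOnce(connect, { contextId, voiceId, audioConfig, text }) {
  return new Promise((resolve, reject) => {
    const ws = connect();
    const chunks = [];
    ws.on('message', (data) => {
      const msg = JSON.parse(data);
      if (msg.connection_established) {
        ws.send(JSON.stringify({
          create_context: { context_id: contextId, voice_id: voiceId, audio_config: audioConfig }
        }));
      } else if (msg.context_created) {
        ws.send(JSON.stringify({ send_text: { context_id: contextId, text } }));
        ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
      } else if (msg.audio_chunk) {
        chunks.push(Buffer.from(msg.audio_chunk.audioContent, 'base64'));
      } else if (msg.flush_completed) {
        ws.send(JSON.stringify({ close_context: { context_id: contextId } }));
      } else if (msg.context_closed) {
        resolve(Buffer.concat(chunks));
      } else if (msg.error) {
        reject(new Error(msg.error.message));
      }
    });
  });
}
```

This is exactly the Quick Start sequence, repackaged so callers can simply `await` the audio.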
## Audio Configuration

| Encoding | Sample Rates | Description |
|---|---|---|
| LINEAR16 | 8000, 16000, 24000, 48000 | PCM 16-bit signed |
| MULAW | 8000 | G.711 μ-law (telephony) |
| OGG_OPUS | 24000 | Compressed audio |
For telephony integration (Twilio, etc.), use MULAW at 8000 Hz.
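For instance, a create_context payload for an 8 kHz μ-law telephony pipeline might look like the sketch below; the voice_id is the default voice listed in this document, and the context_id scheme is arbitrary:

```javascript
// A create_context payload suited to telephony media streams (8 kHz μ-law).
const telephonyContext = {
  create_context: {
    context_id: 'call-' + Date.now(),
    voice_id: 'fbb75ed2-975a-40c7-9e06-38e30524a9a1',
    audio_config: { audio_encoding: 'MULAW', sample_rate_hertz: 8000 },
  },
};
```

The resulting audio_chunk payloads are already G.711 μ-law, so they can be forwarded to a telephony media stream without transcoding.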
## Default Voice

The default voice ID is:

```
fbb75ed2-975a-40c7-9e06-38e30524a9a1
```
To get more voices, use the Voices API.
## Context Management

### Reuse Context

Keep a context open for multiple syntheses:

```javascript
// Create once
ws.send(JSON.stringify({
  create_context: { context_id, voice_id, audio_config }
}));

// Send multiple texts
ws.send(JSON.stringify({ send_text: { context_id, text: "Hello" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));

ws.send(JSON.stringify({ send_text: { context_id, text: "World" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));

// Close when done
ws.send(JSON.stringify({ close_context: { context_id } }));
```
### Multiple Contexts

You can create multiple contexts in one connection:

```javascript
const context1 = 'ctx-1';
const context2 = 'ctx-2';

// Create both contexts
ws.send(JSON.stringify({
  create_context: { context_id: context1, voice_id: voice1, audio_config }
}));
ws.send(JSON.stringify({
  create_context: { context_id: context2, voice_id: voice2, audio_config }
}));
```
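With several contexts open, audio_chunk messages for all of them arrive interleaved on the same socket, so the client must demultiplex by context_id. A minimal router sketch (the helper name is our own):

```javascript
// Collect audio_chunk messages into a per-context buffer list,
// keyed by the context_id field each chunk carries.
function makeRouter() {
  const buffers = new Map();
  return {
    // Feed each raw WebSocket message here.
    handle(data) {
      const msg = JSON.parse(data);
      if (msg.audio_chunk) {
        const { context_id, audioContent } = msg.audio_chunk;
        if (!buffers.has(context_id)) buffers.set(context_id, []);
        buffers.get(context_id).push(Buffer.from(audioContent, 'base64'));
      }
    },
    // Concatenated audio received so far for one context.
    audioFor(contextId) {
      return Buffer.concat(buffers.get(contextId) || []);
    },
  };
}
```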
## Supported Languages

The TTS model supports synthesis in multiple Indic languages and English. The language is auto-detected from the input text.

| Language | ID |
|---|---|
| English | en |
| Hindi | hi |
| Bengali | bn |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Marathi | mr |
| Punjabi | pa |
| Tamil | ta |
| Telugu | te |
## Pricing
- Rate: $0.00002 per character
- Minimum: $0.01 per context
- Billing: Per character synthesized
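Putting the numbers together: a context costs the larger of the per-character total and the minimum, so 1,000 characters cost 1,000 × $0.00002 = $0.02, while 100 characters still cost the $0.01 minimum. As a one-line estimator:

```javascript
// Estimated USD cost for one context, per the pricing above:
// $0.00002 per character, with a $0.01 minimum per context.
function estimateCost(charCount) {
  return Math.max(0.01, charCount * 0.00002);
}
```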