# TTS WebSocket API
Real-time Text-to-Speech synthesis via WebSocket streaming with full-duplex bidirectional communication.
## 🚀 Quick Start (Copy & Paste)

```javascript
const WebSocket = require('ws');
const fs = require('fs');

// 1. Your API key
const API_KEY = 'sk_live_your_api_key';

// 2. Connect
const ws = new WebSocket(`wss://api.60db.ai/ws/tts?apiKey=${API_KEY}`);

// 3. Unique context ID
const contextId = 'my-session-' + Date.now();
const audioChunks = [];

// 4. Handle messages
ws.on('message', (data) => {
  const msg = JSON.parse(data);

  // Authenticated? Create context!
  if (msg.connection_established) {
    console.log('✅ Authenticated');
    ws.send(JSON.stringify({
      create_context: {
        context_id: contextId,
        voice_id: 'fbb75ed2-975a-40c7-9e06-38e30524a9a1',
        audio_config: { audio_encoding: 'LINEAR16', sample_rate_hertz: 16000 }
      }
    }));
  }

  // Context ready? Send text!
  if (msg.context_created) {
    console.log('✅ Context created');

    // Send your text
    ws.send(JSON.stringify({
      send_text: { context_id: contextId, text: 'Hello, world!' }
    }));

    // Flush to get audio
    ws.send(JSON.stringify({
      flush_context: { context_id: contextId }
    }));
  }

  // Got audio? Save it!
  if (msg.audio_chunk) {
    const audioData = Buffer.from(msg.audio_chunk.audioContent, 'base64');
    audioChunks.push(audioData);
    console.log('🔊 Audio chunk:', audioData.length, 'bytes');
  }

  // All audio received?
  if (msg.flush_completed) {
    console.log('✅ All audio received!');
    console.log('   Total:', audioChunks.length, 'chunks');

    // Close context
    ws.send(JSON.stringify({
      close_context: { context_id: contextId }
    }));
  }

  // Done! Save audio
  if (msg.context_closed) {
    console.log('✅ Complete!');

    // Save to file
    const completeAudio = Buffer.concat(audioChunks);
    fs.writeFileSync('output.pcm', completeAudio);
    console.log('💾 Saved: output.pcm');
    console.log('   Size:', completeAudio.length, 'bytes');

    ws.close();
  }
});
```
That’s it! You’ll see:
- ✅ Authenticated
- ✅ Context created
- 🔊 Audio chunk: 1024 bytes (multiple times)
- ✅ All audio received!
- ✅ Complete!
- 💾 Saved: output.pcm
## 📖 How It Works (5 Simple Steps)

1. Connect with your API key
2. Create a context with voice settings
3. Send your text message
4. Flush to trigger synthesis
5. Close when done (receive audio file)
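The five steps above boil down to four client→server message types. A small set of builder helpers for them is sketched below; the field names come from this document, while the default `audio_config` in `createContext` is our assumption, not a documented server default:

```javascript
// Builders for the four client→server messages in this protocol.
// Each returns a JSON string ready to pass to ws.send(...).

function createContext(contextId, voiceId,
    // NOTE: this default audio_config is an assumption for convenience.
    audioConfig = { audio_encoding: 'LINEAR16', sample_rate_hertz: 16000 }) {
  return JSON.stringify({
    create_context: { context_id: contextId, voice_id: voiceId, audio_config: audioConfig }
  });
}

function sendText(contextId, text) {
  return JSON.stringify({ send_text: { context_id: contextId, text } });
}

function flushContext(contextId) {
  return JSON.stringify({ flush_context: { context_id: contextId } });
}

function closeContext(contextId) {
  return JSON.stringify({ close_context: { context_id: contextId } });
}
```

Using named builders keeps the message shapes in one place instead of scattering `JSON.stringify` calls through the handler.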
## Endpoint

```
wss://api.60db.ai/ws/tts
```

## Authentication

Pass your API key as the `apiKey` query parameter:

```
wss://api.60db.ai/ws/tts?apiKey=sk_live_your_api_key
```
## Protocol Overview

```
Client                                   Server
  |                                        |
  |─── create_context ───────────────────▶ |
  |◀── context_created ────────────────────|
  |                                        |
  |─── send_text ────────────────────────▶ |
  |─── flush_context ────────────────────▶ |
  |◀── audio_chunk #1 ─────────────────────|
  |◀── audio_chunk #N ─────────────────────|
  |◀── flush_completed ────────────────────|
  |                                        |
  |─── close_context ────────────────────▶ |
  |◀── context_closed ─────────────────────|
```
## Connection Sequence

### 1. Connect

```javascript
const ws = new WebSocket('wss://api.60db.ai/ws/tts?apiKey=sk_live_your_key');
```

### 2. Receive Authentication Message

```json
{
  "connecting": true,
  "message": "Authenticating...",
  "timestamp": 1775465918269
}
```

### 3. Receive Connection Established

```json
{
  "connection_established": {
    "service": "tts",
    "user_id": 43,
    "credit_balance": 9.97,
    "workspace": "default"
  }
}
```
Fields:

| Field | Description |
|---|---|
| `service` | Always `"tts"` for this endpoint |
| `user_id` | Numeric ID of the authenticated user |
| `credit_balance` | Remaining credit on the account |
| `workspace` | Workspace associated with the API key |
## Client → Server Messages

### 1. create_context

Must be the first message. Initializes the TTS session with voice and audio settings.

```json
{
  "create_context": {
    "context_id": "my-session-123",
    "voice_id": "7911a3e8",
    "audio_config": {
      "audio_encoding": "LINEAR16",
      "sample_rate_hertz": 16000
    },
    "speed": 1,
    "stability": 50,
    "similarity": 75
  }
}
```
Parameters:

| Parameter | Required | Description |
|---|---|---|
| `context_id` | Yes | Client-chosen unique identifier for the session |
| `voice_id` | Yes | Voice to synthesize with (see the Voices API) |
| `audio_config` | No | Output `audio_encoding` and `sample_rate_hertz` (see below) |
| `speed` | No | Speaking speed, 0.5–2.0 (default 1) |
| `stability` | No | Voice stability, 0–100 (default 50) |
| `similarity` | No | Voice similarity, 0–100 (default 75) |
Supported encoding + sample rate combinations:

| `audio_encoding` | Supported `sample_rate_hertz` | Output format |
|---|---|---|
| LINEAR16 | 8000, 16000 (default), 24000, 48000 | Raw PCM, 16-bit signed little-endian, mono |
| PCM | 8000, 16000 (default), 24000, 48000 | Same as LINEAR16 |
| MULAW | 8000 | G.711 μ-law encoded, mono |
| ULAW | 8000 | Same as MULAW |
| OGG_OPUS | 24000 | Ogg Opus compressed audio |

Note: MULAW/ULAW only works at 8000 Hz. OGG_OPUS only works at 24000 Hz.
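Validating the combination client-side avoids a round trip that would end in an `Unsupported sample_rate_hertz` error. A minimal guard, with the values copied from the table above:

```javascript
// Valid encoding → sample-rate combinations, per the table above.
const SUPPORTED_RATES = {
  LINEAR16: [8000, 16000, 24000, 48000],
  PCM: [8000, 16000, 24000, 48000],
  MULAW: [8000],
  ULAW: [8000],
  OGG_OPUS: [24000],
};

// Returns true if the API accepts this encoding/rate pair.
function isSupported(encoding, rate) {
  const rates = SUPPORTED_RATES[encoding];
  return Array.isArray(rates) && rates.includes(rate);
}
```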
Limits:

| Parameter | Min | Max | Default |
|---|---|---|---|
| speed | 0.5 | 2.0 | 1 |
| stability | 0 | 100 | 50 |
| similarity | 0 | 100 | 75 |
| text (per send_text) | 1 char | — | — |
| text buffer (accumulated) | — | 50,000 chars | — |
### 2. send_text

Append text to the internal buffer. Text is accumulated until a flush_context or close_context is received.

```json
{
  "send_text": {
    "context_id": "my-session-123",
    "text": "Hello, how are you doing today?"
  }
}
```
Fields:

| Field | Description |
|---|---|
| `context_id` | Context to append the text to |
| `text` | Text to append to the synthesis buffer |
You can send multiple send_text messages to build up text incrementally (e.g., from an LLM token stream):

```
{"send_text": {"context_id": "ctx-1", "text": "Hello, "}}
{"send_text": {"context_id": "ctx-1", "text": "how are you "}}
{"send_text": {"context_id": "ctx-1", "text": "doing today?"}}
```
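Piping an LLM token stream into a context could look like the sketch below. The `streamTokens` helper is our own; `ws` only needs a `send(string)` method, so a stub object works for testing:

```javascript
// Stream tokens (e.g. from an LLM) into a context as send_text messages,
// then flush once the stream ends. Accepts any sync or async iterable.
async function streamTokens(ws, contextId, tokens) {
  for await (const token of tokens) {
    ws.send(JSON.stringify({ send_text: { context_id: contextId, text: token } }));
  }
  ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
}
```

In a real pipeline you would flush at sentence boundaries rather than only at the end, to start audio playback sooner.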
### 3. flush_context

Triggers synthesis of all accumulated text. The server responds with audio_chunk messages followed by flush_completed.

```json
{
  "flush_context": {
    "context_id": "my-session-123"
  }
}
```
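When interleaving sends and flushes, it helps to wait for the matching flush_completed before continuing. One way to sketch that as a promise (the `on`/`removeListener` method names are those of the `ws` package's EventEmitter-based WebSocket):

```javascript
// Send flush_context and resolve when the matching flush_completed arrives,
// or reject if the server reports an error for this context.
function flushAndWait(ws, contextId) {
  return new Promise((resolve, reject) => {
    const onMessage = (data) => {
      const msg = JSON.parse(data);
      if (msg.flush_completed && msg.flush_completed.context_id === contextId) {
        ws.removeListener('message', onMessage);
        resolve();
      } else if (msg.error && msg.error.context_id === contextId) {
        ws.removeListener('message', onMessage);
        reject(new Error(msg.error.message));
      }
    };
    ws.on('message', onMessage);
    ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
  });
}
```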
### 4. close_context

Flushes any remaining text, sends final audio, and closes the WebSocket connection.

```json
{
  "close_context": {
    "context_id": "my-session-123"
  }
}
```
## Server → Client Messages

### context_created

Confirms the session was initialized successfully.

```json
{
  "context_created": {
    "context_id": "my-session-123"
  }
}
```
### audio_chunk

Contains a chunk of synthesized audio. Multiple chunks are sent per flush.

```json
{
  "audio_chunk": {
    "context_id": "my-session-123",
    "audioContent": "SGVsbG8gd29ybGQ..."
  }
}
```
Fields:

| Field | Description |
|---|---|
| `context_id` | Context the audio belongs to |
| `audioContent` | Base64-encoded audio bytes |
The audio encoding and chunk format depend on `audio_config`:

| Encoding | Chunk format | Notes |
|---|---|---|
| LINEAR16 / PCM | Raw PCM, 16-bit signed LE, mono | Chunks can be concatenated directly |
| MULAW / ULAW | G.711 μ-law, 8-bit, mono | Chunks can be concatenated directly |
| OGG_OPUS | Independent Ogg Opus files | Each chunk is a self-contained OGG file; chunks cannot be naively concatenated |
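Concatenated LINEAR16 chunks are headerless, so most players cannot open the resulting `.pcm` file directly. A common fix is to prepend a standard 44-byte WAV header; the layout below is the generic RIFF/WAVE format (not something this API returns), assuming 16-bit mono as described above:

```javascript
// Wrap raw LINEAR16 PCM chunks in a minimal WAV header so ordinary
// players can open the file. Assumes 16-bit signed LE, mono.
function pcmToWav(chunks, sampleRateHertz) {
  const pcm = Buffer.concat(chunks);
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4);      // RIFF chunk size
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);                  // fmt subchunk size
  header.writeUInt16LE(1, 20);                   // audio format: PCM
  header.writeUInt16LE(1, 22);                   // channels: mono
  header.writeUInt32LE(sampleRateHertz, 24);     // sample rate
  header.writeUInt32LE(sampleRateHertz * 2, 28); // byte rate (16-bit mono)
  header.writeUInt16LE(2, 32);                   // block align
  header.writeUInt16LE(16, 34);                  // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);          // data subchunk size
  return Buffer.concat([header, pcm]);
}
```

In the Quick Start, writing `pcmToWav(audioChunks, 16000)` to `output.wav` instead of the raw buffer to `output.pcm` yields a file any player can open.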
### flush_completed

Signals that all audio for the flushed text has been sent.

```json
{
  "flush_completed": {
    "context_id": "my-session-123"
  }
}
```
### context_closed

Confirms the session is closed. The WebSocket connection closes after this message.

```json
{
  "context_closed": {
    "context_id": "my-session-123"
  }
}
```
### error

Sent if synthesis fails or a protocol violation occurs.

```json
{
  "error": {
    "context_id": "my-session-123",
    "message": "voice_id required"
  }
}
```
Common errors:

| Message | Cause |
|---|---|
| `voice_id required` | create_context sent without voice_id |
| `text_buffer exceeded 50000 character limit` | Too much text accumulated without flushing |
| `Unsupported audio_encoding: X` | Invalid encoding value |
| `Unsupported sample_rate_hertz: X` | Invalid sample rate |
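These errors call for different reactions: the buffer-limit error is recoverable by flushing, while the config errors require a corrected create_context. A rough classifier over the message strings above (the category names are our own, not part of the protocol):

```javascript
// Map a server error message to a client-side recovery strategy.
// Categories are this sketch's invention; messages are from the table above.
function classifyError(message) {
  if (/text_buffer exceeded/.test(message)) return 'flush_required'; // flush, then resume
  if (/^Unsupported /.test(message)) return 'bad_config';            // fix audio_config
  if (/required/.test(message)) return 'missing_field';              // fix create_context
  return 'unknown';
}
```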
## Complete Example
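As a sketch of the full flow, the whole protocol can be wrapped in a single promise. The connection factory is injected so the same code runs against the real endpoint (pass `() => new WebSocket(url)`) or a stub; message shapes follow this document, and the function name is our own:

```javascript
// One-shot synthesis: drive create → send → flush → close for a single
// text and resolve with the concatenated audio buffer.
function synthesizeOnce(connect, { contextId, voiceId, audioConfig, text }) {
  return new Promise((resolve, reject) => {
    const ws = connect();
    const chunks = [];
    ws.on('message', (data) => {
      const msg = JSON.parse(data);
      if (msg.connection_established) {
        ws.send(JSON.stringify({
          create_context: { context_id: contextId, voice_id: voiceId, audio_config: audioConfig }
        }));
      } else if (msg.context_created) {
        ws.send(JSON.stringify({ send_text: { context_id: contextId, text } }));
        ws.send(JSON.stringify({ flush_context: { context_id: contextId } }));
      } else if (msg.audio_chunk) {
        chunks.push(Buffer.from(msg.audio_chunk.audioContent, 'base64'));
      } else if (msg.flush_completed) {
        ws.send(JSON.stringify({ close_context: { context_id: contextId } }));
      } else if (msg.context_closed) {
        resolve(Buffer.concat(chunks));
      } else if (msg.error) {
        reject(new Error(msg.error.message));
      }
    });
  });
}
```

This is exactly the Quick Start sequence, repackaged so callers can simply `await` the audio.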
## Audio Configuration

| Encoding | Sample Rates | Description |
|---|---|---|
| LINEAR16 | 8000, 16000, 24000, 48000 | PCM 16-bit signed |
| MULAW | 8000 | G.711 μ-law (telephony) |
| OGG_OPUS | 24000 | Compressed audio |
For telephony integration (Twilio, etc.), use MULAW at 8000 Hz.
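For instance, a create_context payload for an 8 kHz μ-law telephony pipeline might look like the sketch below; the voice_id is the default voice listed in this document, and the context_id scheme is arbitrary:

```javascript
// A create_context payload suited to telephony media streams (8 kHz μ-law).
const telephonyContext = {
  create_context: {
    context_id: 'call-' + Date.now(),
    voice_id: 'fbb75ed2-975a-40c7-9e06-38e30524a9a1',
    audio_config: { audio_encoding: 'MULAW', sample_rate_hertz: 8000 },
  },
};
```

The resulting audio_chunk payloads are already G.711 μ-law, so they can be forwarded to a telephony media stream without transcoding.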
## Default Voice

The default voice ID is:

```
fbb75ed2-975a-40c7-9e06-38e30524a9a1
```
To get more voices, use the Voices API.
## Context Management

### Reuse Context

Keep a context open for multiple syntheses:

```javascript
// Create once
ws.send(JSON.stringify({
  create_context: { context_id, voice_id, audio_config }
}));

// Send multiple texts
ws.send(JSON.stringify({ send_text: { context_id, text: "Hello" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));

ws.send(JSON.stringify({ send_text: { context_id, text: "World" } }));
ws.send(JSON.stringify({ flush_context: { context_id } }));

// Close when done
ws.send(JSON.stringify({ close_context: { context_id } }));
```
### Multiple Contexts

You can create multiple contexts in one connection:

```javascript
const context1 = 'ctx-1';
const context2 = 'ctx-2';

// Create both contexts
ws.send(JSON.stringify({
  create_context: { context_id: context1, voice_id: voice1, audio_config }
}));
ws.send(JSON.stringify({
  create_context: { context_id: context2, voice_id: voice2, audio_config }
}));
```
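With several contexts open, audio_chunk messages for all of them arrive interleaved on the same socket, so the client must demultiplex by context_id. A minimal router sketch (the helper name is our own):

```javascript
// Collect audio_chunk messages into a per-context buffer list,
// keyed by the context_id field each chunk carries.
function makeRouter() {
  const buffers = new Map();
  return {
    // Feed each raw WebSocket message here.
    handle(data) {
      const msg = JSON.parse(data);
      if (msg.audio_chunk) {
        const { context_id, audioContent } = msg.audio_chunk;
        if (!buffers.has(context_id)) buffers.set(context_id, []);
        buffers.get(context_id).push(Buffer.from(audioContent, 'base64'));
      }
    },
    // Concatenated audio received so far for one context.
    audioFor(contextId) {
      return Buffer.concat(buffers.get(contextId) || []);
    },
  };
}
```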
## Supported Languages

The TTS model supports synthesis in multiple Indic languages and English. The language is auto-detected from the input text.

| Language | ID |
|---|---|
| English | en |
| Hindi | hi |
| Bengali | bn |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Marathi | mr |
| Punjabi | pa |
| Tamil | ta |
| Telugu | te |
## Pricing
- Rate: $0.00002 per character
- Minimum: $0.01 per context
- Billing: Per character synthesized
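Putting the numbers together: a context costs the larger of the per-character total and the minimum, so 1,000 characters cost 1,000 × $0.00002 = $0.02, while 100 characters still cost the $0.01 minimum. As a one-line estimator:

```javascript
// Estimated USD cost for one context, per the pricing above:
// $0.00002 per character, with a $0.01 minimum per context.
function estimateCost(charCount) {
  return Math.max(0.01, charCount * 0.00002);
}
```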