Documentation Index

Fetch the complete documentation index at: https://docs.60db.ai/llms.txt

Use this file to discover all available pages before exploring further.

Overview

60db’s Speech-to-Text (STT) API converts spoken audio into written text with high accuracy across 39 languages, including code-switched Indic+English. It is powered by 60db STT v01, a multi-backend speech recognition stack built around non-hallucinating models that do not invent text on silent or noisy input.

Features

Multi-Language

39 languages with auto-detection and Indic+English code-switching

Speaker Diarization

Opt-in pyannote speaker diarization via diarize: true

Timestamps

Word-level timestamps included automatically

Non-hallucinating

Non-hallucinating backend that emits blank tokens on silence — no phantom text

Basic Usage

import { SixtyDBClient } from '60db';

const client = new SixtyDBClient('your-api-key');

const file = document.querySelector('input[type="file"]').files[0];

const result = await client.speechToText(file, {
  language: 'en'
});

console.log('Transcription:', result.text);
console.log('Confidence:', result.confidence);

Supported Formats

Format | Max Size | Max Duration | Quality
MP3    | 25MB     | 10 min       | Good
WAV    | 25MB     | 10 min       | Excellent
FLAC   | 25MB     | 10 min       | Lossless
OGG    | 25MB     | 10 min       | Good
M4A    | 25MB     | 10 min       | Good
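
The limits above can be enforced client-side before upload to avoid a wasted round trip. A minimal sketch; the `validateAudioFile` helper and its error messages are illustrative, not part of the SDK:

```javascript
// Hypothetical pre-upload check mirroring the documented limits.
const SUPPORTED_FORMATS = new Set(['mp3', 'wav', 'flac', 'ogg', 'm4a']);
const MAX_SIZE_BYTES = 25 * 1024 * 1024; // 25MB
const MAX_DURATION_SECONDS = 10 * 60;    // 10 min

function validateAudioFile({ name, sizeBytes, durationSeconds }) {
  const ext = name.split('.').pop().toLowerCase();
  if (!SUPPORTED_FORMATS.has(ext)) {
    return { ok: false, reason: `unsupported format: ${ext}` };
  }
  if (sizeBytes > MAX_SIZE_BYTES) {
    return { ok: false, reason: 'file exceeds 25MB' };
  }
  if (durationSeconds > MAX_DURATION_SECONDS) {
    return { ok: false, reason: 'audio exceeds 10 minutes' };
  }
  return { ok: true };
}
```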

Language Support

Auto-Detection

Let the API automatically identify the language. Omit the language field (or pass the string "auto") and the server runs language identification across all 39 supported languages:
// Simplest form — omit language entirely
const result = await client.speechToText(file);

// Equivalent — the REST endpoint treats "auto" as omission.
// Note: on the WebSocket streaming endpoint you must use `languages: null`,
// not the literal string "auto".
const result2 = await client.speechToText(file, { language: 'auto' });

console.log('Detected language:', result.language);

Specify Language

For lowest latency, pass a single ISO 639-1 code and the server skips language identification entirely:
const result = await client.speechToText(file, {
  language: 'hi'  // ISO 639-1 code
});
Unsupported codes (ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he) and Arabic dialect tags (ar-eg, ar-lv, …) return an unsupported_language error. For non-MSA Arabic audio, pass ar for best-effort MSA transcription.
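
A request-side guard can fail fast on the codes listed above instead of waiting for the server error. A sketch under those rules; the helper name and error shape are assumptions, not SDK API:

```javascript
// Codes the docs list as returning unsupported_language.
const UNSUPPORTED = new Set(['ur', 'ja', 'ko', 'zh', 'th', 'vi', 'id', 'tl', 'sw', 'tr', 'fa', 'he']);

// Normalizes a language option: undefined/"auto" means auto-detect,
// Arabic dialect tags map to "ar" for best-effort MSA, unsupported codes throw.
function normalizeLanguage(code) {
  if (code === undefined || code === 'auto') return undefined; // server auto-detects
  const lower = code.toLowerCase();
  if (lower.startsWith('ar-')) return 'ar'; // ar-eg, ar-lv, … would otherwise error
  if (UNSUPPORTED.has(lower)) {
    throw new Error(`unsupported_language: ${code}`);
  }
  return lower;
}
```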

Get Supported Languages

const languages = await client.getLanguages();
languages.forEach(lang => {
  console.log(`${lang.name} (${lang.code})`);
});

Advanced Features

Word-Level Timestamps

Word timings are always included in the response — no flag needed:
const result = await client.speechToText(file);

// Word-level timing lives inside each segment
result.segments.forEach(segment => {
  segment.words.forEach(word => {
    console.log(`${word.word}: ${word.start}s - ${word.end}s (${word.confidence})`);
  });
});
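
Per-word confidence makes it easy to flag uncertain words for manual review. A sketch using the segment shape shown above; the helper and the 0.6 threshold are illustrative choices:

```javascript
// Collect words below a confidence threshold for manual review.
// Segment/word shape follows the response structure shown above.
function lowConfidenceWords(segments, threshold = 0.6) {
  return segments.flatMap(seg =>
    seg.words.filter(w => w.confidence < threshold)
  );
}
```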

Speaker Diarization

Identify different speakers with diarize: true:
const result = await client.speechToText(file, {
  diarize: true
});

// Each segment carries a `speakers` array when diarization is enabled.
// Raw labels are SPEAKER_00, SPEAKER_01, …
result.segments.forEach(segment => {
  const speaker = segment.speakers?.[0]?.speaker ?? 'unknown';
  console.log(`[${speaker}] ${segment.text}`);
});
Client applications typically re-label the raw SPEAKER_NN IDs as “Speaker 1”, “Speaker 2” in order of first appearance for readability.
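
The re-labeling described above can be done with a small map keyed on first appearance. A sketch; the helper name is illustrative:

```javascript
// Rename SPEAKER_00, SPEAKER_01, … to "Speaker 1", "Speaker 2"
// in order of first appearance across the transcript.
function relabelSpeakers(segments) {
  const names = new Map();
  return segments.map(seg => {
    const raw = seg.speakers?.[0]?.speaker;
    if (raw === undefined) return { ...seg, speaker: 'unknown' };
    if (!names.has(raw)) names.set(raw, `Speaker ${names.size + 1}`);
    return { ...seg, speaker: names.get(raw) };
  });
}
```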

Best Practices

  • Use high-quality recordings (16kHz+ sample rate)
  • Minimize background noise
  • Ensure clear speech
  • Avoid audio compression when possible
  • Specify the language when known
  • Use appropriate model for your use case
  • Provide clean audio without music
  • Split very long recordings
  • Keep files under 25MB
  • Use appropriate format (WAV for quality, MP3 for size)
  • Process in batches for multiple files — but stay under the 8-concurrent STT cap per user
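
The last point can be sketched as a small concurrency limiter; the 8-job cap comes from the docs, while the `mapWithLimit` helper itself is illustrative application code:

```javascript
// Run async jobs with at most `limit` in flight at once
// (e.g. limit = 8 to respect the per-user STT concurrency cap).
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

With the SDK this might look like `await mapWithLimit(files, 8, f => client.speechToText(f))`.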

Use Cases

Meeting Transcription

with open('meeting.mp3', 'rb') as audio:
    result = client.speech_to_text(
        audio,
        diarize=True,
    )

    # Generate meeting notes
    for segment in result['segments']:
        speaker = (segment.get('speakers') or [{}])[0].get('speaker', 'unknown')
        print(f"[{segment['start']:.1f}s] {speaker}: {segment['text']}")
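
For readable notes, consecutive segments from the same speaker can be merged into one block. A JavaScript sketch; the `mergeBySpeaker` helper is illustrative, and the segment shape follows the diarization examples above:

```javascript
// Merge consecutive segments spoken by the same speaker into one block.
function mergeBySpeaker(segments) {
  const merged = [];
  for (const seg of segments) {
    const speaker = seg.speakers?.[0]?.speaker ?? 'unknown';
    const last = merged[merged.length - 1];
    if (last && last.speaker === speaker) {
      last.text += ' ' + seg.text;
      last.end = seg.end;
    } else {
      merged.push({ speaker, text: seg.text, start: seg.start, end: seg.end });
    }
  }
  return merged;
}
```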

Voice Commands

async function processVoiceCommand(audioBlob) {
  const result = await client.speechToText(audioBlob, {
    language: 'en',
  });

  const command = parseCommand(result.text);
  executeCommand(command);
}
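
`parseCommand` above is application code, not part of the SDK. One possible keyword-based sketch; the intent table and matching rule are placeholder assumptions:

```javascript
// Naive intent matcher: first intent whose keyword appears in the transcript wins.
const INTENTS = [
  { intent: 'lights_on',  keywords: ['turn on the lights', 'lights on'] },
  { intent: 'lights_off', keywords: ['turn off the lights', 'lights off'] },
  { intent: 'play_music', keywords: ['play music', 'play some music'] },
];

function parseCommand(text) {
  const lower = text.toLowerCase();
  for (const { intent, keywords } of INTENTS) {
    if (keywords.some(k => lower.includes(k))) return { intent, text };
  }
  return { intent: 'unknown', text };
}
```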

Subtitle Generation

const result = await client.speechToText(videoAudio);

// Word timings live inside segments; flatten them before building cues
const words = result.segments.flatMap(seg => seg.words);
const subtitles = generateSRT(words);
saveFile('subtitles.srt', subtitles);
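
`generateSRT` is left to the application. A minimal sketch that groups words into fixed-size cues and formats SRT timestamps as `HH:MM:SS,mmm`; the 7-words-per-cue grouping is an arbitrary choice:

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const pad = (n, w) => String(n).padStart(w, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Group words into fixed-size cues and render an SRT document.
function generateSRT(words, wordsPerCue = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    cues.push(
      `${cues.length + 1}\n` +
      `${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}\n` +
      group.map(w => w.word).join(' ')
    );
  }
  return cues.join('\n\n') + '\n';
}
```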

API Reference

Speech to Text API

View complete API documentation