Documentation Index

Fetch the complete documentation index at: https://docs.60db.ai/llms.txt

Use this file to discover all available pages before exploring further.

Overview

60db’s Speech-to-Text (STT) API converts spoken audio into written text with high accuracy across 39 languages, including code-switched Indic+English. It is powered by 60db STT v01, a multi-backend speech recognition stack built around non-hallucinating models that do not invent text on silent or noisy input.

Features

Multi-Language

39 languages with auto-detection and Indic+English code-switching

Speaker Diarization

Opt-in pyannote speaker diarization via diarize: true

Timestamps

Word-level timestamps included automatically

Non-hallucinating

Non-hallucinating backend that emits blank tokens on silence — no phantom text

Basic Usage

import { SixtyDBClient } from '60db';

const client = new SixtyDBClient('your-api-key');

const file = document.querySelector('input[type="file"]').files[0];

const result = await client.speechToText(file, {
  language: 'en'
});

console.log('Transcription:', result.text);
console.log('Confidence:', result.confidence);

Supported Formats

Format | Max Size | Max Duration | Quality
MP3    | 25MB     | 10 min       | Good
WAV    | 25MB     | 10 min       | Excellent
FLAC   | 25MB     | 10 min       | Lossless
OGG    | 25MB     | 10 min       | Good
M4A    | 25MB     | 10 min       | Good
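
The limits above can be enforced client-side before upload to avoid a wasted round trip. A minimal sketch; the `validateAudioFile` helper and its error messages are illustrative, not part of the SDK:

```javascript
// Hypothetical pre-upload check mirroring the documented limits.
const SUPPORTED_FORMATS = new Set(['mp3', 'wav', 'flac', 'ogg', 'm4a']);
const MAX_SIZE_BYTES = 25 * 1024 * 1024; // 25MB
const MAX_DURATION_SECONDS = 10 * 60;    // 10 min

function validateAudioFile({ name, sizeBytes, durationSeconds }) {
  const ext = name.split('.').pop().toLowerCase();
  if (!SUPPORTED_FORMATS.has(ext)) {
    return { ok: false, reason: `unsupported format: ${ext}` };
  }
  if (sizeBytes > MAX_SIZE_BYTES) {
    return { ok: false, reason: 'file exceeds 25MB' };
  }
  if (durationSeconds > MAX_DURATION_SECONDS) {
    return { ok: false, reason: 'audio exceeds 10 minutes' };
  }
  return { ok: true };
}
```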

Language Support

Auto-Detection

Let the API automatically identify the language. Omit the language field (or pass the string "auto") and the server runs language identification across all 39 supported languages:
// Simplest form — omit language entirely
const result = await client.speechToText(file);

// Equivalent — the REST endpoint treats "auto" as omission.
// Note: on the WebSocket streaming endpoint you must use `languages: null`,
// not the literal string "auto".
const result2 = await client.speechToText(file, { language: 'auto' });

console.log('Detected language:', result.language);

Specify Language

For lowest latency, pass a single ISO 639-1 code and the server skips language identification entirely:
const result = await client.speechToText(file, {
  language: 'hi'  // ISO 639-1 code
});
Unsupported codes (ur, ja, ko, zh, th, vi, id, tl, sw, tr, fa, he) and Arabic dialect tags (ar-eg, ar-lv, …) return an unsupported_language error. For non-MSA Arabic audio, pass ar for best-effort MSA transcription.
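
A request-side guard can fail fast on the codes listed above instead of waiting for the server error. A sketch under those rules; the helper name and error shape are assumptions, not SDK API:

```javascript
// Codes the docs list as returning unsupported_language.
const UNSUPPORTED = new Set(['ur', 'ja', 'ko', 'zh', 'th', 'vi', 'id', 'tl', 'sw', 'tr', 'fa', 'he']);

// Normalizes a language option: undefined/"auto" means auto-detect,
// Arabic dialect tags map to "ar" for best-effort MSA, unsupported codes throw.
function normalizeLanguage(code) {
  if (code === undefined || code === 'auto') return undefined; // server auto-detects
  const lower = code.toLowerCase();
  if (lower.startsWith('ar-')) return 'ar'; // ar-eg, ar-lv, … would otherwise error
  if (UNSUPPORTED.has(lower)) {
    throw new Error(`unsupported_language: ${code}`);
  }
  return lower;
}
```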

Get Supported Languages

const languages = await client.getLanguages();
languages.forEach(lang => {
  console.log(`${lang.name} (${lang.code})`);
});

Advanced Features

Word-Level Timestamps

Word timings are always included in the response — no flag needed:
const result = await client.speechToText(file);

// Word-level timing lives inside each segment
result.segments.forEach(segment => {
  segment.words.forEach(word => {
    console.log(`${word.word}: ${word.start}s - ${word.end}s (${word.confidence})`);
  });
});
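
Per-word confidence makes it easy to flag uncertain words for manual review. A sketch using the segment shape shown above; the helper and the 0.6 threshold are illustrative choices:

```javascript
// Collect words below a confidence threshold for manual review.
// Segment/word shape follows the response structure shown above.
function lowConfidenceWords(segments, threshold = 0.6) {
  return segments.flatMap(seg =>
    seg.words.filter(w => w.confidence < threshold)
  );
}
```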

Speaker Diarization

Identify different speakers with diarize: true:
const result = await client.speechToText(file, {
  diarize: true
});

// Each segment carries a `speakers` array when diarization is enabled.
// Raw labels are SPEAKER_00, SPEAKER_01, …
result.segments.forEach(segment => {
  const speaker = segment.speakers?.[0]?.speaker ?? 'unknown';
  console.log(`[${speaker}] ${segment.text}`);
});
Client applications typically re-label the raw SPEAKER_NN IDs as “Speaker 1”, “Speaker 2” in order of first appearance for readability.
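
The re-labeling described above can be done with a small map keyed on first appearance. A sketch; the helper name is illustrative:

```javascript
// Rename SPEAKER_00, SPEAKER_01, … to "Speaker 1", "Speaker 2"
// in order of first appearance across the transcript.
function relabelSpeakers(segments) {
  const names = new Map();
  return segments.map(seg => {
    const raw = seg.speakers?.[0]?.speaker;
    if (raw === undefined) return { ...seg, speaker: 'unknown' };
    if (!names.has(raw)) names.set(raw, `Speaker ${names.size + 1}`);
    return { ...seg, speaker: names.get(raw) };
  });
}
```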

Best Practices

  • Use high-quality recordings (16kHz+ sample rate)
  • Minimize background noise
  • Ensure clear speech
  • Avoid audio compression when possible
  • Specify the language when known
  • Use appropriate model for your use case
  • Provide clean audio without music
  • Split very long recordings
  • Keep files under 25MB
  • Use appropriate format (WAV for quality, MP3 for size)
  • Process in batches for multiple files — but stay under the 8-concurrent STT cap per user
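
The last point can be sketched as a small concurrency limiter; the 8-job cap comes from the docs, while the `mapWithLimit` helper itself is illustrative application code:

```javascript
// Run async jobs with at most `limit` in flight at once
// (e.g. limit = 8 to respect the per-user STT concurrency cap).
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function runner() {
    while (next < items.length) {
      const i = next++; // claim the next index before awaiting
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

With the SDK this might look like `await mapWithLimit(files, 8, f => client.speechToText(f))`.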

Use Cases

Meeting Transcription

with open('meeting.mp3', 'rb') as audio:
    result = client.speech_to_text(
        audio,
        diarize=True,
    )

    # Generate meeting notes
    for segment in result['segments']:
        speaker = (segment.get('speakers') or [{}])[0].get('speaker', 'unknown')
        print(f"[{segment['start']:.1f}s] {speaker}: {segment['text']}")
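
For readable notes, consecutive segments from the same speaker can be merged into one block. A JavaScript sketch; the `mergeBySpeaker` helper is illustrative, and the segment shape follows the diarization examples above:

```javascript
// Merge consecutive segments spoken by the same speaker into one block.
function mergeBySpeaker(segments) {
  const merged = [];
  for (const seg of segments) {
    const speaker = seg.speakers?.[0]?.speaker ?? 'unknown';
    const last = merged[merged.length - 1];
    if (last && last.speaker === speaker) {
      last.text += ' ' + seg.text;
      last.end = seg.end;
    } else {
      merged.push({ speaker, text: seg.text, start: seg.start, end: seg.end });
    }
  }
  return merged;
}
```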

Voice Commands

async function processVoiceCommand(audioBlob) {
  const result = await client.speechToText(audioBlob, {
    language: 'en',
  });

  const command = parseCommand(result.text);
  executeCommand(command);
}
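
`parseCommand` above is application code, not part of the SDK. One possible keyword-based sketch; the intent table and matching rule are placeholder assumptions:

```javascript
// Naive intent matcher: first intent whose keyword appears in the transcript wins.
const INTENTS = [
  { intent: 'lights_on',  keywords: ['turn on the lights', 'lights on'] },
  { intent: 'lights_off', keywords: ['turn off the lights', 'lights off'] },
  { intent: 'play_music', keywords: ['play music', 'play some music'] },
];

function parseCommand(text) {
  const lower = text.toLowerCase();
  for (const { intent, keywords } of INTENTS) {
    if (keywords.some(k => lower.includes(k))) return { intent, text };
  }
  return { intent: 'unknown', text };
}
```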

Subtitle Generation

const result = await client.speechToText(videoAudio);

// Word timings live inside segments; flatten them before building cues
const words = result.segments.flatMap(seg => seg.words);
const subtitles = generateSRT(words);
saveFile('subtitles.srt', subtitles);
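
`generateSRT` is left to the application. A minimal sketch that groups words into fixed-size cues and formats SRT timestamps as `HH:MM:SS,mmm`; the 7-words-per-cue grouping is an arbitrary choice:

```javascript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds) {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  const pad = (n, w) => String(n).padStart(w, '0');
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// Group words into fixed-size cues and render an SRT document.
function generateSRT(words, wordsPerCue = 7) {
  const cues = [];
  for (let i = 0; i < words.length; i += wordsPerCue) {
    const group = words.slice(i, i + wordsPerCue);
    cues.push(
      `${cues.length + 1}\n` +
      `${srtTime(group[0].start)} --> ${srtTime(group[group.length - 1].end)}\n` +
      group.map(w => w.word).join(' ')
    );
  }
  return cues.join('\n\n') + '\n';
}
```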

API Reference

Speech to Text API

View complete API documentation