Speaker Diarization
Identify who spoke when in your transcriptions by attaching speaker labels to each segment.
Speaker diarization tags every transcript segment with a speaker label (e.g. SPEAKER_00) so you can tell who spoke when. It runs as an opt-in extension of the standard transcription workflow — see the Transcription guide for the base submit/poll/retrieve pattern.
When to use it
- Multi-speaker interviews — podcasts, radio shows, journalistic interviews
- Panel discussions and meetings — distinguishing speakers in roundtable audio
- Call recordings — separating agent and customer in support calls
- Off by default — diarization raises processing time to roughly 1.2–1.8× that of base transcription, so enable it only when you need speaker labels
1. Enable diarization
Add "diarize": true to a standard POST /transcribe request. Everything else about the workflow is unchanged:
curl -X POST "https://api.kiava.lesan.ai/transcribe" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_urls": ["https://example.com/interview.mp3"],
    "language": "am",
    "diarize": true
  }'
2. Tune with diarization_config
Pass an optional diarization_config object to pick a backend or bound the speaker count. All fields are optional; the defaults work for most audio.
curl -X POST "https://api.kiava.lesan.ai/transcribe" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_urls": ["https://example.com/panel.mp3"],
    "language": "am",
    "diarize": true,
    "diarization_config": {
      "backend": "pyannote",
      "min_speakers": 2,
      "max_speakers": 5
    }
  }'
For long-form audio or batch processing at scale, switch to the nemo backend:
curl -X POST "https://api.kiava.lesan.ai/transcribe" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_urls": ["https://example.com/lecture.mp3"],
    "language": "am",
    "diarize": true,
    "diarization_config": {
      "backend": "nemo"
    }
  }'
- backend — "pyannote" (default) for best accuracy on conversational audio and interviews, or "nemo" for long-form audio and batch processing at scale
- min_speakers — lower bound, integer 1–20. Omit (or null) to auto-detect
- max_speakers — upper bound, integer 1–20. Omit (or null) to auto-detect
- If both bounds are set, min_speakers must be ≤ max_speakers
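The field rules above can also be checked on the client before a job is submitted. The following is a minimal Python sketch, not an official SDK: build_transcribe_payload and VALID_BACKENDS are hypothetical names introduced here, and the checks only mirror the constraints listed in this guide, so the server's own validation may differ.

```python
import json
from typing import Any, Optional

# The two backends named in this guide; the API may accept others.
VALID_BACKENDS = {"pyannote", "nemo"}


def build_transcribe_payload(
    audio_urls: list[str],
    language: str,
    backend: Optional[str] = None,
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> dict[str, Any]:
    """Build a POST /transcribe body with diarization enabled.

    Hypothetical helper: it mirrors the documented rules only.
    """
    if backend is not None and backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(VALID_BACKENDS)}")
    bounds = (("min_speakers", min_speakers), ("max_speakers", max_speakers))
    for name, value in bounds:
        if value is not None and (not isinstance(value, int) or not 1 <= value <= 20):
            raise ValueError(f"{name} must be an integer from 1 to 20")
    if min_speakers is not None and max_speakers is not None:
        if min_speakers > max_speakers:
            raise ValueError("min_speakers must be <= max_speakers")

    payload: dict[str, Any] = {
        "audio_urls": audio_urls,
        "language": language,
        "diarize": True,
    }
    # Fields left as None are omitted so the API falls back to its defaults.
    config = {
        key: value
        for key, value in (("backend", backend), *bounds)
        if value is not None
    }
    if config:
        payload["diarization_config"] = config
    return payload


if __name__ == "__main__":
    body = build_transcribe_payload(
        ["https://example.com/panel.mp3"], "am",
        backend="pyannote", min_speakers=2, max_speakers=5,
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d body of the curl requests shown above. Rejecting bad bounds locally avoids a round trip for a request that would presumably be refused anyway.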
3. Read speaker labels from the transcript
Poll GET /transcribe/{job_id} and download the transcript URL as usual. When diarization was enabled, the transcript JSON adds a top-level speakers summary and a speaker field on every segment:
{
  "job_id": "90310bc7-62a2-45c7-b92c-91fc5ccf3bcc",
  "status": "COMPLETED",
  "duration": 608.81,
  "language": "am",
  "speakers": {
    "count": 2,
    "labels": ["SPEAKER_00", "SPEAKER_01"]
  },
  "segments": [
    {
      "id": 0,
      "start_time": "0.50",
      "end_time": "2.10",
      "type": "speech",
      "text": "ጤና ይስጥልኝ",
      "speaker": "SPEAKER_00"
    },
    {
      "id": 1,
      "start_time": "2.10",
      "end_time": "5.40",
      "type": "speech",
      "text": "እንደምን አደሩ",
      "speaker": "SPEAKER_01"
    },
    {
      "id": 2,
      "start_time": "5.40",
      "end_time": "7.00",
      "type": "music",
      "text": "",
      "speaker": null
    }
  ],
  "warnings": []
}
Speaker IDs are formatted SPEAKER_XX (zero-padded) and assigned in order of first appearance in the audio. Non-speech segments — music, noise, silence — carry speaker: null. When diarization is not requested, both the speakers summary and the per-segment speaker field are absent from the response.
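As an illustration of consuming these fields, the sketch below totals speaking time per label from a parsed transcript. It assumes the transcript JSON has already been downloaded and loaded into a dict. talk_time_by_speaker is a hypothetical helper name. Because timestamps appear as strings in the sample above, they are converted to floats before use.

```python
from collections import defaultdict


def talk_time_by_speaker(transcript: dict) -> dict[str, float]:
    """Sum segment durations, in seconds, for each speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for segment in transcript.get("segments", []):
        # .get() covers both a null speaker (music, noise, silence) and a
        # missing field (diarization not requested).
        speaker = segment.get("speaker")
        if speaker is None:
            continue
        start = float(segment["start_time"])
        end = float(segment["end_time"])
        totals[speaker] += end - start
    # Round to hide floating-point noise from the subtraction.
    return {label: round(seconds, 2) for label, seconds in totals.items()}


if __name__ == "__main__":
    sample = {
        "segments": [
            {"start_time": "0.50", "end_time": "2.10", "speaker": "SPEAKER_00"},
            {"start_time": "2.10", "end_time": "5.40", "speaker": "SPEAKER_01"},
            {"start_time": "5.40", "end_time": "7.00", "speaker": None},
        ]
    }
    print(talk_time_by_speaker(sample))
```

Reading the speaker field with .get() lets the same function run on transcripts produced without diarization, in which case it returns an empty dict.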
Graceful degradation
If diarization fails after the job is accepted (e.g. a backend timeout), the job still completes successfully with a plain transcript — the API does not mark it as failed. A message is appended to the top-level warnings array so your client can detect the degraded result:
{
  "job_id": "90310bc7-62a2-45c7-b92c-91fc5ccf3bcc",
  "status": "COMPLETED",
  "segments": [ /* ...transcript segments without speaker labels... */ ],
  "warnings": [
    "Diarization failed: pyannote timeout after 120s. Transcript returned without speaker labels."
  ]
}
Limits and performance
- Speaker count — 1 to 20 per file; audio with more distinct speakers may produce merged labels
- Processing overhead — roughly 1.2–1.8× the base transcription time, since diarization runs in parallel with ASR
- Scope — applies uniformly to all files in a batch; there is no per-file diarization toggle
- Languages — works with every supported ASR language (e.g. am, ti)
- File size and duration limits — same as base transcription (max 1 GB per file)
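Because a degraded job still reports COMPLETED (see Graceful degradation), a client that depends on speaker labels may want an explicit check. The sketch below is one possible approach, with diarization_outcome as a hypothetical helper. It assumes that failure messages contain the word "diarization", as the sample warning in this guide does. The exact warning format is not specified here, so treat the string match as a heuristic.

```python
def diarization_outcome(transcript: dict) -> str:
    """Classify a completed transcript as 'labeled', 'degraded', or 'unlabeled'."""
    # A top-level speakers summary is only present when diarization succeeded.
    if "speakers" in transcript:
        return "labeled"
    warnings = transcript.get("warnings", [])
    # Assumption: diarization failures mention "diarization" in the message.
    if any("diarization" in message.lower() for message in warnings):
        return "degraded"
    # No summary and no matching warning: diarization was likely not requested.
    return "unlabeled"


if __name__ == "__main__":
    degraded = {
        "status": "COMPLETED",
        "segments": [],
        "warnings": [
            "Diarization failed: pyannote timeout after 120s. "
            "Transcript returned without speaker labels."
        ],
    }
    print(diarization_outcome(degraded))
```

A caller could resubmit the job or fall back to an unlabeled view when the result is "degraded".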
For the complete request and response schemas, see the API Reference.