Speaker Diarization

Identify who spoke when in your transcriptions by attaching speaker labels to each segment.

Speaker diarization tags every transcript segment with a speaker label (e.g. SPEAKER_00) so you can tell who spoke when. It runs as an opt-in extension of the standard transcription workflow — see the Transcription guide for the base submit/poll/retrieve pattern.

When to use it

  • Multi-speaker interviews — podcasts, radio shows, journalistic interviews
  • Panel discussions and meetings — distinguishing speakers in roundtable audio
  • Call recordings — separating agent and customer in support calls
  • Off by default — diarization raises processing time to roughly 1.2–1.8× that of plain transcription, so enable it only when you need speaker labels

1. Enable diarization

Add "diarize": true to a standard POST /transcribe request. Everything else about the workflow is unchanged:

curl -X POST "https://api.kiava.lesan.ai/transcribe" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_urls": ["https://example.com/interview.mp3"],
    "language": "am",
    "diarize": true
  }'
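The same request can also be prepared from Python. The following is a minimal sketch using only the standard library. The helper name build_transcribe_request is hypothetical; the endpoint, header, and body fields are the ones shown in the curl example above. It builds the request without sending it, so the payload can be inspected first.

```python
import json
import urllib.request

API_URL = "https://api.kiava.lesan.ai/transcribe"  # endpoint from the curl example


def build_transcribe_request(api_key, audio_urls, language, diarize=False):
    """Build (but do not send) a POST /transcribe request.

    Hypothetical helper: the name and signature are illustrative only.
    """
    payload = {"audio_urls": list(audio_urls), "language": language}
    if diarize:
        # Diarization is off by default, so the flag is only sent when enabled.
        payload["diarize"] = True
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_transcribe_request(
    "YOUR_API_KEY", ["https://example.com/interview.mp3"], "am", diarize=True
)
body = json.loads(req.data)
print(body)
```

To send it, pass req to urllib.request.urlopen and handle the response as described in the Transcription guide.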

2. Tune with diarization_config

Pass an optional diarization_config object to pick a backend or bound the speaker count. All fields are optional; the defaults work for most audio.

curl -X POST "https://api.kiava.lesan.ai/transcribe" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_urls": ["https://example.com/panel.mp3"],
    "language": "am",
    "diarize": true,
    "diarization_config": {
      "backend": "pyannote",
      "min_speakers": 2,
      "max_speakers": 5
    }
  }'

For long-form audio or batch processing at scale, switch to the nemo backend:

curl -X POST "https://api.kiava.lesan.ai/transcribe" \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_urls": ["https://example.com/lecture.mp3"],
    "language": "am",
    "diarize": true,
    "diarization_config": {
      "backend": "nemo"
    }
  }'

The diarization_config fields:

  • backend — "pyannote" (default) for best accuracy on conversational audio and interviews, or "nemo" for long-form audio and batch processing at scale
  • min_speakers — lower bound, integer 1–20. Omit or set to null to auto-detect
  • max_speakers — upper bound, integer 1–20. Omit or set to null to auto-detect
  • If both bounds are set, min_speakers must be ≤ max_speakers
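The field rules above can be checked on the client before a job is submitted. This is a minimal sketch, not the server's validation logic: the function name validate_diarization_config is hypothetical, and it encodes only the constraints listed above (two known backends, bounds of 1 to 20, and min ≤ max).

```python
ALLOWED_BACKENDS = ("pyannote", "nemo")


def validate_diarization_config(config):
    """Return a list of problems found in a diarization_config dict.

    Hypothetical client-side check; the server remains the source of truth.
    """
    problems = []
    backend = config.get("backend", "pyannote")  # "pyannote" is the documented default
    if backend not in ALLOWED_BACKENDS:
        problems.append(f"unknown backend: {backend!r}")

    bounds = {}
    for field in ("min_speakers", "max_speakers"):
        value = config.get(field)
        if value is None:
            continue  # omitted or null means auto-detect
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 20:
            problems.append(f"{field} must be an integer from 1 to 20")
        else:
            bounds[field] = value

    if len(bounds) == 2 and bounds["min_speakers"] > bounds["max_speakers"]:
        problems.append("min_speakers must be <= max_speakers")
    return problems


ok = validate_diarization_config(
    {"backend": "pyannote", "min_speakers": 2, "max_speakers": 5}
)
bad = validate_diarization_config({"min_speakers": 6, "max_speakers": 3})
print(ok)
print(bad)
```

An empty list means the config satisfies the documented constraints; anything else can be reported to the user before the request is made.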

3. Read speaker labels from the transcript

Poll GET /transcribe/{job_id} and download the transcript URL as usual. When diarization was enabled, the transcript JSON adds a top-level speakers summary and a speaker field on every segment:

{
  "job_id": "90310bc7-62a2-45c7-b92c-91fc5ccf3bcc",
  "status": "COMPLETED",
  "duration": 608.81,
  "language": "am",
  "speakers": {
    "count": 2,
    "labels": ["SPEAKER_00", "SPEAKER_01"]
  },
  "segments": [
    {
      "id": 0,
      "start_time": "0.50",
      "end_time": "2.10",
      "type": "speech",
      "text": "ጤና ይስጥልኝ",
      "speaker": "SPEAKER_00"
    },
    {
      "id": 1,
      "start_time": "2.10",
      "end_time": "5.40",
      "type": "speech",
      "text": "እንደምን አደሩ",
      "speaker": "SPEAKER_01"
    },
    {
      "id": 2,
      "start_time": "5.40",
      "end_time": "7.00",
      "type": "music",
      "text": "",
      "speaker": null
    }
  ],
  "warnings": []
}

Speaker IDs are formatted SPEAKER_XX (zero-padded) and assigned in order of first appearance in the audio. Non-speech segments — music, noise, silence — carry speaker: null. When diarization is not requested, both the speakers summary and the per-segment speaker field are absent from the response.
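As an illustration of consuming these fields, the sketch below totals speaking time per speaker from a downloaded transcript. The function name talk_time_by_speaker is hypothetical, and the input is a shortened copy of the sample response above. Note that start_time and end_time are strings in that sample, so they are converted to floats before subtracting.

```python
from collections import defaultdict


def talk_time_by_speaker(transcript):
    """Sum segment duration in seconds per speaker label."""
    totals = defaultdict(float)
    for segment in transcript.get("segments", []):
        speaker = segment.get("speaker")
        if speaker is None:
            continue  # non-speech segment, or diarization was not requested
        # Timestamps are strings in the sample response, so convert first.
        totals[speaker] += float(segment["end_time"]) - float(segment["start_time"])
    return {speaker: round(seconds, 2) for speaker, seconds in totals.items()}


transcript = {
    "speakers": {"count": 2, "labels": ["SPEAKER_00", "SPEAKER_01"]},
    "segments": [
        {"id": 0, "start_time": "0.50", "end_time": "2.10", "type": "speech",
         "text": "ጤና ይስጥልኝ", "speaker": "SPEAKER_00"},
        {"id": 1, "start_time": "2.10", "end_time": "5.40", "type": "speech",
         "text": "እንደምን አደሩ", "speaker": "SPEAKER_01"},
        {"id": 2, "start_time": "5.40", "end_time": "7.00", "type": "music",
         "text": "", "speaker": None},
    ],
}
totals = talk_time_by_speaker(transcript)
print(totals)
```

Because the function uses segment.get("speaker"), it also works on transcripts produced without diarization, where the field is absent, and simply returns an empty result.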

Graceful degradation

If diarization fails after the job is accepted (e.g. a backend timeout), the job still completes successfully with a plain transcript — the API does not mark it as failed. A message is appended to the top-level warnings array so your client can detect the degraded result:

{
  "job_id": "90310bc7-62a2-45c7-b92c-91fc5ccf3bcc",
  "status": "COMPLETED",
  "segments": [ /* ...transcript segments without speaker labels... */ ],
  "warnings": [
    "Diarization failed: pyannote timeout after 120s. Transcript returned without speaker labels."
  ]
}
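A client can detect this case without parsing the warning text. The sketch below assumes that a degraded transcript omits the top-level speakers summary, as in the example above; the function name diarization_degraded is hypothetical. The exact warning wording is not treated as stable and is only passed through for logging.

```python
def diarization_degraded(transcript, diarize_requested=True):
    """Return True when speaker labels were requested but are missing.

    Hypothetical helper. Assumes the speakers summary is present only
    when diarization actually produced labels.
    """
    return bool(diarize_requested) and "speakers" not in transcript


degraded = {
    "job_id": "90310bc7-62a2-45c7-b92c-91fc5ccf3bcc",
    "status": "COMPLETED",
    "segments": [],
    "warnings": [
        "Diarization failed: pyannote timeout after 120s. "
        "Transcript returned without speaker labels."
    ],
}

if diarization_degraded(degraded):
    # Surface the server's messages so the degraded result is visible.
    for message in degraded.get("warnings", []):
        print("warning:", message)
```

Whether to retry the job or accept the unlabeled transcript is an application decision; the check only tells you which of the two result shapes you received.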

Limits and performance

  • Speaker count — 1 to 20 per file; audio with more distinct speakers may produce merged labels
  • Processing overhead — total processing time is roughly 1.2–1.8× the base transcription time, even with diarization running in parallel with ASR
  • Scope — applies uniformly to all files in a batch; there is no per-file diarization toggle
  • Languages — works with every supported ASR language (e.g. am, ti)
  • File size and duration limits — same as base transcription (max 1 GB per file)
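Because the diarization setting applies to a whole batch, files that need speaker labels and files that do not have to go into separate requests. One way to arrange that is sketched below; the helper name split_by_diarization and its input shape are hypothetical, while the body fields match the request examples earlier in this guide.

```python
def split_by_diarization(files, language):
    """Build one request body per diarization setting.

    `files` maps each audio URL to True (needs speaker labels) or False.
    Hypothetical helper; only documented request fields are emitted.
    """
    with_labels = [url for url, wanted in files.items() if wanted]
    without_labels = [url for url, wanted in files.items() if not wanted]
    bodies = []
    if with_labels:
        bodies.append({"audio_urls": with_labels, "language": language, "diarize": True})
    if without_labels:
        # Diarization is off by default, so the flag is simply left out.
        bodies.append({"audio_urls": without_labels, "language": language})
    return bodies


bodies = split_by_diarization(
    {
        "https://example.com/interview.mp3": True,
        "https://example.com/lecture.mp3": False,
        "https://example.com/panel.mp3": True,
    },
    language="am",
)
for body in bodies:
    print(body)
```

Each returned body is submitted as its own POST /transcribe request, so only the files that need labels pay the extra processing time.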

For the complete request and response schemas, see the API Reference.