Version: 2026.07.0

Speaker Diarization

The Speaker Diarization technology segments a mono recording by speaker, assigning the same label to all segments spoken by the same voice. It is independent of language, domain, and channel. The output is a list of time segments with speaker labels. Besides speaker segmentation, it can also detect technical signals and silence.

Demonstration of Speaker Diarization

Typical use cases

Preprocessing audio for other speech recognition technologies.
Labeling parts of an utterance according to speakers.
Splitting mono phone call recordings into multiple channels.
Identifying how many speakers are present in the recording.

How does it work?

Speaker Diarization is based on the Speaker Identification technology. It consists of the following steps:

Filtering out segments of silence and technical signals (Voice Activity Detection).
Splitting the voice segments into small chunks.
Creating a voiceprint for each chunk.
Computing distance between the voiceprints of all the chunks.
Creating groups of chunks according to distance using a clustering algorithm.
Completing and smoothing out the final segmentation using a variational Bayes algorithm.

See the following image for better understanding:

Speaker Diarization workflow
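
Steps 2 to 4 above can be illustrated with a minimal sketch. It is not the Phonexia implementation: the 1.5-second chunk length, the use of cosine distance, and the two-dimensional toy vectors standing in for voiceprints are all assumptions made for illustration, and the function names are hypothetical. Real voiceprints are produced by a trained model that is not shown here.

```python
import math


def split_into_chunks(segments, chunk_len=1.5):
    """Cut (start, end) voice segments into chunks of at most chunk_len seconds."""
    chunks = []
    for start, end in segments:
        t = start
        while t < end:
            chunks.append((t, min(t + chunk_len, end)))
            t += chunk_len
    return chunks


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(voiceprints):
    """Pairwise distances between all chunk voiceprints."""
    return [[cosine_distance(a, b) for b in voiceprints] for a in voiceprints]


# Voice segments as they might come out of Voice Activity Detection (seconds).
voice_segments = [(0.0, 4.0), (6.0, 7.0)]
chunks = split_into_chunks(voice_segments)

# Toy stand-ins for voiceprints, one per chunk (assumed values).
toy_voiceprints = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
distances = distance_matrix(toy_voiceprints)
```

In this toy data, chunks 0 and 1 are close to each other and far from chunks 2 and 3, which is the structure the clustering step then exploits.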

Clustering algorithm

An important step in Speaker Diarization is voiceprint clustering. There are many algorithms for clustering, each with its own advantages and disadvantages. In the Phonexia Speaker Diarization XL5 model, we use the Agglomerative Hierarchical Clustering algorithm. This is a bottom-up approach to clustering data points. It starts with each data point as its own cluster and iteratively merges clusters based on a similarity metric until all data points belong to a single cluster or a stopping criterion is met.
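
The bottom-up merging described above can be sketched in a few lines. This is a generic illustration, not the XL5 code: average linkage and a fixed distance threshold as the stopping criterion are assumptions, and the function name is hypothetical.

```python
def agglomerative_clustering(dist, threshold):
    """Average-linkage agglomerative clustering over a precomputed distance matrix.

    Starts with one cluster per point and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `threshold`.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance between all cross-cluster pairs of points.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break  # stopping criterion met
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Four chunks: 0 and 1 are close, 2 and 3 are close, the two pairs are far apart.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
speakers = agglomerative_clustering(toy_dist, threshold=0.5)
# speakers == [[0, 1], [2, 3]]
```

With a threshold of 1.0 instead of 0.5, no merge is ever rejected and all four chunks end up in a single cluster, which corresponds to the "all data points belong to a single cluster" end state.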

Variational Bayes algorithm

The Variational Bayes (VB) algorithm for Speaker Diarization is a probabilistic method used to partition an audio recording into segments, each associated with a single speaker. Based on Bayesian inference, it seeks to estimate the posterior distribution over speaker identities for each time frame in the audio. The Variational Bayes algorithm offers a flexible and robust framework for Speaker Diarization by combining probabilistic modeling with Bayesian inference techniques, allowing for accurate segmentation of audio recordings with multiple speakers.

Scoring

The most common metric for evaluating the performance of Speaker Diarization systems is Diarization Error Rate (DER). It measures the overall error rate by calculating three types of errors:

Missed speech segments.
False alarms.
Speaker confusion errors.

Missed Speech Segments: These are segments of speech in the reference (ground truth) segmentation that are not correctly detected by the diarization system. In other words, if a speaker's speech segment in the reference is not assigned to any speaker, it results in a missed speech segment error.

False Alarms: These are segments of speech detected by the diarization system that do not correspond to any speaker in the reference segmentation. In other words, if the diarization system falsely assigns speech to a speaker where there should be none according to the reference, it results in a false alarm error.

Speaker Confusion Errors: These occur when the diarization system incorrectly assigns speech segments to the wrong speaker. For example, if a speech segment from one speaker is incorrectly labeled as belonging to another speaker by the diarization system, it results in a speaker confusion error.

The Diarization Error Rate (DER) is then calculated as the sum of the durations of these three error types, normalized by the total duration of speech in the reference segmentation and expressed as a percentage:

\mathrm{DER} = \frac{\mathrm{Missed} + \mathrm{FalseAlarms} + \mathrm{Confusion}}{\mathrm{ReferenceDuration}} \times 100

A lower DER indicates better performance, as it reflects fewer errors in the diarization output with respect to the reference segmentation.
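
As a worked example of the formula, the following sketch computes DER from error durations given in seconds. The function name and the sample numbers are hypothetical; in practice, the three error durations come from aligning the system output with the reference segmentation, which is not shown here.

```python
def diarization_error_rate(missed, false_alarm, confusion, reference_duration):
    """Return DER in percent; all arguments are durations in seconds."""
    if reference_duration <= 0:
        raise ValueError("reference_duration must be positive")
    return (missed + false_alarm + confusion) / reference_duration * 100.0


# Assumed example: 100 s of reference speech with 3 s missed,
# 2 s of false alarms, and 5 s attributed to the wrong speaker.
der = diarization_error_rate(
    missed=3.0, false_alarm=2.0, confusion=5.0, reference_duration=100.0
)
# der == 10.0
```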

Note

DER is not the only metric used to measure diarization accuracy. Another common metric is the Jaccard Error Rate (JER), which is based on the Jaccard similarity index.

Number of speakers

By default, Phonexia Diarization automatically detects the number of speakers. However, accuracy can be improved by supplying the known number of speakers, which was not possible in the previous version, XL4. If you know that the input recording contains at most N speakers, specify this using the max_speakers input parameter; if you know it contains exactly N speakers, use total_speakers. This feature helps most in challenging cases.
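
The intended effect of the two parameters on the final speaker count can be sketched as follows. This is only an illustration of the semantics described above, not the product's internal logic: the helper name is hypothetical, and letting total_speakers take precedence when both values are given is an assumption of this sketch.

```python
def resolve_speaker_count(estimated, max_speakers=None, total_speakers=None):
    """Combine an automatic estimate with optional user-supplied constraints."""
    for value in (max_speakers, total_speakers):
        if value is not None and value < 1:
            raise ValueError("speaker counts must be at least 1")
    if total_speakers is not None:
        # Exact count known: it overrides the automatic estimate.
        return total_speakers
    if max_speakers is not None:
        # Upper bound known: cap the automatic estimate.
        return min(estimated, max_speakers)
    return estimated


# The automatic estimate found 5 speakers, but at most 3 are expected.
capped = resolve_speaker_count(5, max_speakers=3)
# capped == 3
```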

FAQ

Why is the number of speakers detected not precise enough?

If the number of speakers found is too high, Speaker Diarization may be over-sensitive to some acoustic inputs.

If you know how many speakers are in the audio, set the Total Number of Speakers to this number. If you know only an upper bound on the number of speakers, use the Maximum Number of Speakers input instead. This should help you avoid falsely detected speakers.

If the number is too low, the cause may be television, radio, or YouTube data, where fast or unnatural speech mixed with music might be evaluated as a single speaker. Please note that diarization is designed to work best on natural, spontaneous conversations such as telephone calls.

What is the processing speed of the Speaker Diarization?

On modern CPUs, the processing speed is approximately 20× faster than real-time, meaning 20 seconds of audio can be processed in just 1 second. This speed may vary depending on the amount of speech in the audio.
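
For rough capacity planning, the stated speed translates into processing time as in the sketch below. The helper name is hypothetical, and the default factor of 20 is simply the approximate figure quoted above, so actual times will vary with hardware and the amount of speech.

```python
def estimated_processing_seconds(audio_seconds, speed_factor=20.0):
    """Estimate processing time for audio processed speed_factor times faster than real time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


# One hour of audio at roughly 20x real time.
one_hour = estimated_processing_seconds(3600.0)
# one_hour == 180.0 seconds, i.e. about 3 minutes
```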

What can I do to improve processing speed?

If you're using media files as input, the system first extracts a voiceprint, which is the most time-consuming part of the process. To speed things up, we recommend running the technology on a GPU if possible.
