Skip to main content
Version: 2.0.0

Speaker Diarization

The Speaker Diarization technology labels segments of the same voice(s) in a mono recording based on the individual speakers' voices. The technology is independent of language, domain and channel. The output of the technology is a list of time segments with speaker labels. It not only performs speaker segmentation, but can also detect technical signals or silence.

Demonstration of Speaker Diarization

Typical use cases

  • Preprocessing for other speech recognition technologies,
  • Labeling of parts of an utterance according to speakers,
  • Splitting mono phone call recordings into multiple channels,
  • Identifying how many speakers are speaking in the recording.

How does it work?

Speaker Diarization is based on the Speaker Identification technology. It consists of the following steps:

  • Filter out segments of silence and technical signals (Voice Activity Detection).
  • Split the voice segments into small chunks.
  • Create a voiceprint for each chunk.
  • Compute distance between the voiceprints of all the chunks.
  • Create groups of chunks according to distance using a clustering algorithm.
  • Complete and smooth out the final segmentation using a variational Bayes algorithm.

See the following image for better understanding:

Speaker Diarization workflow

Clustering algorithm

An important step in Speaker Diarization is voiceprint clustering. There are many algorithms for clustering and each one has its advantages and disadvantages. In the Phonexia Speaker Diarization XL5 model, we use the Agglomerative Hierarchial Clustering algorithm. It is a bottom-up approach to clustering data points. It starts with each data point as its own cluster and iteratively merges clusters based on a similarity metric until all data points belong to a single cluster or a stopping criterion is met.

Variation Bayes algorithm

The Variational Bayes (VB) algorithm for Speaker Diarization is a probabilistic method used to partition an audio recording into segments, each associated with a single speaker. It's based on Bayesian inference and seeks to estimate the posterior distribution over speaker identities for each time frame in the audio. The Variational Bayes algorithm offers a flexible and robust framework for Speaker Diarization by combining probabilistic modeling with Bayesian inference techniques, allowing for accurate segmentation of audio recordings with multiple speakers.

Scoring

The most common metric for evaluating the performance of Speaker Diarization systems is Diarization Error Rate (DER). It measures the overall error rate by calculating three types of errors:

  1. Missed speech segments,
  2. False alarms,
  3. Speaker confusion errors.

Missed Speech Segments: These are segments of speech in the reference (ground truth) segmentation that are not correctly detected by the diarization system. In other words, if a speaker's speech segment in the reference is not assigned to any speaker, it results in a missed speech segment error.

False Alarms: These are segments of speech detected by the diarization system that do not correspond to any speaker in the reference segmentation. In other words, if the diarization system falsely assigns speech to a speaker where there should be none according to the reference, it results in a false alarm error.

Speaker Confusion Errors: These occur when the diarization system incorrectly assigns speech segments to the wrong speaker. For example, if a speech segment from one speaker is incorrectly labeled as belonging to another speaker by the diarization system, it results in a speaker confusion error.

The diarization Error Rate (DER) is then calculated as the sum of these three error types, normalized by the total duration of the reference segmentation:

DER=Missed+FalseAlarms+ConfusionReferenceDuration100DER=\frac{Missed + FalseAlarms + Confusion}{ReferenceDuration} * 100

A lower DER indicates better performance, as it reflects fewer errors in the diarization output with respect to the reference segmentation.

note

The DER metric is not the only metric used to measure diarization accuracy. The other common metric is the Jaccard Error Rate (JER) which is based on the Jaccard similarity index.

Number of speakers

By default Phonexia Diarization automatically detects the number of speakers. But there is also the possibility to increase the accuracy of the system by providing information about the actual number of speakers at the input. For example, if users know that the number of speakers in the input recording is not more than N or the number is exactly N, they can specify this with the max_speakers or total_speakers input parameters.

FAQ

Why does the diarization system detect more speakers than there actually are?

This can happen for several reasons. Here are the most common:

  • The recording contains segments in which two or more speakers speak at the same time (crosstalks).
  • The audio channel of a single speaker changes over time.
tip

If you know the exact number of speakers in the recording, you can specify this with the total_speakers input parameter. If you don't know the exactu number of speakers but you know the maximum possible number of speakers, you can specify this by using the max_speakers parameter.

What is the typical DER for state-of-the-art diarization systems?

The accuracy of current systems is typically less than 10% DER, but depends on many factors such as number of speakers, number of crosstalks, etc.

What is the speed of the Speaker Diarization technology?

The speed on modern CPUs is about 20 times faster than real time (20 seconds of audio is processed in 1 second), but it depends on the amount of speech in the audio.