Version: 4.0.0

Speaker Diarization

The Phonexia Speaker Diarization technology enables users to distinguish between different speakers present in each channel of a recording, whether it's mono or stereo, by providing precise timestamps indicating when each speaker is active. This feature allows users to isolate and listen to individual speakers or further process specific speakers using other technologies.

Additionally, if the total number of speakers in the recording is unknown, the Speaker Diarization technology can automatically detect and provide this information.

This page explains how to use Phonexia Speaker Diarization in our web application. If you want to dive deeper into the inner workings of this technology, check out our detailed technical documentation.

Number of speakers

In the first dropdown menu, you can provide additional information to help the system generate more accurate results. Please note that only one option can be selected.

Choose the Automatic option if you don't have specific details about the number of speakers in the audio.
If you know the exact number of speakers, select Total number of speakers and enter the value. The system will then isolate each speaker’s segments and provide the corresponding timestamps.

Warning
Use this parameter only for mono channel recordings and only if you are absolutely certain of the exact number. The Speaker Diarization technology will strictly adhere to this input, treating it as the definitive count of speakers in the audio. It will not attempt to verify this number, so ensure its accuracy before proceeding.
If you're unsure about the exact number of speakers but can estimate the maximum number that could be present in the audio, enter that number into Max number of speakers to help the system deliver more precise results.

Uploading files

Upload your files or create your own recordings by using the built-in recording feature. If you don't have your own files, you can use the provided Phonexia examples to explore how Speaker Diarization works.

Results

After uploading, your recordings will appear in the left panel. Once processing is complete, the results for each recording will be displayed in the right panel. In the upper right corner, you'll see the total number of speakers found in the recording (whether it's mono or stereo). Below the main player, you'll find multiple waveforms, each representing an individual speaker.

Further actions

After reviewing your results in the right panel, you can perform several actions:

Mute individual speakers or entire channels to play only the selected speakers or channels.
Download the recording with only the selected speakers included.

Export formats

Once your results are ready, you can export them in various formats.

Speaker Diarization results can be exported for each file individually or in bulk as a ZIP file. The available formats are .CSV, .XLSX, and JSON. Each exported file is named after the corresponding audio file and contains key information such as speaker IDs, the channel number, and the start and end times of individual segments.

XLSX format

The .XLSX format provides a clear, comprehensive, and human-readable overview of the metadata. Timestamps in this format are shown as HH:MM:SS.

[Image: table showing the speaker ID, channel information, and speech timestamps of each speaker.]
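If you need to match the HH:MM:SS timestamps of the .XLSX export against values given in seconds, a small conversion helper can be useful. The sketch below is only an illustration: the function name `to_hhmmss` is hypothetical, and it assumes that fractional seconds are simply dropped, which may differ from how the application rounds its values.

```python
def to_hhmmss(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS.

    Assumption: the fractional part is truncated, not rounded.
    """
    whole = int(seconds)                    # drop fractional seconds
    hours, remainder = divmod(whole, 3600)  # 3600 seconds per hour
    minutes, secs = divmod(remainder, 60)   # 60 seconds per minute
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(to_hhmmss(46.23))   # 00:00:46
print(to_hhmmss(3725.9))  # 01:02:05
```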

CSV format

The .CSV format is well-suited for users who work with large datasets, as it facilitates computational processing and filtering based on specific metadata criteria. In this format, the start and end times of each segment are expressed in seconds. The .CSV format uses UTF-8 encoding.

Speaker,Channel,Start_time,End_time
0,0,0.18,13.89
0,0,14.15,14.39
1,0,14.39,15.91
1,0,15.95,26.35
0,0,26.55,39.51
1,0,39.51,42.16
0,0,42.16,42.28
0,0,42.39,46.23
1,0,46.23,48.91
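As an illustration of the kind of processing the .CSV export allows, the following sketch sums the speech time of each speaker per channel. It relies only on the column names shown in the example above; the function name `speech_time_per_speaker` and the inline sample are ours, not part of the product. For a real export you would read the file with `open(path, newline="", encoding="utf-8")` instead of an in-memory string.

```python
import csv
import io
from collections import defaultdict

# First rows of the CSV example, embedded so the sketch is self-contained.
CSV_TEXT = """Speaker,Channel,Start_time,End_time
0,0,0.18,13.89
0,0,14.15,14.39
1,0,14.39,15.91
"""


def speech_time_per_speaker(csv_text):
    """Return {(channel, speaker): total seconds of speech}."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (int(row["Channel"]), int(row["Speaker"]))
        # Segment duration is end time minus start time, both in seconds.
        totals[key] += float(row["End_time"]) - float(row["Start_time"])
    return dict(totals)


totals = speech_time_per_speaker(CSV_TEXT)
for (channel, speaker), seconds in sorted(totals.items()):
    print(f"channel {channel}, speaker {speaker}: {seconds:.2f} s")
```

Keying the totals by both channel and speaker keeps the speakers of different channels separate, which matters if speaker IDs are numbered independently within each channel of a stereo recording.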

JSON format

Like the .CSV format, the JSON representation is well-suited for users who work with large datasets, as it facilitates computational processing and filtering based on specific metadata criteria.

{
  "channels": [
    {
      "channel_number": 0,
      "speakers_count": 2,
      "segments": [
        {
          "speaker_id": 0,
          "start_time": 1.51,
          "end_time": 13.99
        },
        {
          "speaker_id": 1,
          "start_time": 13.99,
          "end_time": 14.12
        },
        {
          "speaker_id": 1,
          "start_time": 14.63,
          "end_time": 25.51
        },
        {
          "speaker_id": 1,
          "start_time": 25.7,
          "end_time": 25.94
        }
      ]
    }
  ]
}
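The JSON export can be processed in a similar way. The sketch below groups segments by channel and speaker, using only the field names visible in the example above (`channels`, `channel_number`, `segments`, `speaker_id`, `start_time`, `end_time`). The function name `segments_by_speaker` and the shortened inline sample are assumptions made for illustration.

```python
import json
from collections import defaultdict

# Shortened copy of the JSON example, embedded so the sketch is self-contained.
JSON_TEXT = """
{
  "channels": [
    {
      "channel_number": 0,
      "speakers_count": 2,
      "segments": [
        {"speaker_id": 0, "start_time": 1.51, "end_time": 13.99},
        {"speaker_id": 1, "start_time": 13.99, "end_time": 14.12},
        {"speaker_id": 1, "start_time": 14.63, "end_time": 25.51}
      ]
    }
  ]
}
"""


def segments_by_speaker(result):
    """Return {(channel_number, speaker_id): [(start, end), ...]}."""
    grouped = defaultdict(list)
    for channel in result["channels"]:
        for segment in channel["segments"]:
            key = (channel["channel_number"], segment["speaker_id"])
            grouped[key].append((segment["start_time"], segment["end_time"]))
    return dict(grouped)


grouped = segments_by_speaker(json.loads(JSON_TEXT))
for (channel, speaker), spans in sorted(grouped.items()):
    print(f"channel {channel}, speaker {speaker}: {len(spans)} segment(s)")
```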