Version: 3.6.0

Speaker Diarization

This guide demonstrates how to perform Speaker Diarization with Phonexia Speech Platform 4. You can find a high-level description in the Speaker Diarization article.

For testing, we'll be using the following audio files. You can download them all together in the archive.

filename | channels | number of speakers in each channel (channels are separated by comma)
Laura_Harry_Veronika.wav | stereo | 1, 2

The steps are first discussed separately. At the end of this guide, you'll find the full Python code example that combines them. This guide should give you a comprehensive understanding of how to integrate Speaker Diarization in your own projects.


In the guide, we assume that the Speech Platform server is running on port 8000 of http://localhost and a properly configured Speaker Diarization microservice is available. Here's more information on how to install and start the Speech Platform server and how to make the microservice available.

Environment Setup

We are using Python 3.9 and Python library requests 2.27 in this example. You can install the requests library with pip as follows:

pip install requests~=2.27

Basic Speaker Diarization

To run Speaker Diarization for a single media file, you should start by sending a POST request to the /api/technology/speaker-diarization endpoint. In Python, you can do this as follows:

import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/speaker-diarization"

audio_path = "Laura_Harry_Veronika.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    # Upload the media file as multipart form data
    response = requests.post(url=ENDPOINT_URL, files=files)
print(response.status_code)  # Should print '202'

If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.
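
Before polling, it can be useful to confirm that the task was really accepted. The following is a minimal sketch, not part of the platform API: the helper name `extract_task_id` is hypothetical, and it assumes the response body has the `["task"]["task_id"]` layout used in this guide.

```python
def extract_task_id(status_code: int, body: dict) -> str:
    """Return the task ID of an accepted request, or raise if it was not accepted."""
    if status_code != 202:
        raise RuntimeError(f"Task was not accepted (HTTP {status_code}): {body}")
    return body["task"]["task_id"]


# Hand-written example body with a made-up task ID. With `requests`, you would
# pass `response.status_code` and `response.json()` instead.
example_body = {"task": {"task_id": "hypothetical-task-id"}}
print(extract_task_id(202, example_body))
```

Raising early on any other status code keeps a failed upload from surfacing later as a confusing error in the polling step.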

The URL for polling the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.

import requests
import time

polling_url = response.headers["x-location"] # Use the `response` from the previous step
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

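counter = 0
while counter < 100:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    # Stop polling as soon as the task reaches a final state
    if task_status in {"done", "failed", "rejected"}:
        break
    counter += 1
    time.sleep(5)  # Wait before asking again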
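
The loop above stops after 100 attempts without signalling whether the task actually finished. As a hedged sketch, the same polling logic can be wrapped in a function that raises on timeout. The name `poll_until_finished` and its `fetch`, `max_attempts`, `delay` and `sleep` parameters are assumptions made for illustration. Only the three final states come from this guide.

```python
import time
from typing import Callable

# Final task states as used in this guide
FINAL_STATES = {"done", "failed", "rejected"}


def poll_until_finished(
    fetch: Callable[[], dict],
    max_attempts: int = 100,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch` until the task reaches a final state or attempts run out."""
    for attempt in range(max_attempts):
        data = fetch()
        if data["task"]["state"] in FINAL_STATES:
            return data
        # Do not wait after the last attempt
        if attempt < max_attempts - 1:
            sleep(delay)
    raise TimeoutError(f"Task did not finish after {max_attempts} attempts")
```

With `requests`, `fetch` could be `lambda: requests.get(polling_url).json()`. Passing the fetch and sleep functions as arguments makes the loop easy to exercise without a running server.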

Once the polling finishes, data will contain the latest response from the server: either the Speaker Diarization result, or an error message with details if processing could not finish properly. The technology result can be accessed as data["result"], and for our sample audio, the result should look as follows (the result was shortened for readability):

{
  "channels": [
    {
      "channel_number": 0,
      "speakers_count": 2,
      "segments": [
        {
          "speaker_id": 0,
          "start_time": 1.51,
          "end_time": 13.99
        },
        {
          "speaker_id": 1,
          "start_time": 13.99,
          "end_time": 14.12
        },
        {
          "speaker_id": 1,
          "start_time": 14.63,
          "end_time": 25.51
        },
        {
          "speaker_id": 1,
          "start_time": 25.7,
          "end_time": 25.94
        },
        {
          "speaker_id": 0,
          "start_time": 1.51,
          "end_time": 13.99
        }
      ]
    }
  ]
}

In case you are processing multi-channel media files, you will obtain an independent Speaker Diarization result for each channel in the channels list.
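
To illustrate how the per-channel result can be consumed, here is a small sketch that sums the speech time of each speaker in each channel. The function name `speech_time_per_speaker` is hypothetical, and the sketch assumes that `start_time` and `end_time` are given in seconds. The sample input reuses the first four segments of the result shown above.

```python
from collections import defaultdict


def speech_time_per_speaker(result: dict) -> dict:
    """Map (channel_number, speaker_id) to the total speech time."""
    totals = defaultdict(float)
    for channel in result["channels"]:
        for segment in channel["segments"]:
            key = (channel["channel_number"], segment["speaker_id"])
            totals[key] += segment["end_time"] - segment["start_time"]
    # Round to two decimals to hide floating-point noise
    return {key: round(value, 2) for key, value in totals.items()}


sample = {
    "channels": [
        {
            "channel_number": 0,
            "speakers_count": 2,
            "segments": [
                {"speaker_id": 0, "start_time": 1.51, "end_time": 13.99},
                {"speaker_id": 1, "start_time": 13.99, "end_time": 14.12},
                {"speaker_id": 1, "start_time": 14.63, "end_time": 25.51},
                {"speaker_id": 1, "start_time": 25.7, "end_time": 25.94},
            ],
        }
    ]
}

for (channel, speaker), total in sorted(speech_time_per_speaker(sample).items()):
    print(f"channel {channel}, speaker {speaker}: {total}")
```

Because speaker IDs are reported per channel, the sketch keys the totals by the pair of channel number and speaker ID rather than by speaker ID alone.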

Speaker Diarization with Parameters

If you want more control over the technology result, you can use the config request body parameter, in which you can specify one of two mutually exclusive parameters: max_speakers or total_speakers. The max_speakers parameter defines the upper bound on how many speakers may be identified in the audio, potentially making the result more accurate. If no value of max_speakers is set, the technology uses the default value of 100. The total_speakers parameter, on the other hand, sets the exact number of speakers that must be identified. Use this parameter only if you are sure about the number of speakers in the audio, because the technology always diarizes the recording into that many speakers, even if the actual number is different.

You can use the config payload in the POST request as follows:

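import json

payload = {"config": json.dumps({"max_speakers": 5})}

# or
# payload = {"config": json.dumps({"total_speakers": 2})}

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    # The config is sent as a form field alongside the uploaded file
    response = requests.post(url=ENDPOINT_URL, files=files, data=payload)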

Polling and result parsing then work the same way as demonstrated in the Basic Speaker Diarization section.
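
Because max_speakers and total_speakers are mutually exclusive, it can help to check this on the client before sending the request. The following is a minimal sketch. The helper `build_config_payload` is hypothetical and not part of the platform API. It assumes that omitting the config field entirely is equivalent to the basic request shown earlier.

```python
import json
from typing import Optional


def build_config_payload(
    max_speakers: Optional[int] = None,
    total_speakers: Optional[int] = None,
) -> dict:
    """Build the form payload for the POST request, allowing at most one parameter."""
    if max_speakers is not None and total_speakers is not None:
        raise ValueError("max_speakers and total_speakers are mutually exclusive")
    config = {}
    if max_speakers is not None:
        config["max_speakers"] = max_speakers
    if total_speakers is not None:
        config["total_speakers"] = total_speakers
    # With no parameters, send no config field at all
    return {"config": json.dumps(config)} if config else {}


print(build_config_payload(max_speakers=5))
```

The returned dictionary can be used as the payload in the POST request, in the same way as the hand-written payload above.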

Full Python code

Here is the full example of how to run the Speaker Diarization technology with no parameters. The code is slightly adjusted and wrapped into functions for better readability.

import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000" # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/speaker-diarization"

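def poll_result(polling_url: str, sleep: int = 5):
    """Poll the task until it reaches a final state and return the last response."""
    while True:
        response = requests.get(polling_url)
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            return response
        time.sleep(sleep)  # Wait before polling again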

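def run_speaker_diarization(audio_path: str):
    """Submit the file for Speaker Diarization and return the final JSON response."""
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(url=ENDPOINT_URL, files=files)
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    response_result = poll_result(polling_url)
    return response_result.json()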

# The sample file listed at the beginning of this guide
filenames = [
    "Laura_Harry_Veronika.wav",
]

for filename in filenames:
    print(f"Running Speaker Diarization for file {filename}.")
    data = run_speaker_diarization(filename)
    result = data["result"]
    print(f"{json.dumps(result, indent=2)}\n")