Speaker Diarization
This guide demonstrates how to perform Speaker Diarization with Phonexia Speech Platform 4. You can find a high-level description in the Speaker Diarization article.
For testing, we'll be using the following audio files. You can download them all together in the audio_files.zip archive.
filename | channels | number of speakers in each channel (channels are separated by comma) |
---|---|---|
Kathryn_Paula.wav | mono | 2 |
Laura_Harry_Veronika.wav | stereo | 1, 2 |
Laura_Ivy_Kathryn.wav | mono | 3 |
Veronika_Harry.wav | mono | 2 |
At the end of this guide, you'll find a full Python example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Speaker Diarization into your own projects.
Prerequisites
In this guide, we assume that the Speech Platform server is running on port 8000 of http://localhost and that a properly configured Speaker Diarization microservice is available. Here's more information on how to install and start the Speech Platform server and how to make the microservice available.
Environment Setup
We are using Python 3.9 and the Python library requests 2.27 in this example. You can install the requests library with pip as follows:
pip install requests~=2.27
Basic Speaker Diarization
To run Speaker Diarization for a single media file, start by sending a POST request to the /api/technology/speaker-diarization endpoint. In Python, you can do this as follows:
import requests

SPEECH_PLATFORM_SERVER = "http://localhost:8000"
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/speaker-diarization"

audio_path = "Kathryn_Paula.wav"

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        files=files,
    )

print(response.status_code)  # Should print '202'
If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.
The URL for polling the result is returned in the X-Location header. Alternatively, you can assemble the polling URL on your own by appending a slash (/) and the task ID to the endpoint URL.
import requests
import time

polling_url = response.headers["x-location"]  # Use the `response` from the previous step
# Alternatively:
# polling_url = ENDPOINT_URL + "/" + response.json()["task"]["task_id"]

counter = 0
while counter < 100:
    response = requests.get(polling_url)
    data = response.json()
    task_status = data["task"]["state"]
    if task_status in {"done", "failed", "rejected"}:
        break
    counter += 1
    time.sleep(5)
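Note that the loop above gives up after 100 attempts, in which case data may still describe an unfinished task. The following is a minimal sketch of a polling helper that makes this case explicit by raising an error once a deadline passes. The function name wait_for_task, its parameters, and the injected fetch_json callable are illustrative assumptions and not part of the Speech Platform API; only the three terminal state names are taken from this guide.

```python
import time

# Terminal task states, as used in the polling loop of this guide.
TERMINAL_STATES = {"done", "failed", "rejected"}


def wait_for_task(fetch_json, timeout_s=500.0, interval_s=5.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Call fetch_json() until the task reaches a terminal state.

    fetch_json is any zero-argument callable that returns the decoded
    polling response (hypothetical helper, not a platform API).
    Raises TimeoutError if the task is still unfinished after timeout_s.
    """
    deadline = clock() + timeout_s
    while True:
        data = fetch_json()
        if data["task"]["state"] in TERMINAL_STATES:
            return data
        if clock() >= deadline:
            raise TimeoutError(
                f"Task still in state {data['task']['state']!r} after {timeout_s} s"
            )
        sleep(interval_s)
```

With the requests library, it could be called as data = wait_for_task(lambda: requests.get(polling_url).json()). Passing the fetch function in keeps the helper independent of the HTTP client and easy to exercise without a running server.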
Once the polling finishes, data will contain the latest response from the server: either the result of Speaker Diarization or, if processing could not finish properly, an error message with details. The technology result can be accessed as data["result"], and for our sample audio, the result should look as follows (shortened for readability):
{
  "channels": [
    {
      "channel_number": 0,
      "speakers_count": 2,
      "segments": [
        {
          "speaker_id": 0,
          "start_time": 1.51,
          "end_time": 13.99
        },
        {
          "speaker_id": 1,
          "start_time": 13.99,
          "end_time": 14.12
        },
        {
          "speaker_id": 1,
          "start_time": 14.63,
          "end_time": 25.51
        },
        {
          "speaker_id": 1,
          "start_time": 25.7,
          "end_time": 25.94
        },
        ...
      ]
    }
  ]
}
If you are processing multi-channel media files, you will obtain an independent Speaker Diarization result for each channel in the channels list.
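As an illustration of how the channels list can be consumed, here is a small sketch that sums the speaking time of each speaker in each channel. It relies only on the fields shown in the sample result (channel_number, segments, speaker_id, start_time, end_time); the function name speaking_time_per_speaker is our own, and the sample input reuses the first four segments of the shortened result above rather than a complete server response.

```python
from collections import defaultdict


def speaking_time_per_speaker(result):
    """Return {channel_number: {speaker_id: seconds}} for a diarization result."""
    totals = {}
    for channel in result["channels"]:
        per_speaker = defaultdict(float)
        for segment in channel["segments"]:
            duration = segment["end_time"] - segment["start_time"]
            per_speaker[segment["speaker_id"]] += duration
        # Round to centiseconds to hide floating-point noise.
        totals[channel["channel_number"]] = {
            speaker: round(seconds, 2) for speaker, seconds in per_speaker.items()
        }
    return totals


# First four segments of the shortened sample result from this guide.
sample_result = {
    "channels": [
        {
            "channel_number": 0,
            "speakers_count": 2,
            "segments": [
                {"speaker_id": 0, "start_time": 1.51, "end_time": 13.99},
                {"speaker_id": 1, "start_time": 13.99, "end_time": 14.12},
                {"speaker_id": 1, "start_time": 14.63, "end_time": 25.51},
                {"speaker_id": 1, "start_time": 25.7, "end_time": 25.94},
            ],
        }
    ]
}

print(speaking_time_per_speaker(sample_result))  # {0: {0: 12.48, 1: 11.25}}
```

For a stereo file such as Laura_Harry_Veronika.wav, the returned dictionary would contain one entry per channel, since each channel is diarized independently.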
Speaker Diarization with Parameters
If you want more control over the technology result, you can use the config request body parameter, in which you can specify one of the mutually exclusive parameters max_speakers and total_speakers. The max_speakers parameter defines the upper bound on how many speakers may be identified in the audio, potentially making the result more accurate. If max_speakers is not set, the technology uses the default value of 100. The total_speakers parameter, on the other hand, sets the exact number of speakers that must be identified. Use this parameter only if you are sure about the number of speakers in the audio, because the technology always diarizes the recording into that many speakers, even if the actual number is different.
You can use the config payload in the POST request as follows:
import json

payload = {"config": json.dumps({"max_speakers": 5})}
# or
# payload = {"config": json.dumps({"total_speakers": 2})}

with open(audio_path, mode="rb") as file:
    files = {"file": file}
    response = requests.post(
        url=ENDPOINT_URL,
        data=payload,
        files=files,
    )
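Because max_speakers and total_speakers are mutually exclusive, it may help to reject an invalid combination on the client before sending the request. The sketch below shows one possible way to do that. The helper build_config_payload is hypothetical, and the positive-integer check is our assumption; the guide does not state which values the server accepts or how it responds when both parameters are set.

```python
import json


def build_config_payload(max_speakers=None, total_speakers=None):
    """Build the form payload for the config parameter (hypothetical helper)."""
    if max_speakers is not None and total_speakers is not None:
        raise ValueError("max_speakers and total_speakers are mutually exclusive")

    config = {}
    if max_speakers is not None:
        config["max_speakers"] = max_speakers
    if total_speakers is not None:
        config["total_speakers"] = total_speakers

    # Assumption: speaker counts only make sense as positive integers.
    for name, value in config.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")

    # An empty payload leaves the server defaults in effect.
    return {"config": json.dumps(config)} if config else {}


print(build_config_payload(max_speakers=5))  # {'config': '{"max_speakers": 5}'}
```

The returned dictionary can be passed as data=build_config_payload(max_speakers=5) to requests.post, in place of the hand-written payload.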
You can then poll for the result and parse it as demonstrated in the Basic Speaker Diarization section.
Full Python code
Here is the full example of how to run the Speaker Diarization technology with no parameters. The code is slightly adjusted and wrapped into functions for better readability.
import json
import requests
import time

SPEECH_PLATFORM_SERVER = "http://localhost:8000"  # Replace with your actual server URL
ENDPOINT_URL = f"{SPEECH_PLATFORM_SERVER}/api/technology/speaker-diarization"


def poll_result(polling_url: str, sleep: int = 5):
    # Poll until the task reaches a terminal state
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        data = response.json()
        task_status = data["task"]["state"]
        if task_status in {"done", "failed", "rejected"}:
            break
        time.sleep(sleep)
    return response


def run_speaker_diarization(audio_path: str):
    # Submit the file, then wait for the scheduled task to finish
    with open(audio_path, mode="rb") as file:
        files = {"file": file}
        response = requests.post(
            url=ENDPOINT_URL,
            files=files,
        )
    response.raise_for_status()
    polling_url = response.headers["x-location"]
    response_result = poll_result(polling_url)
    return response_result.json()


filenames = [
    "Kathryn_Paula.wav",
    "Laura_Harry_Veronika.wav",
    "Laura_Ivy_Kathryn.wav",
    "Veronika_Harry.wav",
]

for filename in filenames:
    print(f"Running Speaker Diarization for file {filename}.")
    data = run_speaker_diarization(filename)
    result = data["result"]
    print(f"{json.dumps(result, indent=2)}\n")
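The final loop reads data["result"] directly, which assumes that every task ended in the done state. Since polling also stops on failed and rejected, a guard such as the following sketch may be useful. The helper extract_result is hypothetical, and because this guide does not show the shape of the error response, the sketch simply includes the whole response in the exception message instead of guessing at field names.

```python
import json


def extract_result(data):
    """Return data["result"] if the task is done, otherwise raise (hypothetical helper)."""
    state = data["task"]["state"]
    if state != "done":
        # The layout of the error details is not assumed here;
        # the full response is reported as-is.
        raise RuntimeError(f"Task ended in state {state!r}: {json.dumps(data)}")
    return data["result"]
```

In the loop, result = extract_result(data) could then replace result = data["result"].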