Speaker diarization
Phonexia speaker-diarization is a tool for detecting individual speakers in an audio recording and segmenting the audio according to which of the detected speakers is speaking in each segment. It is a language-, domain-, and channel-independent technology. Phonexia speaker-diarization is useful for labeling parts of an utterance by speaker, identifying how many speakers are speaking in a recording, or preprocessing audio for other speech recognition technologies. To learn more, visit the technology's home page.
Installation
- Docker image
- Docker compose
- Helm chart
Getting the image
You can easily obtain the speaker diarization image from Docker Hub. There are two variants of the image: one for CPU and one for GPU.
- CPU
- GPU
You can get the CPU image by specifying an exact version in the tag (e.g. 1.0.0) or latest for the latest image:
docker pull phonexia/speaker-diarization:latest
The GPU images have a -gpu suffix in the image tag (e.g. 1.0.0-gpu), or you can use the gpu tag to get the latest version. In these images, the most computationally demanding tasks are handled by the GPU. The prerequisites are an NVIDIA GPU with drivers and nvidia-container-toolkit installed (see the Installing the NVIDIA Container Toolkit guide for more info).
docker pull phonexia/speaker-diarization:gpu
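Before running the image, you can optionally verify that Docker can access the GPU; this is a quick sanity check, assuming the NVIDIA Container Toolkit is installed as described above:
docker run --rm --gpus all ubuntu nvidia-smi
If the setup is correct, nvidia-smi prints a table listing your GPU(s).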
Running the image
You can start the microservice and list all the supported options by running:
docker run --rm -it phonexia/speaker-diarization:latest --help
The output should look like this:
Usage: speaker-diarization [OPTIONS]

Options:
  -h,--help                   Print this help message and exit
  -m,--model file REQUIRED (Env:PHX_MODEL_PATH)
                              Path to a model file.
  -k,--license_key string REQUIRED (Env:PHX_LICENSE_KEY)
                              License key.
  -a,--listening_address address [[::]] (Env:PHX_LISTENING_ADDRESS)
                              Address on which the server will be listening. Address '[::]' also accepts IPv4 connections.
  -p,--port number [8080] (Env:PHX_PORT)
                              Port on which the server will be listening.
  -l,--log_level level:{error,warning,info,debug,trace} [info] (Env:PHX_LOG_LEVEL)
                              Logging level. Possible values: error, warning, info, debug, trace.
  --keepalive_time_s number:[0, max_int] [60] (Env:PHX_KEEPALIVE_TIME_S)
                              Time between 2 consecutive keep-alive messages, that are sent if there is no activity from the client. If set to 0, the default gRPC configuration (2hr) will be set (note, that this may get the microservice into unresponsive state).
  --keepalive_timeout_s number:[1, max int] [20] (Env:PHX_KEEPALIVE_TIMEOUT_S)
                              Time to wait for keep alive acknowledgement until the connection is dropped by the server.
  --device TEXT:{cpu,cuda} [cpu] (Env:PHX_DEVICE)
                              Compute device used for inference.
  --num_threads_per_instance NUM (Env:PHX_NUM_THREADS_PER_INSTANCE)
                              Number of threads per instance (applies to CPU processing only). Use N CPU threads in the microservice for each request. Number of threads is automatically detected if set to 0.
  --num_instances_per_device NUM:UINT > 0 (Env:PHX_NUM_INSTANCES_PER_DEVICE)
                              Number of instances per device (both CPU and GPU processing). Microservice can process requests concurrently if value is >1.
  --device_index ID (Env:PHX_DEVICE_INDEX)
                              Device identifier
The model and license_key options are required. To obtain the model and license, contact Phonexia.
You can specify the options either via command-line arguments or via environment variables.
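For example, a minimal sketch of a CPU run that passes the required options as environment variables (the variable names come from the help output above) instead of arguments:
docker run --rm -it -p 8080:8080 -v /opt/phx/models:/models -e PHX_MODEL_PATH=/models/speaker_diarization-xl-5.1.0.model -e PHX_LICENSE_KEY=${license-key} phonexia/speaker-diarization:latest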
- CPU
- GPU
Run the container with the mandatory parameters:
docker run --rm -it -p 8080:8080 -v /opt/phx/models:/models phonexia/speaker-diarization:latest --model /models/speaker_diarization-xl-5.1.0.model --license_key ${license-key}
To run GPU images, you will need to make the GPU available inside the Docker container. This is done with the --gpus parameter (typically --gpus all); see the Access an NVIDIA GPU chapter for more info.
Run the container with the mandatory parameters:
docker run --rm -it --gpus all -v /opt/phx/models:/models -p 8080:8080 phonexia/speaker-diarization:gpu --model /models/speaker_diarization-xl-5.1.0.model --license_key ${license-key}
Replace the /opt/phx/models, speaker_diarization-xl-5.1.0.model and license-key with the corresponding values.
With this command, the container will start, and the microservice will be listening on port 8080 on localhost.
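To quickly verify that the microservice is up, you can, for example, list its services over gRPC reflection using the grpcurl tool (introduced later on this page):
grpcurl -plaintext localhost:8080 list
Among other entries, the output should include phonexia.grpc.technologies.speaker_diarization.v1.SpeakerDiarization.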
Docker compose
There are two variants of the Docker image: one for CPU and one for GPU. Create a docker-compose.yml file for the specific variant:
- CPU
- GPU
version: '3'
services:
  speaker-diarization:
    image: phonexia/speaker-diarization:latest
    environment:
      - PHX_MODEL_PATH=/models/speaker_diarization-xl-5.1.0.model
      - PHX_LICENSE_KEY=<license-key>
    ports:
      - 8080:8080
    volumes:
      - ./models:/models/
version: '3'
services:
  speaker-diarization:
    image: phonexia/speaker-diarization:gpu
    environment:
      - PHX_MODEL_PATH=/models/speaker_diarization-xl-5.1.0.model
      - PHX_LICENSE_KEY=<license-key>
    ports:
      - 8080:8080
    volumes:
      - ./models:/models/
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Create a models folder in the same directory as the docker-compose.yml file and place a model file in it. Replace <license-key> with your license key and speaker_diarization-xl-5.1.0.model with the actual name of a model.
The model and license_key options are required. To obtain the model and license, contact Phonexia.
You can then start the microservice by running:
$ docker compose up
The optimal way to deploy at large scale is to use a container orchestration system. Take a look at our Helm chart deployment page for deployment using Kubernetes.
Performance optimization
The speaker-diarization microservice supports GPU acceleration.
In the docker images with GPU support, the GPU acceleration is enabled by default. While GPU acceleration will be used primarily, certain processing tasks will still rely on CPU resources.
For better performance, multiple microservices can share a GPU unit. The number of microservice instances per GPU depends on the hardware used.
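For example, one way to increase GPU utilization is to run multiple inference instances inside a single microservice using the --num_instances_per_device option from the help output above. A sketch, with 2 as an illustrative value that you should tune for your hardware:
docker run --rm -it --gpus all -v /opt/phx/models:/models -p 8080:8080 phonexia/speaker-diarization:gpu --model /models/speaker_diarization-xl-5.1.0.model --license_key ${license-key} --num_instances_per_device 2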
Microservice communication
gRPC API
For communication, our microservices use gRPC, which is a high-performance, open-source Remote Procedure Call (RPC) framework that enables efficient communication between distributed systems using a variety of programming languages. We use an interface definition language to specify a common interface and contracts between components. This is primarily achieved by specifying methods with parameters and return types.
Take a look at our gRPC API documentation. The speaker-diarization microservice defines a SpeakerDiarization service with a remote procedure called Diarize. This procedure accepts an argument (also referred to as a "message") called DiarizeRequest, which contains the audio as an array of bytes, together with an optional config argument. The DiarizeRequest argument is streamed, meaning that it may be received in multiple requests, each containing a part of the audio. If specified, the optional config argument must be sent only with the first request. Once all the requests have been received and processed, the Diarize procedure returns a message called DiarizeResponse, which consists of the number of detected speakers, the processed audio length, and an array of segments detected in the audio. Each segment then consists of a speaker identifier and the start and end time of the segment.
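In protocol-buffer terms, the interface can be sketched roughly as follows; this is an illustrative outline based on the description above, not the authoritative definition (see the proto files referenced below for that):
service SpeakerDiarization {
  // Client-streaming call: the audio (plus, in the first message only,
  // an optional config) is sent as a stream of DiarizeRequest chunks;
  // a single DiarizeResponse is returned when processing finishes.
  rpc Diarize(stream DiarizeRequest) returns (DiarizeResponse);
}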
Connecting to microservice
There are multiple ways to communicate with our microservices.
- Generated library
- Python client
- grpcurl client
- GUI clients
Using a generated library
The most common way to communicate with the microservices is from a programming language using a generated library.
Python library
If you use Python as your programming language, you can use our official gRPC Python library.
To install the package using pip, run:
pip install phonexia-grpc
You can then import:
- Specific libraries for each microservice that provide the message wrappers.
- Stubs for the gRPC clients.
from phonexia.grpc.common.core_pb2 import Audio, RawAudioConfig, TimeRange
from phonexia.grpc.technologies.speaker_diarization.v1.speaker_diarization_pb2 import DiarizeRequest, DiarizeResponse
from phonexia.grpc.technologies.speaker_diarization.v1.speaker_diarization_pb2_grpc import SpeakerDiarizationStub
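As an illustration, here is a minimal sketch of a streaming Diarize call built from these imports. It assumes the microservice is listening on localhost:8080 without TLS, and recording.wav is a placeholder path:
import grpc

from phonexia.grpc.common.core_pb2 import Audio
from phonexia.grpc.technologies.speaker_diarization.v1.speaker_diarization_pb2 import DiarizeRequest
from phonexia.grpc.technologies.speaker_diarization.v1.speaker_diarization_pb2_grpc import SpeakerDiarizationStub

CHUNK_SIZE = 1024 * 1024  # 1 MiB chunks, safely below the 4 MiB gRPC message limit


def diarize_requests(path: str):
    # Stream the audio file as a sequence of DiarizeRequest messages,
    # each carrying one chunk of the raw file bytes.
    with open(path, "rb") as file:
        while chunk := file.read(CHUNK_SIZE):
            yield DiarizeRequest(audio=Audio(content=chunk))


with grpc.insecure_channel("localhost:8080") as channel:
    stub = SpeakerDiarizationStub(channel)
    response = stub.Diarize(diarize_requests("recording.wav"))
    print(response)  # number of detected speakers, audio length, and segments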
Generating a library for the programming language of your choice
For the definition of microservice interfaces, we use the standard protocol buffers mechanism. The services, together with the procedures and messages that they expose, are defined in so-called proto files.
The proto files can be used to generate client libraries in many programming languages. Take a look at the protobuf tutorials for how to get started with generating the library in the language of your choice using the protoc tool.
You can find the proto files developed by Phonexia in this repository.
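For Python, for instance, the generation step might look like the following sketch; it assumes the proto files are checked out into a protos directory, the grpcio-tools package is installed (pip install grpcio-tools), and the proto path mirrors the Python import paths shown above:
python -m grpc_tools.protoc --proto_path=protos --python_out=. --grpc_python_out=. protos/phonexia/grpc/technologies/speaker_diarization/v1/speaker_diarization.proto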
Using existing clients
Phonexia Python client
The easiest way to get started with testing is to use our simple Python client. To get it, run:
pip install phonexia-speaker-diarization-client
After the successful installation, run the following command to see the client options:
speaker_diarization_client --help
grpcurl client
If you need a simple tool for testing the microservice on the command line, you can use grpcurl. This tool can serialize and send a request for you if you provide the request body in JSON format and specify the endpoint.
The audio content in the body must be encoded in Base64. The request also cannot exceed 4 MiB; therefore, it's necessary to split bigger files into multiple chunks. You can use the jq tool to generate the JSON input for grpcurl.
Now you can make the request. The microservice supports reflection, which means that you don't need to know the API in advance to make a request. Replace ${path_to_audio_file} with the corresponding value.
base64 -w 4000000 ${path_to_audio_file} | jq -cnR '{"audio":{"content":inputs}}' | grpcurl -plaintext -use-reflection -d @ localhost:8080 phonexia.grpc.technologies.speaker_diarization.v1.SpeakerDiarization/Diarize
grpcurl automatically serializes the response to this request into JSON, including the diarized segments.
GUI clients
If you'd prefer to use a GUI client like Postman or Warthog to test the microservice, take a look at the GUI Client page in our documentation. Note that you will still need to convert the audio into the Base64 format manually, as those tools do not support this conversion by default either.
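To produce the Base64 content for those clients, you can, for example, use the base64 tool from GNU coreutils (-w 0 disables line wrapping; the output file name is arbitrary):
base64 -w 0 ${path_to_audio_file} > audio.b64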
Further links
- Maintained by Phonexia
- Contact us via e-mail, or open a ticket at the Phonexia Service Desk
- File an issue
- See list of licenses
- See the terms of use
Versioning
We use Semantic Versioning.