Version: 4.0.0-rc1

Voice activity detection

The Phonexia voice-activity-detection microservice identifies the segments within a recording that contain voice or speech. To learn more, visit the technology's home page.

Installation

Getting the image

You can obtain the voice activity detection image from Docker Hub. There are two variants of the image: one for CPU and one for GPU.

You can get the CPU image by specifying either an exact version in the tag (e.g. 1.0.0) or latest for the most recent image:

docker pull phonexia/voice-activity-detection:latest

Running the image

You can start the microservice and list all the supported options by running:

docker run --rm -it phonexia/voice-activity-detection:latest --help

The output should look like this:


voice-activity-detection [OPTIONS]


OPTIONS:
-h, --help Print this help message and exit
-m, --model file REQUIRED (Env:PHX_MODEL_PATH)
Path to a model file.
-k, --license_key string REQUIRED (Env:PHX_LICENSE_KEY)
License key.
-a, --listening_address address [[::]] (Env:PHX_LISTENING_ADDRESS)
Address on which the server will be listening. Address '[::]'
also accepts IPv4 connections.
-p, --port number [8080] (Env:PHX_PORT)
Port on which the server will be listening.
-l, --log_level level:{error,warning,info,debug,trace} [info] (Env:PHX_LOG_LEVEL)
Logging level. Possible values: error, warning, info, debug,
trace.
--keepalive_time_s number:[0, max_int] [60] (Env:PHX_KEEPALIVE_TIME_S)
Time between 2 consecutive keep-alive messages, that are sent if
there is no activity from the client. If set to 0, the default
gRPC configuration (2hr) will be set (note, that this may get the
microservice into unresponsive state).
--keepalive_timeout_s number:[1, max int] [20] (Env:PHX_KEEPALIVE_TIMEOUT_S)
Time to wait for keep alive acknowledgement until the connection
is dropped by the server.
--device TEXT:{cpu,cuda} [cpu] (Env:PHX_DEVICE)
Compute device used for inference.
--num_threads_per_instance NUM (Env:PHX_NUM_THREADS_PER_INSTANCE)
Number of threads per instance (applies to CPU processing only).
Use N CPU threads in the microservice for each request. Number of
threads is automatically detected if set to 0.
--num_instances_per_device NUM:UINT > 0 (Env:PHX_NUM_INSTANCES_PER_DEVICE)
Number of instances per device (both CPU and GPU processing).
Microservice can process requests concurrently if value is >1.
--device_index ID (Env:PHX_DEVICE_INDEX)
Device identifier
--profile TEXT:{speech_to_text,voice_biometrics} [speech_to_text] (Env:PHX_PROFILE)
Specify the profile to use for Voice Activity Detection. Choose
'speech_to_text' to use configuration for speech-to-text
scenarios or 'voice_biometrics' to use configuration for voice
activity detection used in voice biometric technologies.
note

The model and license_key options are required. To obtain the model and license, contact Phonexia.

You can specify the options either via command line arguments or via environmental variables.

Notice the --profile option. It allows you to select a configuration profile used by other Phonexia technologies; voice activity detection then produces the same speech segmentation as those technologies. The default value, speech_to_text, corresponds to the profile optimized for the Speech-to-Text technology, while the voice_biometrics profile is tailored for biometric technologies, such as Speaker Identification.
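As the help output above indicates, each option maps to an environment variable, so the same configuration can be supplied without command-line flags. A sketch with illustrative values (replace the model path, license key, and profile with your own):

```shell
# Illustrative values only; substitute your actual model path and license key.
docker run --rm -it \
  -v /opt/phx/models:/models \
  -p 8080:8080 \
  -e PHX_MODEL_PATH=/models/generic-3.2.0.model \
  -e PHX_LICENSE_KEY="your-license-key" \
  -e PHX_PROFILE=voice_biometrics \
  phonexia/voice-activity-detection:latest
```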

Run the container with the mandatory parameters:

docker run --rm -it -v /opt/phx/models:/models -p 8080:8080 phonexia/voice-activity-detection:latest --model /models/generic-3.2.0.model --license_key ${license-key}

Replace /opt/phx/models, generic-3.2.0.model, and license-key with the corresponding values.

With this command, the container will start, and the microservice will be listening on port 8080 on localhost.
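As a quick sanity check, you can try listing the services exposed by the running container with grpcurl. Note this is an assumption-laden sketch: it requires grpcurl to be installed and the server to have gRPC reflection enabled, which may not be the case for this microservice.

```shell
# Assumes grpcurl is installed and the server exposes gRPC reflection.
grpcurl -plaintext localhost:8080 list
```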

Performance optimization

The voice-activity-detection microservice supports GPU acceleration.

In the Docker images with GPU support, GPU acceleration is enabled by default. While the GPU handles most of the processing, certain tasks still rely on CPU resources.

For better performance, multiple microservice instances can share a single GPU. The number of instances per GPU depends on the hardware used.
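Using the documented --device and --num_instances_per_device options, a GPU-accelerated container might be started as sketched below. The values are illustrative, and the GPU image tag may differ from the CPU one shown here; check Docker Hub for the exact tag.

```shell
# Illustrative sketch: image tag, model path, and instance count are placeholders.
docker run --rm -it --gpus all \
  -v /opt/phx/models:/models \
  -p 8080:8080 \
  phonexia/voice-activity-detection:latest \
  --model /models/generic-3.2.0.model \
  --license_key ${license-key} \
  --device cuda \
  --num_instances_per_device 2
```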

Microservice communication

gRPC API

For communication, our microservices use gRPC, which is a high-performance, open-source Remote Procedure Call (RPC) framework that enables efficient communication between distributed systems using a variety of programming languages. We use an interface definition language to specify a common interface and contracts between components. This is primarily achieved by specifying methods with parameters and return types.

Take a look at our gRPC API documentation. The voice-activity-detection microservice defines a VoiceActivityDetection service with a remote procedure called Detect. The Detect procedure accepts an argument (also referred to as a "message") called DetectRequest, which contains the audio as an array of bytes.

This DetectRequest argument is streamed, meaning that it may be received in multiple requests, each containing a part of the audio. Once all requests have been received and processed, the Detect procedure returns a message called DetectResponse, which contains the resulting segmentation.
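The client-side streaming described above can be sketched in Python. The chunking helper below is plain Python; in a real client, each yielded chunk would be wrapped in a DetectRequest message from the generated gRPC stubs (stub module and field names are not shown here, as they depend on the published proto files):

```python
def chunk_audio(audio: bytes, chunk_size: int = 64 * 1024):
    """Yield successive chunks of an audio byte string.

    In a real client, each chunk would be wrapped in a DetectRequest
    message and sent on the client-side stream of the Detect procedure.
    """
    for offset in range(0, len(audio), chunk_size):
        yield audio[offset:offset + chunk_size]


# Example: a 150 KiB payload is split into three requests.
payload = bytes(150 * 1024)
chunks = list(chunk_audio(payload))
# chunk lengths: [65536, 65536, 22528]
```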

Connecting to microservice

There are multiple ways to communicate with our microservices.

Phonexia Python client

The easiest way to get started with testing is to use our simple Python client. To get it, run:

pip install phonexia-voice-activity-detection-client

After the successful installation, run the following command to see the client options:

voice_activity_detection_client --help

Versioning

We use Semantic Versioning.
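For instance, the version string at the top of this page, 4.0.0-rc1, breaks down into major, minor, and patch components plus a pre-release tag. A minimal parse, simplified relative to the full SemVer grammar (the pattern below ignores build metadata):

```python
import re

# Simplified SemVer pattern: MAJOR.MINOR.PATCH with an optional pre-release tag.
SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?$")

match = SEMVER.match("4.0.0-rc1")
major, minor, patch, prerelease = match.groups()
# major="4", minor="0", patch="0", prerelease="rc1"
```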