Version: 4.0.0-rc1

Enhanced Speech To Text Built On Whisper

Phonexia enhanced-speech-to-text-built-on-whisper is a tool for transcribing speech from audio recordings into written text. This tool uses custom voice activity detection for better performance. To learn more, visit the technology's home page.

Installation

Getting the image

You can obtain the image from Docker Hub. There are two variants of the image: one for CPU and one for GPU.

You can get the CPU image by specifying an exact version in the tag (e.g. 1.0.0), or latest for the most recent image:

docker pull phonexia/enhanced-speech-to-text-built-on-whisper:latest
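
To pull a specific version instead, use its tag (1.0.0 is the example version mentioned above):

docker pull phonexia/enhanced-speech-to-text-built-on-whisper:1.0.0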

Running the image

You can start the microservice and list all the supported options by running:

docker run --rm -it phonexia/enhanced-speech-to-text-built-on-whisper:latest --help

The output should look like this:


enhanced-speech-to-text-built-on-whisper [OPTIONS]


OPTIONS:
-h, --help Print this help message and exit
-m, --model file REQUIRED (Env:PHX_MODEL_PATH)
Path to a model file.
-k, --license_key string REQUIRED (Env:PHX_LICENSE_KEY)
License key.
-a, --listening_address address [[::]] (Env:PHX_LISTENING_ADDRESS)
Address on which the server will be listening. Address '[::]'
also accepts IPv4 connections.
-p, --port number [8080] (Env:PHX_PORT)
Port on which the server will be listening.
-l, --log_level level:{error,warning,info,debug,trace} [info] (Env:PHX_LOG_LEVEL)
Logging level. Possible values: error, warning, info, debug,
trace.
--keepalive_time_s number:[0, max_int] [60] (Env:PHX_KEEPALIVE_TIME_S)
Time between two consecutive keep-alive messages, which are sent
if there is no activity from the client. If set to 0, the default
gRPC configuration (2 hours) is used (note that this may leave the
microservice in an unresponsive state).
--keepalive_timeout_s number:[1, max_int] [20] (Env:PHX_KEEPALIVE_TIMEOUT_S)
Time to wait for a keep-alive acknowledgement before the connection
is dropped by the server.
--device TEXT:{cpu,cuda} [cpu] (Env:PHX_DEVICE)
Compute device used for inference.
--num_threads_per_instance NUM [0] (Env:PHX_NUM_THREADS_PER_INSTANCE)
Number of threads per instance (applies to CPU processing only).
The microservice uses this many CPU threads for each request. The
number of threads is detected automatically if set to 0.
--num_instances_per_device NUM:UINT > 0 [1] (Env:PHX_NUM_INSTANCES_PER_DEVICE)
Number of instances per device. The microservice can process
requests concurrently if the value is >1. The maximum number of
concurrently running requests is (num_instances_per_device *
device_indices.size()).
--device_indices INT [[0]] ... (Env:PHX_DEVICE_INDICES)
List of devices to run the model on. The microservice can process
requests concurrently if the number of devices is >1. The maximum
number of concurrently running requests is (num_instances_per_device *
device_indices.size()).
--use_vad BOOLEAN [1] (Env:PHX_USE_VAD)
Whether to use Voice Activity Detection (VAD) filtering
--seed UINT (Env:PHX_SEED)
Seed for random generator
--beam_size UINT (Env:PHX_BEAM_SIZE)
Override the default beam size for the model. Beam size controls
the number of alternative paths that are explored when generating
the output. Setting the beam size to a low value may reduce the
time complexity at the cost of lower word accuracy.
Note: The model and license_key options are required. To obtain the model and license, contact Phonexia.

You can specify the options either via command-line arguments or via environment variables.

Run the container with the mandatory parameters:

docker run --rm -it -v /opt/phx/models:/models -p 8080:8080 phonexia/enhanced-speech-to-text-built-on-whisper:latest --model /models/enhanced_speech_to_text_built_on_whisper-large_v2-1.0.1.model --license_key ${license-key}

Replace /opt/phx/models, enhanced_speech_to_text_built_on_whisper-large_v2-1.0.1.model, and license-key with the corresponding values.

With this command, the container will start, and the microservice will be listening on port 8080 on localhost.
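
Equivalently, the same options can be supplied through the environment variables listed in the help output:

docker run --rm -it -v /opt/phx/models:/models -p 8080:8080 -e PHX_MODEL_PATH=/models/enhanced_speech_to_text_built_on_whisper-large_v2-1.0.1.model -e PHX_LICENSE_KEY=${license-key} phonexia/enhanced-speech-to-text-built-on-whisper:latest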

Performance optimization

The enhanced-speech-to-text-built-on-whisper microservice supports GPU acceleration and vertical scaling to optimize resource utilization and to enhance performance.

GPU acceleration is enabled by default in the GPU-enabled image, which requires a CUDA-capable GPU in the system. While most of the processing runs on the GPU, some tasks still rely on CPU resources.
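
As a sketch, starting the GPU variant could look like this, where <gpu-image-tag> is a placeholder for the tag of the GPU-enabled image and --gpus all assumes a Docker host with the NVIDIA container toolkit installed:

docker run --rm -it --gpus all -v /opt/phx/models:/models -p 8080:8080 phonexia/enhanced-speech-to-text-built-on-whisper:<gpu-image-tag> --model /models/enhanced_speech_to_text_built_on_whisper-large_v2-1.0.1.model --license_key ${license-key} --device cuda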

Scaling parameters can be used to control the parallelism to optimally utilize available resources and to achieve the desired trade-off between throughput and latency. The microservice supports the following parameters:

  • num_instances_per_device: Specifies the number of parallel transcriber instances to run on a single device (CPU or GPU). This value is applied consistently across all available devices.
  • num_threads_per_instance: Defines the number of CPU threads to utilize per transcriber instance.
  • device_indices: Specifies the indices of CPU or GPU devices where transcriber instances should run.

The total number of concurrent transcriber instances is determined by multiplying num_instances_per_device by the number of devices specified by device_indices. The resulting value represents the maximum number of transcription requests that the microservice can process simultaneously.
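
For example, running two instances on each of two GPUs (indices 0 and 1) allows up to 2 * 2 = 4 concurrent transcription requests. A sketch, assuming the GPU image tag placeholder from above and that --device_indices may be repeated once per device (as the ... in the help output suggests):

docker run --rm -it --gpus all -v /opt/phx/models:/models -p 8080:8080 phonexia/enhanced-speech-to-text-built-on-whisper:<gpu-image-tag> --model /models/enhanced_speech_to_text_built_on_whisper-large_v2-1.0.1.model --license_key ${license-key} --device cuda --num_instances_per_device 2 --device_indices 0 --device_indices 1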

Finding optimal scaling parameters

The primary limiting factor when scaling is memory bandwidth. Whisper models, given their large size, require significant data transfers between the CPU and RAM, or between the GPU and video RAM (VRAM) in the case of GPU acceleration. Increasing parallelization per device (by adjusting num_instances_per_device or num_threads_per_instance) will eventually saturate the memory bandwidth; beyond a certain level of parallelization, the performance gains diminish.

CPU processing

The effectiveness of CPU processing depends on various factors, including hardware specification and model size. Empirical analysis is essential to determine optimal parameters.

For latency prioritization, set num_instances_per_device to 1 and focus on tuning num_threads_per_instance. If throughput is the priority, adjust both num_instances_per_device and num_threads_per_instance to find the optimal utilization.
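
For instance, a latency-oriented CPU configuration might run a single instance with eight threads; the thread count here is only an illustrative starting point to tune from, not a recommendation:

docker run --rm -it -v /opt/phx/models:/models -p 8080:8080 phonexia/enhanced-speech-to-text-built-on-whisper:latest --model /models/enhanced_speech_to_text_built_on_whisper-large_v2-1.0.1.model --license_key ${license-key} --device cpu --num_instances_per_device 1 --num_threads_per_instance 8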

GPU processing

With GPU processing enabled, the most computationally demanding tasks are handled by the GPU. Therefore, setting num_threads_per_instance to 1 is sufficient, as it only controls CPU parallelization.

To achieve minimal latency, set num_instances_per_device to 1. This prevents multiple instances from competing for the same GPU resources.

For enhanced throughput, gradually increment num_instances_per_device while observing the throughput. Once the throughput plateaus or decreases, the optimal balance between latency and throughput has been reached. Based on our experiments, setting num_instances_per_device to 3 provides the best throughput regardless of model size and GPU.

Microservice communication

gRPC API

For communication, our microservices use gRPC, which is a high-performance, open-source Remote Procedure Call (RPC) framework that enables efficient communication between distributed systems using a variety of programming languages. We use an interface definition language to specify a common interface and contracts between components. This is primarily achieved by specifying methods with parameters and return types.

Take a look at our gRPC API documentation. The enhanced-speech-to-text-built-on-whisper microservice defines a SpeechToText service with remote procedures called Transcribe and ListSupportedLanguages. The Transcribe procedure accepts an argument (also referred to as "message") called TranscribeRequest, which contains the audio as an array of bytes, together with an optional config argument.

This TranscribeRequest argument is streamed, meaning that it may be received in multiple requests, each containing a part of the audio. If specified, the optional config argument must be sent only with the first request. Once all requests have been received and processed, the Transcribe procedure returns a message called TranscribeResponse which consists of the resulting transcription segments.
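
For illustration, a minimal Python sketch of a streaming Transcribe call using the grpcio package is shown below. The generated module names (speech_to_text_pb2, speech_to_text_pb2_grpc) and the message field names (audio, segments) are assumptions made for this example; the authoritative definitions are in the gRPC API documentation and the published .proto files.

# A minimal, illustrative sketch only. The generated module names and
# the field names ("audio", "segments") are assumptions; take the
# authoritative definitions from the gRPC API documentation.
import grpc
import speech_to_text_pb2 as stt
import speech_to_text_pb2_grpc as stt_grpc

CHUNK_SIZE = 64 * 1024  # stream the audio in 64 KiB chunks

def requests(audio_path):
    # Yield one TranscribeRequest per audio chunk. If the optional
    # config is used, it must be attached to the first request only.
    with open(audio_path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield stt.TranscribeRequest(audio=chunk)

with grpc.insecure_channel("localhost:8080") as channel:
    stub = stt_grpc.SpeechToTextStub(channel)
    response = stub.Transcribe(requests("recording.wav"))
    for segment in response.segments:
        print(segment)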

Connecting to the microservice

There are multiple ways to communicate with our microservices.

Phonexia Python client

The easiest way to get started with testing is to use our simple Python client. To get it, run:

pip install phonexia-enhanced-speech-to-text-built-on-whisper-client

After the successful installation, run the following command to see the client options:

enhanced_speech_to_text_built_on_whisper_client --help

Versioning

We use Semantic Versioning.