Enhanced Speech To Text Built On Whisper
Phonexia enhanced-speech-to-text-built-on-whisper is a tool for transcribing speech from audio recordings into written text. This tool uses custom voice activity detection for better performance. To learn more, visit the technology's home page.
Versioning
We use Semantic Versioning.
Quick reference
- Maintained by Phonexia
- Contact us via e-mail, or using Phonexia Service Desk
- File an issue
- See list of licenses
- See terms of use
How to use this image
Getting the image
You can easily obtain the Docker image from Docker Hub. There are two variants of the image: one for CPU, and one for GPU with a tag ending in `gpu`.
To get the latest CPU image, run:
docker pull phonexia/enhanced-speech-to-text-built-on-whisper:latest
To get the latest GPU image, run:
docker pull phonexia/enhanced-speech-to-text-built-on-whisper:gpu
Running the image
You can start the microservice and list all the supported options by running:
docker run --rm -it phonexia/enhanced-speech-to-text-built-on-whisper:latest --help
The output should look like this:
Usage: enhanced-speech-to-text-built-on-whisper [OPTIONS]
Options:
-h,--help Print this help message and exit
-m,--model file/dir REQUIRED (Env:PHX_MODEL_PATH)
Path to model file or directory.
-k,--license_key string REQUIRED (Env:PHX_LICENSE_KEY)
License key.
-a,--listening_address address [[::]] (Env:PHX_LISTENING_ADDRESS)
Address on which the server will be listening. Address '[::]' also accepts IPv4 connections.
-p,--port number [8080] (Env:PHX_PORT)
Port on which the server will be listening.
-l,--log_level level [info] (Env:PHX_LOG_LEVEL)
Logging level. Possible values: error, warning, info, debug, trace.
--device TEXT:{cpu,cuda} [cpu] (Env:PHX_DEVICE)
Compute device used for inference
--num_threads_per_instance NUM [0] (Env:PHX_NUM_THREADS_PER_INSTANCE)
Number of threads per instance (applies to CPU processing only). The microservice uses N CPU threads for each request. The number of threads is detected automatically if set to 0.
--num_instances_per_device NUM:UINT > 0 [1] (Env:PHX_NUM_INSTANCES_PER_DEVICE)
Number of instances per device. Microservice can process requests concurrently if value is >1. Maximum number of concurrently running requests is (num_instances_per_device * device_indices.size())
--device_indices INT [[0]] ... (Env:PHX_DEVICE_INDICES)
List of devices to run the model on. Microservice can process requests concurrently if number of devices is >1. Maximum number of concurrently running requests is (num_instances_per_device * device_indices.size())
--use_vad BOOLEAN [1] (Env:PHX_USE_VAD)
Whether to use Voice Activity Detection (VAD) filtering
--seed UINT (Env:PHX_SEED) Seed for random generator
Note that the `model` and `license_key` options are required. To obtain the model and license, contact Phonexia.
You can specify the options either via command-line arguments or via environment variables.
Run the container with the mandatory parameters:
docker run --rm -it -v ${absolute-path-to-models}:/models phonexia/enhanced-speech-to-text-built-on-whisper:latest --model /models/${model} --license_key ${license-key}
Replace `absolute-path-to-models`, `model`, and `license-key` with the corresponding values.
With this command, the container will start, and the microservice will be listening on port 8080 on localhost.
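Once the container is running, you may want to verify that the port is reachable before sending gRPC requests. A minimal sketch using only the Python standard library (the host and port match the defaults above; adjust as needed):

```python
import socket

def is_listening(host: str = "localhost", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if a TCP server accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the container from above running, this reports True:
print(is_listening())
```

This only checks TCP reachability; a successful connection does not guarantee the model has finished loading.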
Performance optimization
The `enhanced-speech-to-text-built-on-whisper` microservice supports GPU acceleration and vertical scaling to optimize resource utilization and enhance performance.
GPU acceleration is enabled by default in the GPU-enabled image, which requires a CUDA-capable GPU in the system. Even with GPU acceleration, some processing steps still rely on CPU resources.
Scaling parameters can be used to control parallelism, so that available resources are utilized optimally and the desired trade-off between throughput and latency is achieved. The microservice supports the following parameters:
- `num_instances_per_device`: Specifies the number of parallel transcriber instances to run on a single device (CPU or GPU). This value is applied consistently across all available devices.
- `num_threads_per_instance`: Defines the number of CPU threads to utilize per transcriber instance.
- `device_indices`: Specifies the indices of the CPU or GPU devices on which transcriber instances should run.
The total number of concurrent transcriber instances is determined by multiplying `num_instances_per_device` by the number of devices specified in `device_indices`. The resulting value is the maximum number of transcription requests that the microservice can process simultaneously.
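As a quick sanity check, the capacity formula can be evaluated directly; the values below are illustrative:

```python
# Maximum concurrent transcription requests =
#   num_instances_per_device * number of devices in device_indices
num_instances_per_device = 3
device_indices = [0, 1]  # e.g. two GPUs

max_concurrent_requests = num_instances_per_device * len(device_indices)
print(max_concurrent_requests)  # → 6
```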
Finding optimal scaling parameters
The primary limiting factor when scaling is memory bandwidth. Whisper models, given their large size, require significant data transfers between the CPU and RAM, or between the GPU and video RAM (VRAM) in the case of GPU acceleration. Increasing parallelization per device (by raising `num_instances_per_device` or `num_threads_per_instance`) will eventually saturate the memory bandwidth, and beyond a certain level of parallelization, the performance gains diminish.
CPU processing
The effectiveness of CPU processing depends on various factors, including hardware specification and model size. Empirical analysis is essential to determine optimal parameters.
For latency prioritization, set `num_instances_per_device` to 1 and focus on tuning `num_threads_per_instance`.
If throughput is the priority, adjust both `num_instances_per_device` and `num_threads_per_instance` to find the optimal utilization.
GPU processing
With GPU processing enabled, the most computationally demanding tasks are handled by the GPU. Setting `num_threads_per_instance` to 1 is therefore sufficient, as it only controls CPU parallelization.
To achieve minimal latency, set `num_instances_per_device` to 1. This prevents multiple instances from competing for the same GPU resources.
For higher throughput, gradually increment `num_instances_per_device` while observing the throughput. Once the throughput plateaus or decreases, the optimal balance between latency and throughput has been reached. In our experiments, setting `num_instances_per_device` to 3 provided the best throughput regardless of model size and GPU.
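The tuning procedure above can be sketched as a simple search loop. `measure_throughput` is a hypothetical callback you would implement yourself, for example by restarting the microservice with the given `num_instances_per_device` and benchmarking audio seconds transcribed per wall-clock second:

```python
def find_best_instance_count(measure_throughput, max_instances: int = 8,
                             tolerance: float = 1.05) -> int:
    """Increase num_instances_per_device until throughput stops improving.

    measure_throughput(n) must return the observed throughput when the
    microservice runs with num_instances_per_device=n.
    """
    best_n, best_tp = 1, measure_throughput(1)
    for n in range(2, max_instances + 1):
        tp = measure_throughput(n)
        if tp < best_tp * tolerance:  # plateaued or decreased
            break
        best_n, best_tp = n, tp
    return best_n

# Example with a synthetic throughput curve that plateaus at 3 instances:
synthetic = {1: 1.0, 2: 1.8, 3: 2.4, 4: 2.45, 5: 2.3}
print(find_best_instance_count(lambda n: synthetic[n]))  # → 3
```

The `tolerance` factor treats gains under 5% as a plateau; tune it to your own latency and cost constraints.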
Microservice communication
gRPC API
For communication, our microservices use gRPC, a high-performance, open-source Remote Procedure Call (RPC) framework that enables efficient communication between distributed systems in a variety of programming languages. We use an interface definition language to specify a common interface and the contracts between components, primarily by specifying methods with their parameters and return types.
Take a look at our gRPC API documentation. The `enhanced-speech-to-text-built-on-whisper` microservice defines a `SpeechToText` service with remote procedures called `Transcribe` and `ListSupportedLanguages`. The `Transcribe` procedure accepts an argument (also referred to as a "message") called `TranscribeRequest`, which contains the audio as an array of bytes, together with an optional config argument.
The `TranscribeRequest` argument is streamed, meaning that it may be received in multiple requests, each containing a part of the audio. If specified, the optional config argument must be sent only with the first request. Once all requests have been received and processed, the `Transcribe` procedure returns a message called `TranscribeResponse`, which consists of the resulting transcription segments.
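The streaming behavior can be illustrated with a plain-Python chunking sketch. The chunk size here is an arbitrary illustrative value, and the comments indicate how each chunk would map to a streamed `TranscribeRequest`:

```python
CHUNK_SIZE = 64 * 1024  # bytes per streamed message; illustrative, not a protocol requirement

def audio_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield successive chunks of raw audio, one per streamed request.

    With the phonexia-grpc library, each chunk would be wrapped as
    stt.TranscribeRequest(audio=phx_core.Audio(content=chunk)); the optional
    config, if used, is attached to the first request only.
    """
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]

# 150 KiB of audio is split into three streamed requests:
chunks = list(audio_chunks(b"\x00" * (150 * 1024)))
print(len(chunks))  # → 3
```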
Connecting to microservice
There are multiple ways to communicate with our microservices.
Using generated library
The most common way to communicate with the microservices is from a programming language, using a generated library.
Python library
If you use Python as your programming language, you can use our gRPC Python library.
To get this library, simply run:
pip install phonexia-grpc
You can then import:
- specific libraries for each microservice that provide the message wrappers
- stubs for the gRPC clients
# phx_core contains classes common for multiple microservices like `Audio`.
import phonexia.grpc.common.core_pb2 as phx_core
# enhanced_speech_to_text_built_on_whisper_pb2 contains `TranscribeRequest`, `TranscribeResponse` and `TranscribeConfig`.
import phonexia.grpc.technologies.enhanced_speech_to_text_built_on_whisper.v1.enhanced_speech_to_text_built_on_whisper_pb2 as stt
# enhanced_speech_to_text_built_on_whisper_pb2_grpc contains `SpeechToTextStub` needed to make requests.
import phonexia.grpc.technologies.enhanced_speech_to_text_built_on_whisper.v1.enhanced_speech_to_text_built_on_whisper_pb2_grpc as stt_grpc
Generate a library for the programming language of your choice
For the definition of microservice interfaces, we use the standard mechanism of protocol buffers. The services, together with the procedures and messages that they expose, are defined in so-called `.proto` files.
The `.proto` files can be used to generate client libraries in many programming languages. Take a look at the protobuf tutorials to get started with generating a library in the language of your choice using the `protoc` tool.
You can find the `.proto` files developed by Phonexia in this repository.
Using existing clients
Phonexia Python client
The easiest way to get started with testing is to use our simple Python client. To get it, run:
pip install phonexia-enhanced-speech-to-text-built-on-whisper-client
After the successful installation, run the following command to see the client options:
enhanced_speech_to_text_built_on_whisper_client --help
grpcurl client
If you need a simple tool for testing the microservice on the command line, you can use grpcurl. This tool can serialize and send a request for you, if you provide the request body in JSON format and specify the endpoint.
You need to make sure that the audio content in the body is Base64-encoded. Unfortunately, you need to do this manually, as `grpcurl` can't do it for you.
echo -n '{"audio": {"content": "'$(base64 -w0 < ${path_to_audio_file})'"}}' > ${path_to_body}
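If you prefer Python over shell for preparing the body, the same Base64-encoded JSON can be built as follows (the file-handling lines are commented out and use the same placeholder paths as above):

```python
import base64
import json

def build_grpcurl_body(audio: bytes) -> str:
    """Return the JSON body grpcurl expects, with Base64-encoded audio content."""
    return json.dumps({"audio": {"content": base64.b64encode(audio).decode("ascii")}})

# Replace the placeholder paths with your own values:
# with open(path_to_audio_file, "rb") as f:
#     body = build_grpcurl_body(f.read())
# with open(path_to_body, "w") as f:
#     f.write(body)
print(build_grpcurl_body(b"\x00\x01"))
```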
Replace `path_to_audio_file` and `path_to_body` with the corresponding values.
Now you can make the request. The microservice supports reflection, which means that you don't need to know the API in advance to make a request.
grpcurl -plaintext -use-reflection -d @ localhost:8080 phonexia.grpc.technologies.enhanced_speech_to_text_built_on_whisper.v1.SpeechToText/Transcribe < ${path_to_body}
`grpcurl` automatically serializes the response to this request into JSON, including the transcription segments and the detected language.
GUI clients
If you'd prefer to use a GUI client such as Postman or Warthog to test the microservice, take a look at the GUI Client page in our documentation. Note that you will still need to convert the audio to Base64 manually, as those tools do not support it by default either.