Scaling
The Enhanced Speech to Text Built on Whisper can be scaled to optimize resource utilization and achieve the desired balance between throughput (number of requests processed) and latency (processing time per request) of the speech to text deployment. In this article, we explore the scaling parameters that can be adjusted to achieve optimal performance.
Note: The virtual appliance and microservice deployments expose the same scaling parameters under slightly different names due to different naming conventions. For more details, see the installation pages for the microservice and virtual appliance.
GPU acceleration
GPU acceleration can be enabled by setting the `device` option to CUDA (requires a GPU-enabled image). While most of the processing is then GPU-accelerated, specific processing tasks will still rely on CPU resources.
Scaling parameters
Scaling parameters can be used to control the parallelism to optimally utilize available resources and to achieve the desired trade-off between throughput and latency. The deployment can be tuned with the following parameters:
- `device_indices`: Specifies which CPU or GPU devices to run transcriber instances on.
- `num_instances_per_device`: Specifies the number of concurrent transcriber instances to run on a single device (CPU or GPU). This value is applied to all devices specified by the `device_indices` parameter.
- `num_threads_per_instance`: Defines the number of CPU threads to utilize per transcriber instance. This primarily affects CPU processing. When using GPU acceleration, one thread per instance is sufficient.
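As a concrete illustration, the parameters above might be combined as follows for a single-GPU deployment. This is a hypothetical sketch: the actual configuration format and exact parameter names depend on the deployment type (microservice or virtual appliance), so consult the respective installation pages.

```python
# Hypothetical scaling configuration for a single-GPU deployment.
# The real configuration mechanism (environment variables, config file, etc.)
# depends on the deployment type; see the installation pages for exact names.
scaling_config = {
    "device": "cuda",               # enable GPU acceleration (GPU-enabled image required)
    "device_indices": [0],          # run transcriber instances on GPU 0
    "num_instances_per_device": 3,  # three concurrent instances on that GPU
    "num_threads_per_instance": 1,  # one CPU thread suffices with GPU acceleration
}
```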
The total number of concurrent transcriber instances is determined by multiplying `num_instances_per_device` by the number of devices specified by `device_indices`. The result is the maximum number of transcription requests that the transcriber can process simultaneously.
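The calculation can be written out as a one-liner (a simple illustration, not part of the product API):

```python
def max_concurrent_requests(device_indices, num_instances_per_device):
    """Maximum number of transcription requests processed simultaneously."""
    return len(device_indices) * num_instances_per_device

# Two GPUs with 3 transcriber instances each can serve up to 6 requests at once.
print(max_concurrent_requests([0, 1], 3))  # → 6
```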
Finding Optimal Scaling Parameters
The primary limiting factor when scaling is memory bandwidth. Our Speech to Text solution based on Whisper utilizes large models, requiring significant data transfers between the CPU and RAM, or between the GPU and video RAM (VRAM) in the case of GPU acceleration. Increasing parallelization per device (by adjusting `num_instances_per_device` or `num_threads_per_instance`) will eventually saturate the memory bandwidth, and above a certain level of parallelization the performance gains diminish.
CPU Processing
The effectiveness of CPU processing depends on various factors, including hardware specification and model size. Empirical analysis is essential to determine optimal parameters.
For latency prioritization, set `num_instances_per_device` to 1 and focus on tuning `num_threads_per_instance`. If throughput is the priority, adjust both `num_instances_per_device` and `num_threads_per_instance` to find the optimal utilization.
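Since the optimal values must be found empirically, a small client-side harness can help: redeploy with a candidate parameter combination, then measure the throughput the deployment sustains under concurrent load. The sketch below is purely illustrative; `fake_transcribe` is a hypothetical stand-in for a real transcription request against your deployment.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(transcribe, concurrency, num_requests=30):
    """Issue `num_requests` transcription calls with `concurrency` parallel
    workers and return the achieved throughput in requests per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(lambda _: transcribe(), range(num_requests)))
    return num_requests / (time.perf_counter() - start)

# Hypothetical stand-in for a real request to the Speech to Text deployment.
def fake_transcribe():
    time.sleep(0.01)

# After each redeployment with new scaling parameters, rerun the measurement.
for concurrency in (1, 2, 4, 8):
    rps = measure_throughput(fake_transcribe, concurrency)
    print(f"concurrency {concurrency}: {rps:.1f} req/s")
```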
GPU Processing
With GPU processing enabled, the most computationally demanding tasks are handled by the GPU. Therefore, setting `num_threads_per_instance` to 1 is sufficient, as it only controls CPU parallelization.
To achieve minimal latency, set `num_instances_per_device` to 1. This prevents multiple instances from competing for the same GPU resources.
For enhanced throughput, gradually increment `num_instances_per_device` while observing the throughput. Once the throughput plateaus or decreases, the optimal balance between latency and throughput has been reached. Based on our experiments, setting `num_instances_per_device` to 3 provides the best performance in terms of throughput regardless of model size and GPU.
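The stopping rule above (increment until throughput plateaus or drops) can be sketched as a small helper. The throughput numbers below are made-up example measurements; in practice you would substitute values measured against your own deployment.

```python
def best_instances_per_device(throughputs, min_gain=0.05):
    """Given measured throughput (requests/s) keyed by num_instances_per_device,
    return the smallest setting after which adding instances no longer improves
    throughput by at least `min_gain` (5% by default)."""
    settings = sorted(throughputs)
    best = settings[0]
    for prev, cur in zip(settings, settings[1:]):
        if throughputs[cur] < throughputs[prev] * (1 + min_gain):
            break  # plateau or regression: stop increasing
        best = cur
    return best

# Hypothetical measurements: throughput plateaus after 3 instances per device.
measured = {1: 10.0, 2: 18.0, 3: 24.0, 4: 24.5}
print(best_instances_per_device(measured))  # → 3
```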