Scaling
Scaling the Speech to Text built on Whisper involves adjusting its parameters to make optimal use of the available hardware resources. In this article, we explore the scaling parameters for optimizing resource utilization and achieving the desired balance between throughput (number of requests processed per unit of time) and latency (processing time per request) of the speech to text deployment.
These parameters apply to both microservice and virtual machine deployments, as the virtual machine uses the same underlying microservices and exposes similar settings.
GPU acceleration
GPU acceleration can be enabled by setting the device option to CUDA (this requires a GPU-enabled image). Even with GPU acceleration enabled, some processing tasks still rely on CPU resources.
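As a minimal illustration only, the sketch below expresses the device option as a flat key-value configuration. The exact value spelling and the way options are passed to the microservice (environment variables, command-line flags, or a configuration file) depend on your deployment and are assumptions here.

```python
# Minimal sketch, not the documented configuration format.
# A GPU-enabled image is required for GPU acceleration.
transcriber_options = {
    "device": "cuda",  # assumed spelling; the exact accepted values are deployment-specific
}
```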
Scaling parameters
Scaling parameters can be used to control parallelism, to make optimal use of the available resources, and to achieve the desired trade-off between throughput and latency. The deployment can be tuned with the following parameters, illustrated in the configuration sketch after the list:
num_instances_per_device: Specifies the number of concurrent transcriber instances to run on a single device (CPU or GPU). This value is applied across all available devices.
num_threads_per_instance: Defines the number of CPU threads to use per transcriber instance. This primarily affects CPU processing. When using GPU acceleration, one thread per instance is sufficient.
device_indices: Specifies which CPU or GPU devices to run transcriber instances on. While using multiple GPUs is possible, transferring intermediate results between them through RAM introduces additional overhead; it is generally better to run multiple microservice instances, each on a separate device.
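The sketch below extends the same illustrative key-value configuration with the three scaling parameters. The values are examples for a machine with two devices, not recommendations, and the configuration format itself remains an assumption.

```python
# Illustrative values only; tune them empirically as described below.
scaling_options = {
    "num_instances_per_device": 2,  # two transcriber instances on every listed device
    "num_threads_per_instance": 1,  # one CPU thread per instance (enough with GPU acceleration)
    "device_indices": [0, 1],       # run instances on devices 0 and 1
}
```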
The total number of concurrent transcriber instances is determined by multiplying num_instances_per_device by the number of devices specified in device_indices. The resulting value is the maximum number of transcription requests that the transcriber can process simultaneously.
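With the illustrative values above, the maximum number of simultaneously processed requests works out as follows:

```python
num_instances_per_device = 2                # illustrative value from the sketch above
device_indices = [0, 1]                     # illustrative value from the sketch above
max_concurrent_requests = num_instances_per_device * len(device_indices)
print(max_concurrent_requests)              # 2 instances x 2 devices = 4 concurrent requests
```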
Finding Optimal Scaling Parameters
The primary limiting factor when scaling is memory bandwidth. Our Speech to Text solution based on Whisper uses large models, which require significant data transfers between the CPU and RAM, or between the GPU and video RAM (VRAM) in the case of GPU acceleration. Increasing parallelization per device (by raising num_instances_per_device or num_threads_per_instance) will eventually saturate the memory bandwidth, and beyond a certain level of parallelization the performance gains diminish.
CPU Processing
The effectiveness of CPU processing depends on various factors, including hardware specification and model size. Empirical analysis is essential to determine optimal parameters.
For latency prioritization, set num_instances_per_device to 1 and focus on tuning num_threads_per_instance. If throughput is the priority, adjust both num_instances_per_device and num_threads_per_instance to find the optimal utilization.
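Because the optimal CPU settings depend on the hardware and the model, it helps to measure throughput and latency for each candidate setting. The sketch below is one way to do that with a pool of concurrent requests; the endpoint URL, the request format, and the transcribe_file helper are hypothetical placeholders, not the service's actual API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client; any client works

# Hypothetical endpoint and payload; replace with the real transcription API.
ENDPOINT = "http://localhost:8080/transcribe"
AUDIO_FILES = ["sample1.wav"] * 16  # a fixed workload to compare settings fairly

def transcribe_file(path: str) -> float:
    """Send one transcription request and return its latency in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as audio:
        requests.post(ENDPOINT, files={"file": audio}, timeout=600)
    return time.perf_counter() - start

def run_benchmark(concurrency: int) -> None:
    """Replay the workload with the given number of parallel requests."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(transcribe_file, AUDIO_FILES))
    elapsed = time.perf_counter() - start
    throughput = len(AUDIO_FILES) / elapsed          # requests per second
    avg_latency = sum(latencies) / len(latencies)    # seconds per request
    print(f"concurrency={concurrency} throughput={throughput:.2f} req/s "
          f"avg_latency={avg_latency:.1f} s")

# Re-run this benchmark after each change to num_instances_per_device or
# num_threads_per_instance and compare the reported numbers.
run_benchmark(concurrency=4)
```

Re-running the same workload for every candidate setting keeps the comparison fair; pick the setting with the highest throughput that still meets your latency target.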
GPU Processing
With GPU processing enabled, the most computationally demanding tasks are handled by the GPU. Therefore, setting num_threads_per_instance to 1 is sufficient, as it only controls CPU parallelization.
To achieve minimal latency, set num_instances_per_device to 1. This prevents multiple instances from competing for the same GPU resources.
For enhanced throughput, gradually increment num_instances_per_device while observing the throughput. Once the throughput plateaus or decreases, the optimal balance between latency and throughput has been reached. Based on our experiments, setting num_instances_per_device to 3 provides the best throughput regardless of model size and GPU type.
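One possible way to locate the throughput plateau is to increase num_instances_per_device step by step and stop once the gain becomes negligible. In the sketch below, redeploy_with_instances and measure_throughput are hypothetical helpers standing in for "restart the deployment with the new setting" and "run a fixed benchmark workload" (for example, the harness sketched in the CPU section).

```python
def redeploy_with_instances(num_instances: int) -> None:
    """Hypothetical helper: restart the transcriber with the given
    num_instances_per_device value (e.g. update the config and redeploy)."""
    raise NotImplementedError

def measure_throughput() -> float:
    """Hypothetical helper: run a fixed benchmark workload and return
    the observed throughput in requests per second."""
    raise NotImplementedError

def find_throughput_plateau(max_instances: int = 8, min_gain: float = 0.05) -> int:
    """Increase num_instances_per_device until throughput stops improving
    by at least min_gain (relative), then return the best setting found."""
    best_instances, best_throughput = 1, 0.0
    for num_instances in range(1, max_instances + 1):
        redeploy_with_instances(num_instances)
        throughput = measure_throughput()
        if throughput <= best_throughput * (1 + min_gain):
            break  # plateaued or decreased: keep the previous setting
        best_instances, best_throughput = num_instances, throughput
    return best_instances
```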