Scaling
Scaling the Speech to Text built on Whisper involves adjusting its parameters to make optimal use of the available hardware resources. In this article, we explore the scaling parameters for optimizing resource utilization and achieving the desired balance between throughput (number of requests processed per unit of time) and latency (processing time per request) of the speech to text deployment.
These parameters apply to both microservice and virtual machine deployments, as the virtual machine uses the same underlying microservices and exposes similar settings.
GPU acceleration
GPU acceleration can be enabled by setting the device option to CUDA (this requires a GPU-enabled image). Even with GPU acceleration enabled, some processing tasks still rely on CPU resources.
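As a minimal illustration only, the sketch below expresses the device option as a flat key-value configuration. The exact value spelling and the way options are passed to the microservice (environment variables, command-line flags, or a configuration file) depend on your deployment and are assumptions here.

```python
# Minimal sketch, not the documented configuration format.
# A GPU-enabled image is required for GPU acceleration.
transcriber_options = {
    "device": "cuda",  # assumed spelling; the exact accepted values are deployment-specific
}
```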
Scaling parameters
Scaling parameters can be used to control parallelism, to make optimal use of the available resources, and to achieve the desired trade-off between throughput and latency. The deployment can be tuned with the following parameters, illustrated in the configuration sketch after the list:
num_instances_per_device: Specifies the number of concurrent transcriber instances to run on a single device (CPU or GPU). This value is applied across all available devices.
num_threads_per_instance: Defines the number of CPU threads to use per transcriber instance. This primarily affects CPU processing. When using GPU acceleration, one thread per instance is sufficient.
device_indices: Specifies which CPU or GPU devices to run transcriber instances on. While using multiple GPUs is possible, transferring intermediate results between them through RAM introduces additional overhead; it is generally better to run multiple microservice instances, each on a separate device.
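The sketch below extends the same illustrative key-value configuration with the three scaling parameters. The values are examples for a machine with two devices, not recommendations, and the configuration format itself remains an assumption.

```python
# Illustrative values only; tune them empirically as described below.
scaling_options = {
    "num_instances_per_device": 2,  # two transcriber instances on every listed device
    "num_threads_per_instance": 1,  # one CPU thread per instance (enough with GPU acceleration)
    "device_indices": [0, 1],       # run instances on devices 0 and 1
}
```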
The total number of concurrent transcriber instances is determined by multiplying num_instances_per_device by the number of devices specified in device_indices. The resulting value is the maximum number of transcription requests that the transcriber can process simultaneously.
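With the illustrative values above, the maximum number of simultaneously processed requests works out as follows:

```python
num_instances_per_device = 2                # illustrative value from the sketch above
device_indices = [0, 1]                     # illustrative value from the sketch above
max_concurrent_requests = num_instances_per_device * len(device_indices)
print(max_concurrent_requests)              # 2 instances x 2 devices = 4 concurrent requests
```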
Finding Optimal Scaling Parameters
The primary limiting factor when scaling is memory bandwidth. Our Speech to Text solution based on Whisper uses large models, which require significant data transfers between the CPU and RAM, or between the GPU and video RAM (VRAM) in the case of GPU acceleration. Increasing parallelization per device (by raising num_instances_per_device or num_threads_per_instance) will eventually saturate the memory bandwidth, and beyond a certain level of parallelization the performance gains diminish.
CPU Processing
The effectiveness of CPU processing depends on various factors, including hardware specification and model size. Empirical analysis is essential to determine optimal parameters.
For latency prioritization, set num_instances_per_device to 1 and focus on tuning num_threads_per_instance. If throughput is the priority, adjust both num_instances_per_device and num_threads_per_instance to find the optimal utilization.
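Because the optimal CPU settings depend on the hardware and the model, it helps to measure throughput and latency for each candidate setting. The sketch below is one way to do that with a pool of concurrent requests; the endpoint URL, the request format, and the transcribe_file helper are hypothetical placeholders, not the service's actual API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP client; any client works

# Hypothetical endpoint and payload; replace with the real transcription API.
ENDPOINT = "http://localhost:8080/transcribe"
AUDIO_FILES = ["sample1.wav"] * 16  # a fixed workload to compare settings fairly

def transcribe_file(path: str) -> float:
    """Send one transcription request and return its latency in seconds."""
    start = time.perf_counter()
    with open(path, "rb") as audio:
        requests.post(ENDPOINT, files={"file": audio}, timeout=600)
    return time.perf_counter() - start

def run_benchmark(concurrency: int) -> None:
    """Replay the workload with the given number of parallel requests."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(transcribe_file, AUDIO_FILES))
    elapsed = time.perf_counter() - start
    throughput = len(AUDIO_FILES) / elapsed          # requests per second
    avg_latency = sum(latencies) / len(latencies)    # seconds per request
    print(f"concurrency={concurrency} throughput={throughput:.2f} req/s "
          f"avg_latency={avg_latency:.1f} s")

# Re-run this benchmark after each change to num_instances_per_device or
# num_threads_per_instance and compare the reported numbers.
run_benchmark(concurrency=4)
```

Re-running the same workload for every candidate setting keeps the comparison fair; pick the setting with the highest throughput that still meets your latency target.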
GPU Processing
With GPU processing enabled, the most computationally demanding tasks are handled by the GPU. Therefore, setting num_threads_per_instance to 1 is sufficient, as it only controls CPU parallelization.
To achieve minimal latency, set num_instances_per_device to 1. This prevents multiple instances from competing for the same GPU resources.
For enhanced throughput, gradually increment num_instances_per_device while observing the throughput. Once the throughput plateaus or decreases, the optimal balance between latency and throughput has been reached. Based on our experiments, setting num_instances_per_device to 3 provides the best throughput regardless of model size and GPU type.
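One possible way to locate the throughput plateau is to increase num_instances_per_device step by step and stop once the gain becomes negligible. In the sketch below, redeploy_with_instances and measure_throughput are hypothetical helpers standing in for "restart the deployment with the new setting" and "run a fixed benchmark workload" (for example, the harness sketched in the CPU section).

```python
def redeploy_with_instances(num_instances: int) -> None:
    """Hypothetical helper: restart the transcriber with the given
    num_instances_per_device value (e.g. update the config and redeploy)."""
    raise NotImplementedError

def measure_throughput() -> float:
    """Hypothetical helper: run a fixed benchmark workload and return
    the observed throughput in requests per second."""
    raise NotImplementedError

def find_throughput_plateau(max_instances: int = 8, min_gain: float = 0.05) -> int:
    """Increase num_instances_per_device until throughput stops improving
    by at least min_gain (relative), then return the best setting found."""
    best_instances, best_throughput = 1, 0.0
    for num_instances in range(1, max_instances + 1):
        redeploy_with_instances(num_instances)
        throughput = measure_throughput()
        if throughput <= best_throughput * (1 + min_gain):
            break  # plateaued or decreased: keep the previous setting
        best_instances, best_throughput = num_instances, throughput
    return best_instances
```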