
Scaling

The Enhanced Speech to Text Built on Whisper can be scaled to optimize resource utilization and to achieve the desired balance between the deployment's throughput (the number of requests processed) and latency (the processing time per request). In this article, we explore the scaling parameters that can be adjusted to achieve optimal performance.

Note: The virtual appliance and microservice deployments expose the same scaling parameters under slightly different names due to different naming conventions. For more details, see the installation pages for the microservice and virtual appliance.

GPU acceleration

GPU acceleration can be enabled by setting the device option to CUDA (requires a GPU-enabled image). While most of the processing is then offloaded to the GPU, certain processing tasks still rely on CPU resources.
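
Before enabling GPU acceleration, it can be useful to confirm that CUDA devices are actually visible on the host. The snippet below is only a host-side sanity check and assumes PyTorch is installed on the machine; it is not part of the deployment configuration itself.

```python
import torch  # assumption: PyTorch is available on the host for this check

# Host-side sanity check: list the CUDA devices a GPU-enabled image could use.
if torch.cuda.is_available():
    for index in range(torch.cuda.device_count()):
        print(f"GPU {index}: {torch.cuda.get_device_name(index)}")
else:
    print("No CUDA device visible; GPU acceleration cannot be enabled on this host.")
```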

Scaling parameters

Scaling parameters can be used to control the parallelism to optimally utilize available resources and to achieve the desired trade-off between throughput and latency. The deployment can be tuned with the following parameters:

  • device_indices: Specifies which CPU or GPU devices to run transcriber instances on.
  • num_instances_per_device: Specifies the number of concurrent transcriber instances to run on a single device (CPU or GPU). This value is applied to all devices specified by the device_indices parameter.
  • num_threads_per_instance: Defines the number of CPU threads to utilize per transcriber instance. This primarily affects CPU processing. When using GPU acceleration, one thread per instance is sufficient.

The total number of concurrent transcriber instances is determined by multiplying num_instances_per_device by the number of devices specified by device_indices. The resulting value is the maximum number of transcription requests that the transcriber can process simultaneously.
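
For example, with hypothetical values of device_indices set to [0, 1] and num_instances_per_device set to 2, the deployment runs 2 × 2 = 4 transcriber instances and can therefore serve at most 4 transcription requests at once. The sketch below only restates this arithmetic; the variable names mirror the scaling parameters, but the snippet itself is purely illustrative.

```python
# Purely illustrative: how the scaling parameters combine (hypothetical values).
device_indices = [0, 1]        # run transcriber instances on devices 0 and 1
num_instances_per_device = 2   # two transcriber instances per device
num_threads_per_instance = 1   # CPU threads per instance (1 is enough with GPUs)

# Maximum number of transcription requests processed simultaneously.
total_instances = len(device_indices) * num_instances_per_device
print(f"Maximum concurrent transcription requests: {total_instances}")  # -> 4
```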

Finding Optimal Scaling Parameters

The primary limiting factor when scaling is memory bandwidth. Our Speech to Text solution based on Whisper uses large models, requiring significant data transfers between the CPU and RAM, or between the GPU and video RAM (VRAM) in the case of GPU acceleration. Increasing parallelization per device (by adjusting num_instances_per_device or num_threads_per_instance) will eventually saturate the memory bandwidth, and beyond a certain level of parallelization, performance gains diminish.

CPU Processing

The effectiveness of CPU processing depends on various factors, including hardware specifications and model size. Empirical analysis is essential to determine the optimal parameters.

For latency prioritization, set num_instances_per_device to 1 and focus on tuning num_threads_per_instance. If throughput is the priority, adjust both num_instances_per_device and num_threads_per_instance to find the combination that best utilizes the available resources.
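
One practical way to carry out this empirical analysis is to time a representative recording against the deployment after each configuration change. The sketch below is a minimal illustration of such a measurement; transcribe_file is a hypothetical placeholder for whichever client you use to call your deployment.

```python
import time

def transcribe_file(path: str) -> None:
    """Placeholder: send `path` to your Speech to Text deployment and wait for the result."""
    raise NotImplementedError

def measure_latency(path: str, repeats: int = 3) -> float:
    """Average wall-clock time of a single transcription request."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        transcribe_file(path)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Redeploy the service with each candidate num_threads_per_instance value,
# then record the latency it achieves, e.g.:
#   latency_at_8_threads = measure_latency("representative_sample.wav")
```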

GPU Processing

With GPU processing enabled, the most computationally demanding tasks are handled by the GPU. Therefore, setting num_threads_per_instance to 1 is sufficient, as it only controls CPU parallelization.

To achieve minimal latency, set num_instances_per_device to 1. This prevents multiple instances from competing for the same GPU resources.

For enhanced throughput, gradually increment num_instances_per_device while observing the throughput. Once the throughput plateaus or decreases, the optimal balance between latency and throughput has been reached. Based on our experiments, setting num_instances_per_device to 3 provides the best throughput regardless of model size and GPU model.
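
To detect this plateau on your own hardware, you can measure requests per second at a matching level of request concurrency after each redeployment. As with the latency sketch above, transcribe_file is a hypothetical placeholder for your deployment's client; the measurement logic itself is only a sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def transcribe_file(path: str) -> None:
    """Placeholder: send `path` to your Speech to Text deployment and wait for the result."""
    raise NotImplementedError

def measure_throughput(path: str, concurrent_requests: int) -> float:
    """Requests completed per second when `concurrent_requests` run in parallel."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        futures = [pool.submit(transcribe_file, path) for _ in range(concurrent_requests)]
        for future in futures:
            future.result()
    return concurrent_requests / (time.perf_counter() - start)

# Redeploy with num_instances_per_device = 1, 2, 3, ... and record the throughput
# at a matching request concurrency; stop increasing once it plateaus or drops.
```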