Frequently Asked Questions
What hypervisors are supported by the Speech Platform 4 Virtual Appliance (VA)?
Hypervisors for which we provide installation guides are:
With GPU passthrough support:
- VMware ESXi
- Proxmox/QEMU/libvirt
- Microsoft Hyper-V (only server editions)
Without GPU passthrough support:
- VMware Workstation Pro
- VirtualBox (suitable for testing and small-scale deployments, not for large-scale production)
Cloud platforms where our Partners have successfully deployed the VA (no installation guides available):
- Google Cloud Platform, AWS, and Microsoft Azure work and support GPU passthrough, but we do not provide support for deployment on these platforms, only for the internal operation of the Virtual Appliance.
In general, please be aware that any hypervisor which supports GPUs needs to be a Type 1 hypervisor, or a GPU instance in the case of cloud platforms.
How do I deploy the Virtual Appliance on any supported hypervisor?
Please refer to our installation guides for various hypervisors here: Installation guides.
What are the best GPUs for running GPU-supported technologies?
We support only NVIDIA GPUs, as they are CUDA capable. For a complete list, please refer to CUDA GPU Compute Capability - every GPU in that list is suitable for use with our GPU-powered technologies, and the higher the Compute Capability, the better. Please also refer to our guide on System Requirements.
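As a purely illustrative sketch (not part of our official tooling), the following Python snippet queries the name and Compute Capability of the GPUs visible to the NVIDIA driver. It assumes the NVIDIA driver is installed and that your nvidia-smi version supports the `compute_cap` query field (available in recent driver releases); on older drivers, look the model up in NVIDIA's Compute Capability table instead.

```python
import subprocess

# Query the name and Compute Capability of every GPU the NVIDIA driver can see.
# Note: the "compute_cap" query field requires a reasonably recent nvidia-smi.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    print("nvidia-smi failed - is the NVIDIA driver installed?")
    print(result.stderr.strip())
else:
    for line in result.stdout.strip().splitlines():
        name, compute_cap = [field.strip() for field in line.split(",")]
        print(f"{name}: Compute Capability {compute_cap}")
```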
I want to transcribe 200 hours per day through Enhanced Speech to Text Built on Whisper, what GPU should I use?
We cannot recommend an exact GPU model, because throughput depends heavily on various factors such as audio quality, audio codec, the percentage of net speech in the audio, the language of the recording, etc. Please refer to our performance measurement page here: Enhanced Speech to Text Built on Whisper performance measurements.
Our rough estimate is that an NVIDIA RTX 4000 Ada/NVIDIA T4 is capable of handling 20 FTRT = 20 hours of audio per processing hour for the English language. As English is the fastest language, other languages will have lower throughput. We strongly recommend testing on the target data before switching the environment to production.
For reference purposes, we have also tested that an NVIDIA RTX 4060 is capable of handling 10 FTRT for the English language. Please note that consumer-grade GPUs are not meant for continuous processing, and using them as such might void their warranty.
For more information about the FTRT metric, please refer to our explanation page.
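To make the sizing arithmetic concrete, here is a minimal, illustrative Python sketch. The 20 FTRT figure is the rough English-language estimate quoted above, the 200 hours/day workload comes from the question, and the result ignores overhead, traffic peaks, and slower languages - always verify throughput on your own data.

```python
import math

# Workload from the question and the rough throughput estimate quoted above.
audio_hours_per_day = 200   # hours of audio to transcribe every day
gpu_ftrt = 20               # rough estimate for NVIDIA RTX 4000 Ada / T4, English

# FTRT needed to keep up when processing around the clock:
# 200 hours of audio / 24 wall-clock hours ≈ 8.3 FTRT.
required_ftrt = audio_hours_per_day / 24

# Number of such GPUs needed (no overhead, peaks, or non-English languages considered).
gpus_needed = math.ceil(required_ftrt / gpu_ftrt)

print(f"Required throughput: {required_ftrt:.1f} FTRT")
print(f"GPUs needed (rough): {gpus_needed}")
```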
I want to enable GPU passthrough on my hypervisor - how do I do it?
This is out of scope for Phonexia and is your responsibility. We have guides on how to enable GPU passthrough on VMware ESXi and Proxmox, but your mileage may vary. Should you be interested, please find them on the installation guides page. Please note that these GPU passthrough guides are provided as-is, without any guarantee that they will work in your target environment.
Something is not working as it should - what should I do?
Before contacting Phonexia Consulting and Support Team, please refer to the Troubleshooting guide in our Documentation - there's a host of information on various scenarios which can happen during operation of the Virtual Appliance.
Should the issue persist even after undertaking the steps mentioned in the troubleshooting guide, please generate diagnostic data following the Getting Diagnostics Data from Virtual Appliance guide and provide them to the Phonexia Consulting and Support Team together with an exact description of the issue.
How do I find out if the GPU is enabled and the Virtual Appliance can see it?
Use the Adjustments guide, section Run Technology on GPU, specifically the part with the nvidia-smi command.
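For a quick check from a console inside the Virtual Appliance, the following minimal Python sketch (purely illustrative, not part of the official guide) runs `nvidia-smi -L`, which lists the GPUs visible to the NVIDIA driver:

```python
import subprocess

# "nvidia-smi -L" prints one line per GPU visible to the NVIDIA driver.
# A missing binary, a non-zero exit code, or an empty list means the
# Virtual Appliance cannot see any GPU.
try:
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
except (FileNotFoundError, subprocess.CalledProcessError) as error:
    print(f"GPU not visible or driver missing: {error}")
else:
    gpus = [line for line in result.stdout.splitlines() if line.strip()]
    print(f"Detected {len(gpus)} GPU(s):")
    for gpu in gpus:
        print(f"  {gpu}")
```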
What should I do if the output quality of any technology is poor?
The most common reason for poor output is poor input quality: heavily compressed audio, too low a bitrate, too much noise, non-speech segments, reverberation, a silent channel, etc. The quality of the audio plays a crucial role in achieving satisfactory results with any speech processing technology, whether it's simple voice activity detection, speech transcription, voice biometry, or other applications.
There are two main aspects of audio quality:
- technical quality of the audio data (format, codec, bitrate, SNR, etc.)
- sound quality of the actual content (background noise, reverberations, etc.)
Technical quality
Using inappropriate audio codec, heavy compression, or too low bitrate can damage or even completely destroy essential parts of the audio signal required by speech technologies.
Commonly used audio compressions exploit the perceptual limitation of human hearing and can remove frequencies which are covered by other frequencies. Therefore, to get satisfactory results from speech technologies, it is crucial to use an appropriate audio format.
Tools like MediaInfo can easily give you technical information about your audio files.
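As an illustrative sketch only, the Python snippet below uses the pymediainfo package (a wrapper around the MediaInfo library; both need to be installed) to print the technical parameters that matter most for speech technologies. The attribute names follow pymediainfo's documented fields, but treat the exact values returned as something to verify against your own files.

```python
from pymediainfo import MediaInfo


def describe_audio(path: str) -> None:
    """Print the codec, bitrate, sampling rate, and channel count of an audio file."""
    media_info = MediaInfo.parse(path)
    for track in media_info.tracks:
        if track.track_type == "Audio":
            print(f"Format:        {track.format}")
            print(f"Bit rate:      {track.bit_rate} bit/s")
            print(f"Sampling rate: {track.sampling_rate} Hz")
            print(f"Channels:      {track.channel_s}")


describe_audio("example_call.wav")  # hypothetical file name
```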
| 👍 DO'S | 👎 DON'TS |
|---|---|
| Set your PBX, media server, or recording device to one of these formats (in order of preference): | Avoid prioritizing the smallest possible audio file sizes, attempting to squeeze the maximum number of recordings into minimal storage space. Severe compressions like MPEG 2.5 Layer 3 (MP3) with bitrates of only 16 or even 12 kbit/s per channel significantly degrade the audio quality. If you really have to use MP3, use bitrates of at least 32 kbit/s per channel, and refrain from using joint-stereo encoding[^1] for 2-channel audio. Use full stereo instead. |
[^1]: The joint-stereo encoding – which is commonly used by default in MP3 encoders – is tailored for music audio, where both channels usually contain almost the same signal. Using joint-stereo encoding for telephony stereo, where each channel contains a completely different signal (when one side speaks, the other side is silent), actually cripples the audio further.
If the audio has already been heavily compressed, converting it to one of the recommended formats does not restore the information lost during the original compression.
Sound quality
Quality of the actual audio content is just as important as the technical quality.
Unwanted sounds such as room reverberations, background noise (e.g., cars on the street, dogs barking nearby), ambient voices (e.g., people talking in the office, TV playing in the room), or compression artifacts can significantly impact the effectiveness of speech technologies (e.g., speaker identification precision, transcription accuracy).
Therefore, it is essential to ensure the audio is as clean as possible.
| 👍 DO'S | 👎 DON'TS |
|---|---|
| Capture the sound as close to the source as possible, i.e. to minimize the amount of ambient sounds and noise, reverberations, or artifacts caused by potential multiple recordings during transfer. Store the audio in an appropriate format (see above) to avoid distorting the sound with compression artifacts. | In general, the following recording methods or sources negatively affect sound quality: These devices are typically designed to capture all sounds, including those undesirable for speech processing, such as office ambient noise, reverberations, other people talking, and background TV noise. Also, do not store the recorded audio in compressed formats. Typically, surveillance cameras, smartphones, or bugs tend to use heavily compressed formats by default. |