Version: 3.4.0

Voice Activity Detection

Voice Activity Detection identifies the presence of speech in audio recordings. The technology is independent of the speaker, language, accent, domain, and channel. Its output is a list of labeled time segments in the original audio, indicating where speech is present. Currently, the technology produces only speech-labeled segments; additional labels may be added in the future.
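
The exact output format depends on the interface you use, but conceptually the result is just a list of labeled time segments. The following Python sketch is purely illustrative; the field names are hypothetical and do not represent the actual API schema:

```python
# Hypothetical illustration of a Voice Activity Detection result:
# a list of labeled time segments, with times in seconds.
segments = [
    {"start": 0.00, "end": 2.35, "label": "speech"},
    {"start": 4.10, "end": 9.87, "label": "speech"},
    {"start": 12.40, "end": 15.02, "label": "speech"},
]

# Total amount of detected speech in the recording.
total_speech = sum(s["end"] - s["start"] for s in segments)
print(f"Detected {total_speech:.2f} s of speech in {len(segments)} segments")
```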

Voice Activity Detection is highly efficient in terms of both memory consumption and CPU usage. As the speed and memory measurements below show, the technology runs significantly faster than real time.

Use cases and application areas

For the reasons mentioned above, Voice Activity Detection is a crucial tool that often serves as the initial component in many speech processing pipelines. This placement optimizes the overall processing time by reducing the amount of data passed to subsequent resource-intensive components.

Besides saving hardware costs by reducing the processing time of more complex components, the technology is also well suited for rapid filtering, e.g., discarding audio that contains little or no speech. It also enables systems such as chatbots to detect whether someone is speaking in an audio recording.

How does it work?

Phonexia's Voice Activity Detection technology combines the following three components:

  • Energy-based detection
    Analyzes the signal based on its energy and removes segments where the energy falls below a predefined threshold. This is ideal for removing silence and low-energy noise (a minimal sketch of this idea is shown after this list).
  • Spectrum-based analysis
    Useful for detecting tones, wide-band noise, and other types of noise.
  • Neural Network-based Voice Activity Detection
    Uses a neural network trained on large datasets to recognize voice activity in recordings.
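
To make the first component more concrete, here is a minimal, self-contained sketch of energy-based detection, assuming a mono signal provided as a NumPy array. It is not Phonexia's implementation, only the general idea: frame the signal, measure per-frame energy, and discard frames below a threshold.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, threshold_db=-35.0):
    """Toy energy-based voice activity detector (illustration only).

    A frame is considered active when its energy, relative to the most
    energetic frame, exceeds `threshold_db`. Returns (start, end) times
    in seconds.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)

    # Per-frame energy in dB, relative to the loudest frame.
    energy = np.sum(frames.astype(np.float64) ** 2, axis=1) + 1e-12
    energy_db = 10.0 * np.log10(energy / energy.max())
    active = energy_db > threshold_db

    # Merge consecutive active frames into (start, end) segments.
    segments, start = [], None
    for i, is_active in enumerate(active):
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        segments.append((start * frame_ms / 1000, n_frames * frame_ms / 1000))
    return segments
```

For a 16 kHz recording loaded as `signal`, calling `energy_vad(signal, 16000)` would return a list of segments such as `[(0.0, 2.35), (4.1, 9.85)]`. The production technology goes further by combining this kind of energy analysis with the spectrum-based and neural-network-based detectors described above.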

The combination of these three components enables the technology to detect voice segments in audio recordings even in the presence of certain types of noise.

Speed and memory measurements

Below is a sample measurement performed on a dataset of 20,150 recordings, ranging in length from several seconds to several minutes, with varying amounts of speech.

Experiment            FTRT   Relative Speed-up   RAM [MB]   GPU Memory [MB]
CPU (8 cores)         3137   1.0                 433        -
CPU (8 cores) + GPU   7337   2.3                 585        2055

The measurements were performed on both CPU and GPU on a machine with 15 GB of RAM, an AMD Ryzen 9 7950X3D 16-Core CPU, and an NVIDIA RTX 4000 SFF Ada Generation GPU. A single instance of the Voice Activity Detection microservice, with parallelism enabled and using 8 cores, was run on this machine.

The FTRT (Faster Than Real Time) column indicates the speed-up factor relative to the duration of the audio. For example, an FTRT of 10 means that 10 seconds of audio are processed in 1 second of CPU time.
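
As a quick worked example using the CPU-only figure from the table above (assuming, for simplicity, that FTRT scales linearly with the amount of audio):

```python
# Illustrative back-of-the-envelope estimate, not a guaranteed figure.
audio_seconds = 3600            # one hour of audio
ftrt = 3137                     # CPU-only FTRT from the table above
processing_seconds = audio_seconds / ftrt
print(f"{audio_seconds} s of audio -> ~{processing_seconds:.2f} s of processing")
# 3600 s of audio -> ~1.15 s of processing
```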