Version: 3.4.1

Voice Activity Detection

Voice Activity Detection (VAD) identifies the presence of speech in audio recordings. The technology is independent of the speaker, language, accent, domain, and channel. Its output is a list of labeled time segments in the original audio, indicating where speech is present. Currently, the technology produces only segments labeled as speech, but other labels may be added in the future.
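
For illustration only, such a segment list can be thought of as shown in the sketch below. The field names are hypothetical; the actual response schema is defined by the API reference of the product you use.

```python
# Hypothetical representation of a Voice Activity Detection result:
# a list of labeled time segments (field names are illustrative only).
segments = [
    {"start_time": 0.00, "end_time": 4.35, "label": "speech"},
    {"start_time": 7.10, "end_time": 12.80, "label": "speech"},
]

# Total amount of detected speech in the recording, in seconds.
total_speech = sum(s["end_time"] - s["start_time"] for s in segments)
print(f"Detected {total_speech:.2f} s of speech")
```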

Voice Activity Detection is highly efficient in terms of both memory consumption and CPU usage. As the speed and memory measurements below show, the technology operates significantly faster than real time.

Use cases and application areas

For the reasons mentioned above, Voice Activity Detection is a crucial tool that often serves as the initial component in many speech processing pipelines. This placement optimizes the overall processing time by reducing the amount of data passed to subsequent resource-intensive components.

In addition to saving hardware costs by reducing the processing time of more complex components, this technology is also well suited for rapid filtering (e.g., discarding audio that contains little or no speech). It also enables systems such as chatbots to detect whether someone is speaking in an audio recording.
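
As a sketch of such filtering, a pipeline could discard recordings whose speech ratio falls below a threshold before calling more expensive components. The example below assumes the hypothetical segment format shown earlier, and the threshold value is arbitrary.

```python
# Minimal filtering sketch, assuming the hypothetical segment format above.

def speech_ratio(segments, audio_duration):
    """Fraction of the recording that is labeled as speech."""
    speech = sum(s["end_time"] - s["start_time"] for s in segments)
    return speech / audio_duration if audio_duration > 0 else 0.0

def keep_for_processing(segments, audio_duration, min_ratio=0.1):
    """Pass a recording to downstream components only if it contains enough speech."""
    return speech_ratio(segments, audio_duration) >= min_ratio
```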

How does it work?

Phonexia's Voice Activity Detection technology combines the following three components:

  • Energy-based detection
    Analyzes the signal based on its energy and removes segments where the energy is lower than a predefined threshold. This is ideal for removing silence and low-energy noise; a simplified sketch of this principle is shown below.
  • Spectrum-based analysis
    Useful for detecting tones, wide-band noise, and other types of noise.
  • Neural Network-based Voice Activity Detection
    Uses a neural network trained on large datasets to recognize voice activity in recordings.

The combination of these three components enables the technology to detect voice segments in audio recordings even in the presence of certain types of noise.
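
The energy-based component can be illustrated with a simple frame-level energy threshold. The sketch below shows the general principle only; it is not Phonexia's implementation, and the frame length and threshold values are arbitrary.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sample_rate: int,
               frame_ms: float = 25.0, threshold_db: float = -40.0):
    """Return (start, end) times, in seconds, of regions whose frame energy
    exceeds the threshold. Assumes samples normalized to [-1.0, 1.0]."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    segments, seg_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len].astype(np.float64)
        energy_db = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
        t = i * frame_len / sample_rate
        if energy_db > threshold_db and seg_start is None:
            seg_start = t                      # active region starts here
        elif energy_db <= threshold_db and seg_start is not None:
            segments.append((seg_start, t))    # active region ends at this frame boundary
            seg_start = None
    if seg_start is not None:                  # close a region that runs to the end
        segments.append((seg_start, n_frames * frame_len / sample_rate))
    return segments
```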

Speed and memory measurements

Below is a sample measurement performed on a dataset consisting of 20,150 recordings, ranging in length from several seconds to several minutes, with a variable amount of speech per recording.

| Experiment          | FTRT | Relative speed-up | RAM [MB] | GPU memory [MB] |
|---------------------|------|-------------------|----------|-----------------|
| CPU (8 cores)       | 3137 | 1.0               | 433      | -               |
| CPU (8 cores) + GPU | 7337 | 2.3               | 585      | 2055            |

The measurements were performed on both CPU and GPU on a machine with 15 GB of RAM, an AMD Ryzen 9 7950X3D 16-Core CPU, and an NVIDIA RTX 4000 SFF Ada Generation GPU. A single instance of the Voice Activity Detection microservice, with parallelism enabled and using 8 cores, was run on this machine.

The FTRT (Faster Than Real Time) column indicates the speed-up factor relative to the duration of the audio. For example, an FTRT of 10 means that 10 seconds of audio are processed in 1 second of processing time. The relative speed-up column compares each run with the CPU-only baseline.
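
As a concrete illustration of the measured values, the sketch below estimates how long one hour of audio would take to process at the FTRT factors from the table above.

```python
# Worked example using the FTRT values measured above.
audio_seconds = 3600              # one hour of audio
ftrt_cpu_only = 3137              # CPU (8 cores)
ftrt_cpu_gpu = 7337               # CPU (8 cores) + GPU

print(f"CPU only:  {audio_seconds / ftrt_cpu_only:.2f} s")  # ~1.15 s
print(f"CPU + GPU: {audio_seconds / ftrt_cpu_gpu:.2f} s")   # ~0.49 s
```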