Model Release

Audio-Interaction Model Listens Nonstop and Decides Every 0.4 Seconds Whether to Speak or Stay Silent

The open-source 'Audio Interaction' model processes a continuous audio stream and decides every 0.4 seconds whether to speak or stay silent. It outperforms Gemini 3 Flash at proactive noise detection.


Researchers have developed a new open-source voice model called 'Audio Interaction' that listens to an audio stream continuously and decides every 0.4 seconds whether to speak or remain silent. The model combines dialog, translation, transcription, and sound recognition in a single system. It handles listening and speaking in parallel, which cuts response waiting times and helps it beat other models, including Gemini 3 Flash, at proactive noise detection. The system is designed to mimic how real listeners handle audio, so it can respond to everyday sounds and interactions as they happen.
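The fixed decision cadence can be illustrated with a minimal sketch. It is not the researchers' code: the 16 kHz sample rate, the function names, and the energy threshold that stands in for the model's speak-or-silent decision are all assumptions made for illustration. Only the 0.4-second interval comes from the reporting above.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLE_RATE = 16_000        # assumed sample rate in Hz; not stated in the article
DECISION_INTERVAL_S = 0.4   # decision cadence reported in the article
CHUNK_SIZE = round(SAMPLE_RATE * DECISION_INTERVAL_S)  # samples per decision step


def chunks(samples: Sequence[float], size: int = CHUNK_SIZE) -> Iterator[Sequence[float]]:
    """Yield consecutive fixed-size chunks; a trailing partial chunk is dropped."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def energy_policy(chunk: Sequence[float], threshold: float = 0.01) -> str:
    """Hypothetical stand-in for the model: 'speak' when mean energy exceeds a threshold."""
    energy = sum(x * x for x in chunk) / len(chunk)
    return "speak" if energy > threshold else "silent"


def decide_stream(
    samples: Sequence[float],
    policy: Callable[[Sequence[float]], str] = energy_policy,
) -> List[str]:
    """Return one speak/silent decision per 0.4-second chunk of the stream."""
    return [policy(chunk) for chunk in chunks(samples)]
```

Under these assumptions, each decision covers 6,400 samples. A real system would replace `energy_policy` with the model's own output at each step.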

The 'Audio Interaction' model was trained on StreamAudio-2M, a custom synthetic dataset of about 302,000 hours of audio. The researchers built the dataset in three stages: designing plausible settings, searching for matching audio clips, and generating missing sounds with audio models such as AudioX or ElevenLabs. The dataset contains 2.6 million units across seven skill areas and 28 subtasks.
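For a sense of scale, the two reported dataset figures imply an average length per unit. The short calculation below is a back-of-the-envelope estimate only: it assumes both figures describe the same dataset and that hours are spread evenly across units, which the reporting does not say.

```python
TOTAL_HOURS = 302_000      # reported dataset size in hours
TOTAL_UNITS = 2_600_000    # reported number of units

# Average duration of one unit, assuming an even split (an assumption, not a reported figure).
avg_seconds_per_unit = TOTAL_HOURS * 3600 / TOTAL_UNITS
avg_minutes_per_unit = avg_seconds_per_unit / 60

print(f"{avg_seconds_per_unit:.0f} s per unit, about {avg_minutes_per_unit:.1f} min")
```

That works out to roughly seven minutes of audio per unit on average.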

The researchers identified two recurring issues during training: the model forgot earlier content in long, noisy sequences, and it fired too often on irrelevant sounds. To address the first, they added questions that reference earlier parts of the audio stream. To address the second, they trained the model on large amounts of verified silence and background audio. On the ProactiveSound Bench, the model outperformed several existing models, including Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2. For real-time use, the system splits incoming audio processing from response generation so that both run in parallel.
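The split between audio intake and response generation can be pictured as a producer and a consumer joined by a queue. The sketch below is an illustration of that general pattern, not the researchers' architecture: the thread layout, the string "chunks", and every function name are hypothetical, and the speak-or-silent decision is reduced to a placeholder check.

```python
import queue
import threading
from typing import Iterable, List

_STOP = object()  # sentinel marking the end of the stream


def listener(chunks: Iterable[str], out: queue.Queue) -> None:
    """Consume incoming chunks and forward only those that warrant a reply."""
    for chunk in chunks:
        if chunk != "silence":  # placeholder for the model's speak/silent decision
            out.put(chunk)
    out.put(_STOP)


def speaker(inbox: queue.Queue, spoken: List[str]) -> None:
    """Generate replies on its own thread, independently of the listener."""
    while True:
        item = inbox.get()
        if item is _STOP:
            break
        spoken.append(f"reply to {item}")  # placeholder for response generation


def run_pipeline(chunks: Iterable[str]) -> List[str]:
    """Run listener and speaker concurrently and return the replies in order."""
    inbox: queue.Queue = queue.Queue()
    spoken: List[str] = []
    threads = [
        threading.Thread(target=listener, args=(chunks, inbox)),
        threading.Thread(target=speaker, args=(inbox, spoken)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return spoken
```

Because the listener never waits for a reply to finish, new audio keeps being processed while an earlier response is still being produced, which is the property the article describes.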

Source: The Decoder

Key points

The 'Audio Interaction' model listens to continuous audio streams and decides every 0.4 seconds whether to speak or stay silent.
The model processes listening and speaking in parallel, minimizing response waiting times and outperforming Gemini 3 Flash in proactive noise detection.
Trained on a 302,000-hour synthetic dataset, the model combines dialog, translation, transcription, and sound recognition in a single system.
The StreamAudio-2M dataset includes 2.6 million units and about 302,000 hours of audio across seven skill areas and 28 subtasks.
The model improves on previous systems by handling multiple tasks simultaneously and responding dynamically to everyday sounds.
The model was tested on the ProactiveSound Bench, where it outperformed Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2.
The researchers split audio processing from response generation to improve real-time efficiency and reduce latency.

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.
