Researchers have developed an open-source voice model called 'Audio Interaction' that listens to an audio stream continuously and decides every 0.4 seconds whether to speak or stay silent. The model combines dialog, translation, transcription, and sound recognition in a single system, which improves response efficiency and reduces waiting times. It processes listening and speaking in parallel, which helps it outperform other models such as Gemini 3 Flash at proactive sound detection. The design is meant to mimic how a human listener handles audio: the system reacts to everyday sounds and interactions as they happen.
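The fixed decision interval can be pictured as a loop over 0.4-second chunks. The following is a minimal sketch, not the researchers' implementation: the function names, the 16 kHz sample rate, and the energy threshold that stands in for the model's speak-or-stay-silent decision are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

CHUNK_SECONDS = 0.4  # decision interval reported in the article


@dataclass
class Decision:
    time_s: float  # end time of the chunk that triggered this decision
    speak: bool    # True = respond, False = stay silent


def split_into_chunks(samples: Sequence[float], sample_rate: int) -> List[Sequence[float]]:
    """Cut a sample buffer into consecutive 0.4-second chunks (last one may be shorter)."""
    chunk_len = int(round(CHUNK_SECONDS * sample_rate))
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


def energy_policy(chunk: Sequence[float], threshold: float = 0.1) -> bool:
    """Hypothetical stand-in for the model: speak when mean absolute amplitude is high."""
    return sum(abs(x) for x in chunk) / max(len(chunk), 1) > threshold


def run_decision_loop(
    samples: Sequence[float],
    sample_rate: int,
    should_speak: Callable[[Sequence[float]], bool],
) -> List[Decision]:
    """Make exactly one speak/silent decision per chunk."""
    decisions = []
    for index, chunk in enumerate(split_into_chunks(samples, sample_rate)):
        decisions.append(Decision(time_s=(index + 1) * CHUNK_SECONDS, speak=should_speak(chunk)))
    return decisions


if __name__ == "__main__":
    # Assumed 16 kHz input: 0.4 s of silence, 0.4 s of loud signal, 0.4 s of silence.
    audio = [0.0] * 6400 + [0.5] * 6400 + [0.0] * 6400
    for decision in run_decision_loop(audio, 16000, energy_policy):
        print(f"{decision.time_s:.1f}s speak={decision.speak}")
```

In the real system the policy would be the model itself; the sketch only shows that a decision is produced once per chunk, whether or not anything is said.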
The 'Audio Interaction' model was trained on a custom synthetic dataset of 302,000 hours of audio, which prepares it for complex audio scenarios. The researchers built the StreamAudio-2M dataset in three stages: designing plausible settings, searching for audio clips that match them, and generating any missing sounds with audio models such as AudioX or ElevenLabs. The dataset contains 2.6 million units across seven skill areas and 28 subtasks.
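The three-stage construction can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' pipeline: `build_unit`, the dictionary-based clip library, and the `synthesize` callback are hypothetical, and no real AudioX or ElevenLabs API is called.

```python
from typing import Callable, Dict, List


def build_unit(
    scenario: List[str],
    library: Dict[str, str],
    synthesize: Callable[[str], str],
) -> List[dict]:
    """Assemble one training unit from a designed scenario.

    scenario   -- ordered sound labels for a plausible setting (stage 1)
    library    -- hypothetical index mapping a label to an existing clip id (stage 2)
    synthesize -- hypothetical fallback that generates a clip for a missing label (stage 3)
    """
    unit = []
    for label in scenario:
        clip = library.get(label)
        source = "retrieved"
        if clip is None:
            # No matching clip was found, so fall back to generation.
            clip = synthesize(label)
            source = "generated"
        unit.append({"label": label, "clip": clip, "source": source})
    return unit


if __name__ == "__main__":
    clip_library = {"rain": "clip_001"}
    for item in build_unit(["rain", "doorbell"], clip_library, lambda label: f"synthetic_{label}"):
        print(item)
```

Recording where each clip came from, as the `source` field does here, is a design choice of the sketch; the article does not say whether the dataset tracks this.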
The researchers identified two recurring problems during training: the model forgot earlier content in long, noisy sequences, and it fired too often on irrelevant sounds. To counter the first, they added questions that refer back to earlier parts of the audio stream; to counter the second, they trained the model on large amounts of verified silence and background audio. On the ProactiveSound Bench, the model outperformed several existing models, including Gemini 3 Flash, Kimi-Audio-Instruct, and Step-Audio 2. In real-time use, the system separates incoming audio processing from response generation so that both run in parallel, which improves overall efficiency.
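One common way to decouple listening from response generation is a producer-consumer pair connected by a queue. The sketch below uses that pattern with Python threads; it is an assumption about the general shape of such a split, not the architecture described by the researchers, and the string chunks and the "silence" trigger rule are placeholders.

```python
import queue
import threading
from typing import List

STOP = object()  # sentinel marking the end of the stream


def listener(chunks: List[str], inbox: "queue.Queue") -> None:
    """Producer: push chunks as they arrive, never waiting for a reply."""
    for chunk in chunks:
        inbox.put(chunk)
    inbox.put(STOP)


def responder(inbox: "queue.Queue", replies: List[str]) -> None:
    """Consumer: generate replies while the listener keeps feeding the queue."""
    while True:
        chunk = inbox.get()
        if chunk is STOP:
            break
        if chunk != "silence":  # hypothetical trigger rule standing in for the model
            replies.append(f"heard {chunk}")


def run_pipeline(chunks: List[str]) -> List[str]:
    """Run listening and responding on separate threads and collect the replies."""
    inbox: "queue.Queue" = queue.Queue()
    replies: List[str] = []
    listen_thread = threading.Thread(target=listener, args=(chunks, inbox))
    respond_thread = threading.Thread(target=responder, args=(inbox, replies))
    listen_thread.start()
    respond_thread.start()
    listen_thread.join()
    respond_thread.join()
    return replies


if __name__ == "__main__":
    print(run_pipeline(["silence", "doorbell", "silence", "speech"]))
```

Because the listener never blocks on the responder, a slow reply does not stall the intake of new audio, which is the property the parallel design is after.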
Source: thedecoder