Model Release
NVIDIA Unveils Nemotron 3 Nano Omni Multimodal AI Model
NVIDIA's Nemotron 3 Nano Omni model topped six leaderboards for complex document intelligence and video/audio understanding, offering faster and more efficient multimodal AI capabilities.
NVIDIA has introduced Nemotron 3 Nano Omni, an open multimodal AI model designed to streamline tasks that combine vision, speech, and language processing. The model lets AI agents respond faster and reason across video, audio, image, and text. According to NVIDIA, it sets a new efficiency frontier for open multimodal models, pairing leading accuracy with low cost. It topped six leaderboards for complex document intelligence, video, and audio understanding, the company said in its blog post.

Several AI and software companies are already adopting the model, including Aible, Applied Scientific Intelligence (ASI), Eka Care, Foxconn, H Company, Palantir, and Pyler, and additional organizations are evaluating it.

"To build useful agents, you can't wait seconds for a model to interpret a screen," said Gautier Cloix, CEO of H Company. "By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings, something that wasn't practical before. This isn't just a speed boost: It's a fundamental shift in how our agents perceive and interact with digital environments in real time."

The model combines vision and audio encoders within its 30B-A3B hybrid mixture-of-experts architecture, eliminating the need for separate perception models. According to NVIDIA, this design improves inference efficiency at scale and delivers 9x higher throughput than other open omni models at the same level of interactivity.

*Source: [NVIDIA](https://blogs.nvidia.com/blog/nemotron-3-nano-omni-multimodal-ai-agents/)*
Key points
- NVIDIA's Nemotron 3 Nano Omni topped six leaderboards for complex document intelligence, video, and audio understanding.
- The model enables AI agents to deliver faster, smarter responses with advanced reasoning across video, audio, image, and text.
- NVIDIA's Nemotron 3 Nano Omni sets a new efficiency frontier for open multimodal models with leading accuracy and low cost.
- The model combines vision and audio encoders within its 30B-A3B, hybrid mixture-of-experts architecture.
- NVIDIA's Nemotron 3 Nano Omni can work alongside proprietary cloud models or other NVIDIA Nemotron open models.
- H Company's latest computer-use agent, powered by Nemotron 3 Nano Omni, takes input at a native resolution of 1920×1080 pixels for high-fidelity visual reasoning.
- NVIDIA's Nemotron 3 Nano Omni is released with open weights, datasets, and training techniques, giving organizations full transparency and control over customization and deployment.
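
For readers unfamiliar with the "30B-A3B" label, the short sketch below works through the arithmetic it implies. It assumes the common mixture-of-experts naming convention, in which the first figure is the total parameter count (about 30 billion) and "A3B" is the number of parameters active for each token (about 3 billion). The article does not spell this out, so the figures and the `active_fraction` helper are illustrative rather than NVIDIA's published specification.

```python
# Assumption (not stated in the article): "30B-A3B" follows the usual
# mixture-of-experts naming convention, i.e. roughly 30 billion total
# parameters, of which roughly 3 billion are active for any given token.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9


def active_fraction(total: float, active: float) -> float:
    """Return the share of parameters used per token in a sparse MoE model."""
    if total <= 0 or not 0 <= active <= total:
        raise ValueError("expected total > 0 and 0 <= active <= total")
    return active / total


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active per token: {fraction:.0%} of all parameters")
```

Under that assumption, only about a tenth of the model's weights take part in processing each token, which is the general reason sparse mixture-of-experts models can be cheaper to run than dense models of the same total size.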