NVIDIA has released Cosmos 3, its first open omni-model for physical AI, now available on Hugging Face. This model combines world generation, physical reasoning, and action generation into a single unified architecture, eliminating the need for multiple models and inference pipelines. The release includes two versions of the model: Cosmos 3 Nano and Cosmos 3 Super, each optimized for different deployment scenarios.
Cosmos 3 is built on a Mixture-of-Transformers (MoT) architecture that processes text, image, video, audio, and action modalities within a single framework. Each modality has a dedicated encoder whose output is projected into a shared representation space. The model splits each input sequence into autoregressive subsequences, used for reasoning, and diffusion subsequences, used for generation. This split lets the same model act as a vision-language model, a video generator, a forward or inverse dynamics model, or a robot policy without architectural changes.
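The sequence split described above can be illustrated with a minimal sketch. This is not NVIDIA's implementation: the `Segment` type, the `role` field, and the `split_sequence` function are hypothetical names introduced here, and the only behavior taken from the release is that reasoning spans are handled autoregressively and generation spans by diffusion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One contiguous span of an input sequence (hypothetical structure)."""
    modality: str      # e.g. "text", "image", "video", "audio", "action"
    role: str          # "reasoning" or "generation" (assumed labels)
    tokens: List[int]  # placeholder token ids in the shared representation space


def split_sequence(
    segments: List[Segment],
) -> Tuple[List[Tuple[int, Segment]], List[Tuple[int, Segment]]]:
    """Partition segments into autoregressive and diffusion subsequences.

    Each segment keeps its original position so the two subsequences
    can be interleaved again after processing.
    """
    autoregressive, diffusion = [], []
    for position, segment in enumerate(segments):
        if segment.role == "reasoning":
            autoregressive.append((position, segment))
        elif segment.role == "generation":
            diffusion.append((position, segment))
        else:
            raise ValueError(f"unknown role: {segment.role!r}")
    return autoregressive, diffusion


# Example: a text prompt, a video span to generate, and a follow-up text span.
example = [
    Segment("text", "reasoning", [1, 2, 3]),
    Segment("video", "generation", [7, 8]),
    Segment("text", "reasoning", [4]),
]
ar_part, diff_part = split_sequence(example)
```

In this sketch the choice of task (captioning, video generation, policy rollout) changes only how segments are labeled, not the code path, which mirrors the claim that no architectural change is needed.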
Cosmos 3 Nano is a 16B-parameter model optimized for efficient inference and runs on workstation-grade GPUs. Cosmos 3 Super is a 64B-parameter model designed for large-scale synthetic data generation and research and runs on NVIDIA Hopper and Blackwell GPUs.
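A rough back-of-the-envelope calculation helps explain the different hardware targets. The sketch below estimates weight storage only, assuming 2 bytes per parameter (16-bit weights) or 1 byte per parameter (8-bit weights). The precision the models actually ship in is not stated in the release, and activations and caches are ignored, so these figures are a lower bound under those assumptions.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10^9 bytes).

    params_billion * 1e9 parameters * bytes_per_param / 1e9 bytes per GB
    simplifies to params_billion * bytes_per_param.
    """
    return params_billion * bytes_per_param


# Parameter counts come from the release; the byte widths are assumptions.
for name, size_b in [("Cosmos 3 Nano", 16), ("Cosmos 3 Super", 64)]:
    for label, width in [("16-bit", 2), ("8-bit", 1)]:
        print(f"{name}, {label} weights: ~{weight_memory_gb(size_b, width):.0f} GB")
```

Under the 16-bit assumption the weights alone come to about 32 GB for Nano and about 128 GB for Super, which is consistent with Nano targeting a single workstation GPU and Super targeting data-center hardware.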
Source: Hugging Face