Microsoft Research has introduced Mirage, a video world model designed to keep scenes spatially consistent during long camera movements. Instead of building costly pixel-based 3D point clouds, the model stores internal image features in a spatial memory within its latent space. This makes generation faster and keeps spatial structures stable even when the camera moves extensively. Mirage was developed in collaboration with several universities and builds on Alibaba's open-source video model Wan2.2, extended with additional modules and fine-tuned using LoRA adapters.
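
To make the idea of a spatial memory of latent features more concrete, here is a minimal sketch. It is not Mirage's implementation: the report does not say how the memory is indexed, so the voxel grid, the running-mean update, and every name below (LatentSpatialMemory, voxel_size, write, read) are assumptions chosen purely for illustration.

```python
import numpy as np


class LatentSpatialMemory:
    """Toy voxel-keyed store of latent feature vectors.

    Hypothetical illustration only; not the Mirage architecture.
    """

    def __init__(self, voxel_size: float, feat_dim: int):
        self.voxel_size = voxel_size
        self.feat_dim = feat_dim
        self._sum = {}    # voxel key -> summed feature vector
        self._count = {}  # voxel key -> number of writes to that voxel

    def _keys(self, points):
        # Quantize 3D positions to integer voxel indices.
        idx = np.floor(np.asarray(points, dtype=float) / self.voxel_size)
        return [tuple(row) for row in idx.astype(int).tolist()]

    def write(self, points, feats):
        # Accumulate features so each voxel holds a running mean.
        feats = np.asarray(feats, dtype=float)
        for key, f in zip(self._keys(points), feats):
            if key not in self._sum:
                self._sum[key] = np.zeros(self.feat_dim)
                self._count[key] = 0
            self._sum[key] += f
            self._count[key] += 1

    def read(self, points):
        # Return the mean feature per query point plus a hit mask.
        keys = self._keys(points)
        out = np.zeros((len(keys), self.feat_dim))
        hit = np.zeros(len(keys), dtype=bool)
        for i, key in enumerate(keys):
            if key in self._sum:
                out[i] = self._sum[key] / self._count[key]
                hit[i] = True
        return out, hit

    def __len__(self):
        # Size depends on occupied voxels, not on the number of frames written.
        return len(self._sum)
```

In this toy version, revisiting an already-seen region updates existing entries instead of adding new ones, so the store grows with the area explored and not with the length of the video. Whether Mirage achieves its stable per-frame cost in a comparable way is not stated in the report.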

Mirage outperforms color-based rivals on the WorldScore benchmark, surpassing Spatia and leaving general video generators such as Wan2.1 and CogVideoX behind. It excels at preserving a scene's spatial structure and keeping surfaces consistent across frames. In closed-loop tests on the RealEstate10K dataset, where the camera circles back to its starting point, Mirage leads on two of three metrics; such loops are a demanding test of whether accuracy holds up over time. Efficiency is the model's strongest feature: per-frame compute cost stays stable after the first segment, resulting in up to 10.57x faster generation and 55x lower memory usage than color-based systems.
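
A closed-loop check can be sketched in a few lines. The report does not name the three metrics, so PSNR is used here only as a common image-similarity measure and is an assumption; the function names psnr and closed_loop_psnr are likewise hypothetical. The idea is that if the camera returns to its starting pose, the last generated frame should closely match the first.

```python
import numpy as np


def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def closed_loop_psnr(frames, max_val=1.0):
    """Compare the first and last frame of a trajectory.

    Assumes the last frame was generated from the same camera pose
    as the first (an assumption about the test setup).
    """
    return psnr(frames[0], frames[-1], max_val)
```

A higher value means the scene looks more alike before and after the loop; a model that drifts over a long trajectory would score lower on a check of this kind.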

The researchers acknowledge limitations: moving objects are excluded from long-term memory because their geometry is unreliable, and busy scenes benefit less from spatial memory than quiet interiors. The team names storing dynamic content as the next challenge. Video world models are a growing area of AI research, with companies such as Google exploring interactive environments and world models for real-time simulation.

Source: thedecoder