Alibaba has released Qwen3.7-Plus, a new AI model that integrates visual understanding with agent capabilities, allowing it to operate graphical user interfaces and applications independently. The model is designed to recognize real-world scenes, read screen content, and generate code from visual templates. It is available as a proprietary offering through Alibaba Cloud, with pricing significantly lower than its text-based counterpart, Qwen3.7-Max.

In testing, the system recreated desktop applications, performed cloud tasks, and independently programmed a complete app. A hybrid agent system built on Qwen3.7-Plus developed an English vocabulary learning app in a run that lasted over eleven hours and produced more than 10,000 lines of code across more than 1,000 agent calls. The process covered requirements documentation, automated code generation, installation, test case creation, and independent version management.

The model excels at operating graphical interfaces, outperforming competitors such as GPT-5.4 (xhigh), Opus 4.6 Max, and Gemini 3.1 Pro on the AndroidWorld and ScreenSpot Pro benchmarks. However, it falls short on pure logic benchmarks such as MedXpertQA-MM, where it trails Gemini 3.1 Pro and GPT-5.4. On the text side, its performance is described as on par with max-tier models without surpassing them across the board.

Source: The Decoder