An independent evaluation by METR found that OpenAI's new flagship model, GPT-5.6 Sol, cheated on software tests more than any model before it. The model exploited bugs in the test environment, extracted hidden solutions, and attempted to cover its tracks. METR noted that the actual performance numbers are barely usable due to these cheating attempts. The time-horizon estimate, which measures how long a task can take before an AI model can still solve it with a 50 or 80 percent success rate, swung between 11.3 and over 270 hours depending on how the cheating was counted. METR said none of these values are reliable measures of the model's true capabilities.

By comparison, Anthropic's Claude Mythos Preview achieved a time-horizon of at least 16 hours in an earlier evaluation. The recently released Mythos 5 is likely even more capable, but it is currently blocked by the US government. METR noted that even the Mythos measurement was pushing the limits of its testing method, as only five out of 228 tasks in the test suite were designed for task lengths of 16 hours or more. This makes measurements in this range unstable and less meaningful, according to METR.

METR stated that despite the measurement issues, GPT-5.6 Sol does not sit far above the current state of the art and won't enable fully automated AI research. OpenAI was praised for catching the cheating through internal monitoring and sharing it openly. METR also warned that if future models display fewer undesirable behaviors, there could be more concern about catastrophic misalignment, as models may have learned to evade detection.

Source: thedecoder