OpenAI Shares Framework for Trustworthy Third-Party Model Evaluations
OpenAI outlines best practices for evaluating frontier models, emphasizing the importance of harness design and validation methods. The 2026 report highlights how the evaluation setup itself can significantly change measured results.
OpenAI has released a report detailing a framework for conducting trustworthy third-party evaluations of frontier models, emphasizing the critical role of harness design in shaping outcomes. The report highlights that modern models, capable of using tools and maintaining state across steps, require evaluations that account for the environment in which tasks are performed. According to the report, evaluations must explicitly describe the claim being tested and the evidence supporting the validity of results. *Source: [openai](https://openai.com/index/trustworthy-third-party-evaluations-foundations/)*
The report identifies three primary categories of claims that evaluations may aim to support: capability elicitation, safeguard performance, and comparative analysis between models. It stresses that the harness—defined as the setup that facilitates a model’s actions—plays a crucial role in determining the observed performance. For example, a harness that preserves state and retries failed actions can allow a model to complete multi-step tasks that it otherwise cannot. The report also warns that standardized harnesses may understate a model’s capabilities if they omit features that aid task performance. *Source: [openai](https://openai.com/index/trustworthy-third-party-evaluations-foundations/)*
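The effect of state preservation and retries can be illustrated with a toy example. The sketch below is not from the report: `run_task`, `fetch`, and `double` are hypothetical names, and the "model" is reduced to two hard-coded steps, one that fails on its first attempt and one that depends on the earlier result. It only shows how the same task can pass or fail depending on harness settings.

```python
from typing import Callable, Dict, List

# A step receives the shared state and the attempt number, and reports success.
Step = Callable[[Dict[str, int], int], bool]


def run_task(steps: List[Step], preserve_state: bool, max_retries: int) -> bool:
    """Return True if every step succeeds under the given harness settings."""
    state: Dict[str, int] = {}
    for step in steps:
        if not preserve_state:
            state = {}  # a stateless harness forgets earlier results
        succeeded = False
        for attempt in range(max_retries + 1):
            if step(state, attempt):
                succeeded = True
                break
        if not succeeded:
            return False
    return True


def fetch(state: Dict[str, int], attempt: int) -> bool:
    # Toy flaky tool call (assumption): fails once, then succeeds on retry.
    if attempt == 0:
        return False
    state["value"] = 21
    return True


def double(state: Dict[str, int], attempt: int) -> bool:
    # Depends on the value stored by the previous step.
    if "value" not in state:
        return False
    state["answer"] = state["value"] * 2
    return True


task = [fetch, double]
results = {
    "state_and_retries": run_task(task, preserve_state=True, max_retries=1),
    "no_retries": run_task(task, preserve_state=True, max_retries=0),
    "no_state": run_task(task, preserve_state=False, max_retries=1),
}
print(results)
```

In this toy setup only the harness with both features completes the task, which mirrors the concern that a stripped-down harness can understate what a model is able to do.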
OpenAI further notes that evaluation results can be influenced by factors such as reward hacking, refusals, contamination, and broken problems. It recommends that evaluators provide detailed explanations of how these potential issues were addressed. The report also highlights that increasing test-time compute can significantly affect the capability an evaluation elicits, especially in domains where success is easily verifiable. *Source: [openai](https://openai.com/index/trustworthy-third-party-evaluations-foundations/)*
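One simple way to see why test-time compute matters in verifiable domains is to look at repeated sampling. The sketch below is an illustration, not a method taken from the report: it assumes each attempt succeeds independently with the same probability `p`, and that a perfect verifier recognizes any correct attempt. The function name `pass_at_k` and the numbers used are hypothetical.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds.

    Assumes every attempt has the same success probability p and that a
    verifier reliably identifies a correct attempt.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p) ** k


# Illustrative numbers only: a 10% single-attempt success rate.
for k in (1, 10, 100):
    print(k, round(pass_at_k(0.10, k), 4))
```

Under these assumptions, a model that solves a task 10% of the time on a single attempt solves it roughly 65% of the time given ten attempts, so an evaluation that reports only single-attempt results and one that allows many verified attempts can support very different capability claims.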
Key points
- OpenAI outlines best practices for evaluating frontier models, emphasizing the importance of harness design and validation methods.
- Evaluations must explicitly describe the claim being tested and the evidence supporting the validity of results.
- The harness—defined as the setup that facilitates a model’s actions—plays a crucial role in determining the observed performance.
- A harness that preserves state and retries failed actions can allow a model to complete multi-step tasks that it otherwise cannot.
- Standardized harnesses may understate a model’s capabilities if they omit features that aid task performance.
- Increasing test-time compute can significantly affect the capability an evaluation elicits, especially in domains where success is easily verifiable.