OpenAI Shares Framework for Trustworthy Third-Party Model Evaluations

OpenAI outlines best practices for evaluating frontier models, emphasizing the importance of harness design and validation methods. A 2026 report highlights how evaluation setups can significantly impact results.

OpenAI has released a report detailing a framework for conducting trustworthy third-party evaluations of frontier models, emphasizing how strongly harness design shapes outcomes. The report notes that modern models can use tools and maintain state across steps, so an evaluation has to account for the environment in which tasks are performed. According to the report, evaluations must explicitly state the claim being tested and the evidence that supports the validity of the results. (Source: OpenAI)

The report identifies three primary categories of claims that evaluations may aim to support: capability elicitation, safeguard performance, and comparative analysis between models. It stresses that the harness, defined as the setup that facilitates a model’s actions, plays a crucial role in determining observed performance. For example, a harness that preserves state and retries failed actions can allow a model to complete multi-step tasks that it could not otherwise complete. The report also warns that standardized harnesses may understate a model’s capabilities if they omit features that aid task performance. (Source: OpenAI)
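The effect of state preservation and retries can be illustrated with a toy sketch. Everything in it is hypothetical: `Harness`, `make_step`, and `run_task` are illustrative names rather than anything taken from OpenAI's report, and the "model" is a fixed script of three tool calls, not a real agent. Each call fails once with a transient error and needs the result of the previous call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Step = Callable[[Dict[str, int]], int]


@dataclass
class Harness:
    """Hypothetical harness that executes a scripted sequence of tool calls."""
    max_retries: int = 0          # extra attempts allowed after a failed action
    preserve_state: bool = False  # keep results of earlier steps between actions
    state: Dict[str, int] = field(default_factory=dict)

    def run(self, steps: List[Step]) -> bool:
        for i, step in enumerate(steps):
            if not self.preserve_state:
                self.state = {}  # a stateless harness forgets earlier results
            for _ in range(self.max_retries + 1):
                try:
                    self.state[f"step{i}"] = step(self.state)
                    break  # action succeeded, move to the next step
                except RuntimeError:
                    continue  # action failed, retry if attempts remain
            else:
                return False  # every attempt at this step failed
        return True


def make_step(index: int) -> Step:
    """Build a toy tool call that fails on its first invocation (assumed
    transient error) and depends on the previous step's stored result."""
    calls = {"n": 0}

    def step(state: Dict[str, int]) -> int:
        calls["n"] += 1
        if index > 0 and f"step{index - 1}" not in state:
            raise RuntimeError("missing result from previous step")
        if calls["n"] == 1:
            raise RuntimeError("transient tool failure")
        return index

    return step


def run_task(harness: Harness) -> bool:
    """Run the same three-step task under the given harness configuration."""
    return harness.run([make_step(i) for i in range(3)])


if __name__ == "__main__":
    print("state + retries:", run_task(Harness(max_retries=1, preserve_state=True)))
    print("retries only:   ", run_task(Harness(max_retries=1)))
    print("state only:     ", run_task(Harness(preserve_state=True)))
    print("neither:        ", run_task(Harness()))
```

In this sketch the scripted agent is identical in all four runs, and only the configuration with both features completes the task. That is the sense in which a measured result describes the model and harness together, not the model alone.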

OpenAI further notes that evaluation results can be distorted by factors such as reward hacking, refusals, contamination, and broken problems. It recommends that evaluators explain in detail how they addressed each of these potential issues. The report also highlights that increasing test-time compute can significantly change the capability an evaluation elicits, especially in domains where success is easily verifiable. (Source: OpenAI)
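One simple way to see why verifiable domains are sensitive to test-time compute is repeated sampling with a checker. The sketch below is an illustration under stated assumptions, not a method from the report: attempts are treated as independent, the verifier is assumed to be perfect, and the 5% single-attempt success rate is an invented number. The function name `best_of_n_success` is hypothetical.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Probability that at least one of n attempts succeeds.

    Assumes independent attempts, each succeeding with probability
    p_single, and a verifier that always recognizes a correct attempt.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    # Illustrative single-attempt success rate; not a figure from the report.
    p = 0.05
    for n in (1, 4, 16, 64):
        print(f"attempts={n:>2}  success probability={best_of_n_success(p, n):.3f}")
```

Under these assumptions the measured success rate rises steeply with the number of attempts, so two evaluations of the same model can report very different capabilities unless the sampling budget is stated.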

Key points

OpenAI outlines best practices for evaluating frontier models, emphasizing the importance of harness design and validation methods.
Evaluations must explicitly describe the claim being tested and the evidence supporting the validity of results.
The harness, defined as the setup that facilitates a model’s actions, plays a crucial role in determining observed performance.
A harness that preserves state and retries failed actions can allow a model to complete multi-step tasks that it could not otherwise complete.
Standardized harnesses may understate a model’s capabilities if they omit features that aid task performance.
Increasing test-time compute can significantly affect the capability an evaluation elicits, especially in domains where success is easily verifiable.

Source: OpenAI

WRITTEN BY

Nadia Rahman

AI Safety, Alignment & Policy

Nadia follows AI safety, alignment, regulation, and the policy debates shaping the field.
