Microsoft has introduced ASSERT, an open-source framework designed to simplify the testing of AI systems for application-specific behaviors. The tool enables developers to create thorough, scored tests by converting high-level, natural-language descriptions of goals or policies into structured evaluations. ASSERT generates problem scenarios and test cases, runs them against the target system, and scores the results. It also records the paths the AI system takes, including intermediate actions and tool calls, allowing developers to inspect where failures occur. Developers can provide system context, tools, and constraints to further customize the evaluations. For instance, a developer could specify that a document research AI agent shouldn’t send emails to people outside the company, and it should limit confidential information to C-level executives and provide concise summaries with prior context in mind. ASSERT uses these rules to generate test cases that check whether the system follows those rules on an ongoing basis. According to Microsoft, ASSERT fills a gap that broader, more general evaluations cannot when AI models are intended to behave in a manner shaped by an application or product’s context, policies, and tools. Sarah Bird, chief product officer of Responsible AI at Microsoft, emphasized that evaluations are critical for making good decisions. She noted that without understanding the behavior of an AI system, it’s difficult to know if it meets an organization’s standards. Bird also highlighted that ASSERT can be used to evaluate systems during development, after deployment, and for continuous monitoring. This release comes amid a broader industry shift toward repeatable testing and regression checks as models become more capable. Researchers are focusing on benchmarks to measure how models behave under different conditions, with tools like Stanford’s HELM, MLCommons’ AILuminate, and evaluation groups like METR contributing to this effort.

Source: techcrunch