ServiceNow-AI has released an updated version of its EVA-Bench benchmark, expanding it to three enterprise domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery. The new release includes 213 evaluation scenarios across 121 tools, roughly a 4x increase in scenario coverage over the original release. Every scenario was validated against three frontier models to check that it is both fair and sufficiently challenging. All datasets are open-source and available for download. Source: huggingface

The datasets are designed for multiple audiences, including developers evaluating voice agents and researchers building their own evaluation datasets. The post outlines the end-to-end generation and validation process and details how each domain was designed, including the two new additions. It also previews an upcoming multilingual extension, which will broaden the benchmark's reach beyond English-only deployments. The release also sets out the data design principles that guide the creation of the datasets, which are meant to keep them realistic, varied, and reproducible.

The EVA-Bench datasets are built on five key principles: voice-first scope, realism, variety, authentication, and reproducibility. These principles are meant to ensure that the scenarios are grounded in real-world call patterns, reflect actual enterprise constraints, and evaluate models fairly. The scenarios cover single-intent, multi-intent, and adversarial calls, and some involve user goals that cannot be satisfied.
