Model Release

EVA-Bench Expands to Three Domains with 213 Scenarios

ServiceNow-AI's EVA-Bench 2.0 now spans three domains, with 213 scenarios across 121 tools, a roughly 4x increase in scenario coverage.


ServiceNow-AI has released an updated version of its EVA-Bench benchmark, expanding its scope to three enterprise domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery. The release includes 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage over the original release. Every scenario was validated against three frontier models so that it is both fair and challenging. All datasets are open-source and available for download. Source: Hugging Face

The datasets are designed for multiple audiences, including developers evaluating voice agents and researchers building their own evaluation datasets. The post outlines the end-to-end generation and validation process and details how each domain, including the two new additions, was designed. It also previews an upcoming multilingual extension, which will broaden the benchmark's reach beyond English-only deployments. The release also sets out the data design principles behind the datasets, which are intended to keep them realistic, varied, and reproducible.

The EVA-Bench datasets are built on five key principles: voice-first scope, realism, variety, authentication, and reproducibility. These principles are meant to keep the scenarios grounded in real-world call patterns, reflective of actual enterprise constraints, and fair as a test of models. The scenarios cover single-intent calls, multi-intent calls, and adversarial calls, and in some cases the user's goal cannot be satisfied.
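For readers who want to picture how such a scenario mix could be organized, here is a minimal Python sketch. It is an illustration under stated assumptions, not the EVA-Bench schema: the `Scenario` record, its field names, the `summarize` helper, and the three sample rows are all hypothetical. Only the domain names and the three call types come from the release.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CallType(Enum):
    """The three call types named in the release."""
    SINGLE_INTENT = "single-intent"
    MULTI_INTENT = "multi-intent"
    ADVERSARIAL = "adversarial"


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record; field names are illustrative, not the real schema."""
    scenario_id: str
    domain: str
    call_type: CallType
    goal_satisfiable: bool = True  # False when the caller's goal cannot be met


def summarize(scenarios):
    """Count scenarios per (domain, call type) and list unsatisfiable ones."""
    counts = Counter((s.domain, s.call_type.value) for s in scenarios)
    unsatisfiable = [s.scenario_id for s in scenarios if not s.goal_satisfiable]
    return counts, unsatisfiable


# Made-up sample rows, for illustration only.
SAMPLE = [
    Scenario("air-001", "Airline Customer Service Management", CallType.SINGLE_INTENT),
    Scenario("itsm-001", "Enterprise IT Service Management", CallType.MULTI_INTENT),
    Scenario("hr-001", "Healthcare HR Service Delivery", CallType.ADVERSARIAL,
             goal_satisfiable=False),
]

if __name__ == "__main__":
    counts, unsatisfiable = summarize(SAMPLE)
    for (domain, call_type), n in sorted(counts.items()):
        print(f"{domain} / {call_type}: {n}")
    print("Unsatisfiable goals:", unsatisfiable)
```

A tally like this is one way to check the "variety" principle at a glance, since it shows whether each domain has coverage across every call type. The published datasets may expose different fields entirely.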


Key points

ServiceNow-AI's EVA-Bench 2.0 expands to three domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery.
The new release includes 213 evaluation scenarios across 121 tools, representing a roughly 4x increase in scenario coverage.
Every scenario was validated against three frontier models: OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6.
All datasets are open-source and available for download through Hugging Face.
The datasets are designed with five principles: voice-first scope, realism, variety, authentication, and reproducibility.
Scenarios include single-intent calls, multi-intent calls, and adversarial calls, with some involving unsatisfiable user goals.

Source: Hugging Face

WRITTEN BY

Alex Lindgren

LLMs & Frontier Models

Alex covers large language models and their impact on society.
