ServiceNow-AI Tests ASR Models on Code-Switching

ServiceNow-AI benchmarked seven ASR systems on code-switched speech and found that ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro performed best on the Spanish-English and French-English pairs.

ServiceNow-AI built a benchmark to evaluate how voice agents handle code-switched speech, where a speaker mixes two languages within one utterance, in enterprise settings. The study focused on automatic speech recognition (ASR) because it is the first step in a voice agent pipeline, so its errors carry over into downstream tasks. The benchmark covered four language pairs: Spanish-English, French-English, Canadian French-English, and German-English. In each pair the non-English language served as the matrix language, with English segments of varying length embedded in it. The utterances covered scenarios such as employee inquiries and IT support requests, reflecting real-world enterprise interactions. Performance was scored with three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). The results indicate that accurate transcription is needed to preserve meaning and prevent downstream errors. Source: Hugging Face

The benchmark dataset was built from an internal corpus of IT support and HR interactions. Code-switched utterances were generated by combining parallel English and non-English text, then filtered to keep only natural code-switching. Each utterance had to be 12 to 40 words long, contain no entities such as emails or URLs, and include at least three switchable content words so that the code-switching was meaningful. The final dataset contains 259 Spanish-English, 298 French-English, 188 Canadian French-English, and 173 German-English records, 918 in total. Source: Hugging Face
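The three filtering rules can be sketched in a few lines of Python. This is a minimal illustration, not ServiceNow-AI's actual pipeline: the regular expressions, the function name keep_utterance, and the idea of passing in a ready-made set of switchable content words are all assumptions, since the article does not say how entities or switchable words were detected.

```python
import re

# Hypothetical patterns for the entity types the article says were excluded.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

MIN_WORDS, MAX_WORDS = 12, 40   # length limits reported in the article
MIN_SWITCHABLE = 3              # minimum switchable content words


def keep_utterance(text: str, switchable_words: set[str]) -> bool:
    """Return True if an utterance passes the length, entity, and switchability checks."""
    words = text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    if EMAIL_RE.search(text) or URL_RE.search(text):
        return False
    # Assumption: the caller supplies the switchable content words as a lowercase set.
    tokens = [w.strip(".,;:!?¿¡\"'()").lower() for w in words]
    n_switchable = sum(1 for t in tokens if t in switchable_words)
    return n_switchable >= MIN_SWITCHABLE


sample = ("Necesito ayuda porque mi laptop no se conecta al network "
          "después de instalar el último update de seguridad")
print(keep_utterance(sample, {"laptop", "network", "update"}))
```

The sample sentence has 18 words, no email or URL, and three words from the supplied set, so it passes all three checks.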

The evaluation used three metrics. WER measures word-level transcription errors, SWER assesses semantic errors, and AER evaluates whether transcription errors affect downstream tasks. For each utterance, three comprehension questions were generated to test whether critical details such as case numbers or dates were preserved. ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro led in transcription accuracy, with ElevenLabs holding a narrow lead in most language pairs. Google Gemini 3 Flash followed closely, trailing slightly in Canadian French-English. OpenAI Whisper Large V3 Turbo had the highest WER, reflecting its known limitations with code-switched audio. Source: Hugging Face
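To make the metrics concrete, here is a minimal sketch of two of them. WER is the standard word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The AER function is an assumed definition, the share of comprehension questions answered incorrectly from the transcript, because the article does not give the exact formula. SWER is left out because its definition is not described. Both function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def answer_error_rate(correct_flags: list[bool]) -> float:
    """Assumed definition: fraction of comprehension questions answered incorrectly."""
    if not correct_flags:
        raise ValueError("need at least one question")
    return sum(1 for ok in correct_flags if not ok) / len(correct_flags)


reference = "mi case number es 4521 y no funciona"
hypothesis = "mi case number es 4512 y no funciona"
print(word_error_rate(reference, hypothesis))        # 0.125
print(answer_error_rate([True, False, True]))
```

In this made-up example a single wrong token yields a WER of only 0.125, yet the wrong token is the case number, so a question about it would be answered incorrectly. That gap is the reason a downstream-oriented metric is reported alongside WER.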

Key points

ServiceNow-AI benchmarked seven ASR systems on code-switched speech.
ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro performed best in Spanish-English and French-English pairs.
The benchmark tested models across four language pairs: Spanish-English, French-English, Canadian French-English, and German-English.
The dataset includes 259 Spanish-English, 298 French-English, 188 Canadian French-English, and 173 German-English records.
Three metrics were used: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER).
ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro led in transcription accuracy, with ElevenLabs taking a narrow lead in most language pairs.
Google Gemini 3 Flash followed closely, trailing slightly in Canadian French-English.

Source: Hugging Face

WRITTEN BY

Maya Chen