ServiceNow-AI ran a benchmark to evaluate how voice agents handle code-switched speech in enterprise settings. The study focused on automatic speech recognition (ASR), the first step in a voice agent pipeline, because errors at this stage affect every downstream task. The benchmark tested models on four language pairs: Spanish-English, French-English, Canadian French-English, and German-English. In each pair the non-English language served as the matrix language, with English segments of varying length embedded in it. The dataset covered scenarios such as employee inquiries and IT support requests, reflecting real-world enterprise interactions. Performance was assessed with three metrics: Word Error Rate (WER), Semantic Word Error Rate (SWER), and Answer Error Rate (AER). The results highlight the importance of accurate transcription in preserving meaning and preventing downstream errors. Source: huggingface
The benchmark dataset was built from an internal corpus of IT support and HR interactions. Code-switched utterances were generated by combining parallel English and non-English text, then filtered to keep only natural-sounding code-switching. Each retained utterance had to be 12 to 40 words long, contain no entities such as emails or URLs, and include at least three switchable content words, so that the code-switching was meaningful. The final dataset contains 259 Spanish-English, 298 French-English, 188 Canadian French-English, and 173 German-English records.
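The filtering criteria above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the function name `keep_utterance`, the regular expressions, and the `switchable_words` set are all hypothetical, and the source does not say how switchable content words were identified or whether repeated words count more than once (distinct words are counted here).

```python
import re

# Hypothetical patterns: the source only states that utterances containing
# entities such as emails or URLs were excluded, not how they were detected.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def keep_utterance(text, switchable_words, min_words=12, max_words=40,
                   min_switchable=3):
    """Return True if an utterance passes the three stated filters.

    `switchable_words` is an assumed, caller-supplied set of lowercase
    content words that could plausibly be switched into English.
    """
    words = text.split()

    # Filter 1: length must fall within 12-40 words (inclusive).
    if not (min_words <= len(words) <= max_words):
        return False

    # Filter 2: no emails or URLs.
    if EMAIL_RE.search(text) or URL_RE.search(text):
        return False

    # Filter 3: at least three distinct switchable content words.
    tokens = {w.strip(".,;:!?¿¡\"'()").lower() for w in words}
    return len(tokens & switchable_words) >= min_switchable
```

For example, a 16-word Spanish utterance containing "laptop", "servidor", and "empresa" passes when all three appear in `switchable_words`, and fails once a URL is appended to it.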
The three metrics capture different kinds of failure: WER measures transcription accuracy, SWER assesses semantic errors, and AER evaluates whether transcription errors affect downstream tasks. For each utterance, three comprehension questions were generated to test whether critical details such as case numbers or dates were preserved. ElevenLabs Scribe V2 and AssemblyAI Universal 3-Pro led in transcription accuracy, with ElevenLabs holding a narrow lead in most language pairs. Google Gemini 3 Flash followed closely, trailing slightly in Canadian French-English. OpenAI Whisper Large V3 Turbo had the highest WER, reflecting its known limitations with code-switched audio.
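To make the WER metric concrete, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the lowercase-and-split normalization are assumptions for illustration; the source does not describe the benchmark's text normalization or how SWER and AER are computed, so those are not sketched here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Normalization here is only lowercasing and whitespace splitting,
    which is an assumption and not the benchmark's documented procedure.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hyp
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("el case number es 4521", "el case number es 4512"))  # 0.2
```

The example shows why WER alone can understate the damage: a single swapped token yields a WER of only 0.2, yet it changes the case number, which is the kind of critical detail the comprehension questions are meant to check.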