Amazon has introduced an open-source testing framework for voice agents, aimed at the difficulty of evaluating and iterating on voice-based customer service systems. The framework, called the Nova Sonic Test Harness, runs full multi-turn conversations automatically, evaluates the results, and lets developers refine system prompts and tool configurations against those results. It is meant to replace manual testing, which is often slow and inconsistent, so that teams can evaluate far more conversations than they could by hand and spend their time on user experience rather than manual QA.

The harness scores conversations on criteria such as goal achievement, response accuracy, and tool usage, and it is built to work with Amazon Bedrock models and other AWS services. Scoring is done by an LLM judge that evaluates each conversation against predefined rubrics, so that every run is graded against the same standard. The harness can also drive tests with synthetic audio, which exercises the full speech-to-speech pipeline and gives a more realistic simulation of real-world interactions. It handles complications such as audio-text divergence and session reconnection, and it uses session continuation and history replay so that long conversations can be tested without interruption.

Test scenarios are defined in JSON files, which can include details such as user personas, tool configurations, and evaluation criteria. Developers therefore specify goals and evaluation standards rather than expected outputs, which matters because voice agent responses are non-deterministic. The framework is part of Amazon's broader effort to improve the development and evaluation of voice agents, which are increasingly used for tasks such as appointment booking, order inquiries, and account management. Overall, the Nova Sonic Test Harness offers a scalable, automated approach to voice agent testing that reduces the time and effort required for quality assurance.
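
To make the idea of a JSON-defined scenario concrete, here is a minimal sketch in Python. The field names (`persona`, `goal`, `system_prompt`, `tools`, `rubrics`, `max_turns`) and the `load_scenario` helper are hypothetical, chosen for illustration only; the harness's actual schema may differ.

```python
import json

# Hypothetical scenario file contents. The keys are assumptions made for
# this sketch, not the harness's documented schema.
SCENARIO_JSON = """
{
  "name": "reschedule_appointment",
  "persona": "A busy parent who speaks in short sentences",
  "goal": "Move the dentist appointment from Tuesday to Friday",
  "system_prompt": "You are a scheduling assistant for a dental clinic.",
  "tools": [{"name": "reschedule_appointment", "required": true}],
  "rubrics": [
    "The agent confirms the new date before ending the call",
    "The agent calls reschedule_appointment exactly once"
  ],
  "max_turns": 10
}
"""

# Fields this sketch treats as mandatory (again, an assumption).
REQUIRED_FIELDS = ("name", "persona", "goal", "system_prompt", "rubrics")


def load_scenario(text):
    """Parse a scenario and check that it states a goal and rubrics.

    Note what is absent: there is no "expected_output" field. The scenario
    describes what should be achieved, not the exact words of the agent.
    """
    scenario = json.loads(text)
    missing = [name for name in REQUIRED_FIELDS if name not in scenario]
    if missing:
        raise ValueError(f"scenario is missing fields: {missing}")
    if not scenario["rubrics"]:
        raise ValueError("scenario needs at least one rubric")
    return scenario


scenario = load_scenario(SCENARIO_JSON)
print(scenario["name"], "with", len(scenario["rubrics"]), "rubrics")
```

Because the scenario carries rubrics instead of expected strings, the same file stays valid even when the agent phrases its answers differently from run to run.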

Each test scenario is a JSON file that specifies details such as the user's goal, the system prompt, and the evaluation rubrics, and it can also describe a user persona and tool configurations. The framework then runs the conversation automatically with three components: a user simulator that plays the caller, Nova Sonic as the voice agent under test, and an LLM judge that evaluates the result. Because the judge scores the conversation against predefined criteria rather than exact string matches, the evaluation tolerates the non-deterministic wording of voice agent responses. The harness also includes a model registry that maps short aliases to full model IDs, so that configurations remain consistent even when model versions change.
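
The flow described above (alias lookup, simulated user, agent, judge) can be sketched roughly as follows. Everything here is a stand-in: the registry entries are placeholder IDs rather than real Bedrock model IDs, and `scripted_user`, `stub_agent`, and `keyword_check` are hypothetical stubs that replace the user simulator, Nova Sonic, and the LLM judge so that the example runs offline.

```python
# Placeholder alias -> model ID mapping. The IDs are invented for this
# sketch; a real registry would hold actual Bedrock model IDs.
MODEL_REGISTRY = {
    "voice-agent": "example.voice-model-v1:0",
    "judge": "example.judge-model-v1:0",
}


def resolve_model(name):
    """Return the full ID for an alias; pass unknown names through unchanged."""
    return MODEL_REGISTRY.get(name, name)


def run_conversation(simulate_user, agent_reply, max_turns):
    """Alternate user and agent turns until the user stops or max_turns is hit."""
    turns = []
    for _ in range(max_turns):
        user_text = simulate_user(turns)
        if user_text is None:  # the simulated user has nothing more to say
            break
        turns.append(("user", user_text))
        turns.append(("agent", agent_reply(turns)))
    return turns


def judge(turns, rubrics, check):
    """Score the transcript against each rubric instead of an exact string."""
    return {rubric: check(turns, rubric) for rubric in rubrics}


def scripted_user(lines):
    """Stub user simulator: replays fixed lines, then signals the end."""
    remaining = iter(lines)
    return lambda turns: next(remaining, None)


def stub_agent(turns):
    """Stub agent standing in for the voice model under test."""
    last_user_text = turns[-1][1]
    if "friday" in last_user_text.lower():
        return "Done, I moved your appointment to Friday."
    return "Sure, which day works for you?"


def keyword_check(turns, rubric):
    """Stub judge: a rubric passes if its keyword appears in any agent turn."""
    return any(rubric.lower() in text.lower()
               for role, text in turns if role == "agent")


turns = run_conversation(
    scripted_user(["I need to reschedule.", "Friday please."]),
    stub_agent,
    max_turns=5,
)
verdicts = judge(turns, ["friday", "refund"], keyword_check)
print(resolve_model("voice-agent"), verdicts)
```

In a real run the keyword check would be replaced by a call to a judge model with the rubric text, and only the registry would need updating when a model version changes.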

Source: awsml