AWS and LangChain have developed a framework for evaluating AI agents so that developers can test and improve their reliability. It covers tracking errors, monitoring agents in production, and improving them continuously. The post combines insights from LangChain's work on deep agents with Anthropic's guide to evaluating AI agents. Developers can apply five evaluation patterns, build offline evaluations with pytest and LangSmith, and set up online monitoring for production.

The example is a text-to-SQL agent running on Amazon Bedrock with Amazon Nova 2 Lite, a reasoning model with a 1-million-token context window. Nova 2 Lite supports extended thinking with configurable budget levels and handles instruction following, function calling, and code generation well, which makes it suitable for agentic workloads.

An evaluation tests an AI system by providing an input, applying grading logic to the output, and measuring success. Agents make this harder: their behavior is non-deterministic, errors made in early steps propagate to later ones, and they may reach a valid result by a creative route that a rigid check would not anticipate. The post outlines three types of graders: code-based, model-based (LLM-as-judge), and human. Code-based graders verify outputs with deterministic logic, model-based graders use another LLM to assess outputs, and human graders are used for calibration. The post also explains how to combine these graders for effective evaluation.

*Source: [awsml](https://aws.amazon.com/blogs/machine-learning/evalning-deep-agents-using-langsmith-on-aws/)*
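
To make the idea of a code-based grader concrete, here is a minimal sketch for a text-to-SQL agent, written in pytest style. It is not the code from the post. The `orders` table and the `make_db`, `run_query`, and `grade_sql` helpers are hypothetical, and the candidate query is a hardcoded stand-in for agent output. The Bedrock and LangSmith integration is left out so the block runs with only the Python standard library.

```python
import sqlite3


def make_db():
    """Build a small in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
        INSERT INTO orders VALUES (1, 'ada', 10.0), (2, 'bob', 25.5), (3, 'ada', 4.5);
        """
    )
    return conn


def run_query(conn, sql):
    """Execute SQL and return the rows sorted, so row order does not matter."""
    return sorted(conn.execute(sql).fetchall())


def grade_sql(conn, candidate_sql, reference_sql):
    """Code-based grader: pass when both queries return the same set of rows.

    Grading the result rather than the SQL text lets a differently written
    but equivalent query still pass.
    """
    try:
        got = run_query(conn, candidate_sql)
    except sqlite3.Error as exc:
        return {"passed": False, "reason": f"candidate query failed: {exc}"}
    expected = run_query(conn, reference_sql)
    passed = got == expected
    return {"passed": passed, "reason": "rows match" if passed else "rows differ"}


REFERENCE = "SELECT customer, SUM(total) FROM orders GROUP BY customer"


def test_equivalent_query_passes():
    # Stand-in for agent output: different aliases and ordering, same result.
    candidate = (
        "SELECT o.customer, SUM(o.total) FROM orders AS o "
        "GROUP BY o.customer ORDER BY o.customer DESC"
    )
    assert grade_sql(make_db(), candidate, REFERENCE)["passed"]


def test_wrong_aggregate_fails():
    candidate = "SELECT customer, COUNT(*) FROM orders GROUP BY customer"
    assert not grade_sql(make_db(), candidate, REFERENCE)["passed"]


def test_invalid_sql_fails_gracefully():
    result = grade_sql(make_db(), "SELEC customer FROM orders", REFERENCE)
    assert not result["passed"]
```

Because the check is deterministic, it is cheap to run on every change. A model-based or human grader could then be reserved for cases this kind of check cannot decide, such as judging the quality of a natural-language explanation.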