AI Agents and a New Approach to Testing

In the world of Generative AI, flashy demos are commonplace, but stable, reliable production systems are rare. Given almost anyone with access to ChatGPT can quickly create an engaging piece of functionality, what is stopping them from taking those systems into production?

Simply put, it’s simple to stand up a system that performs a narrow range of tasks with reasonable performance a reasonable proportion of the time, but “reasonable” doesn’t meet the bar for most customers. This is where the “AI Reliability Stack” comes in. These are the pieces of functionality that give our Intelligence stack the help, and sometimes the redirection, that it needs to meet the high expectations that come with production systems.

The component of the AI Reliability Stack we’re going to talk about today is testing – what testing means in relation to Generative AI applications, and how it differs from testing of other “traditional” development projects.


Testing: Code vs AI

In traditional software development, code sections correspond to specific pieces of functionality. This structure allows developers to make targeted changes to certain features or behaviors with little risk of impacting other systems. In this case, regression testing—checking whether a recent code change has affected existing functionality—can be more focused and predictable.

However, with LLM-based systems, the situation is different. The nature of LLMs is inherently non-deterministic, meaning that the same input can lead to varying outputs under slightly different conditions. The models are designed to learn from vast amounts of data and produce responses based on probabilistic reasoning, so changes to one part of the model can unintentionally affect other parts that seem semantically unrelated. A tweak to how the model processes a specific query type could lead to unexpected shifts in handling an entirely different set of instructions. This interconnectedness presents a considerable challenge when it comes to testing.


How to Test GenAI

For anyone deploying AI agents in production, it’s crucial to consider testing an essential part of the development pipeline. Without rigorous testing, these systems’ performance can be unpredictable and inconsistent—two unacceptable characteristics in production environments. Testing helps ensure that your AI models perform in controlled demo environments and under the wide range of conditions they will encounter in the real world.

However, the traditional methods of software testing—unit testing, regression testing, integration testing, and even red-teaming (the process of challenging the system by trying to break it)—are different when applied to LLMs.

Unit testing generally represents testing specific, precise use cases. With traditional software, this might mean exposing an edge case like “what if a user tries to log in while they’re not connected to the internet?”. With LLM,s this might represent a user attempting to get a chat agent to divulge company secrets, or asking an AI Weatherperson to order them a pizza.

LLM Unit testing means exposing your agent to inputs and monitoring whether its output conforms to what is expected in that case. To this end, many firms find it incredibly useful to implement a system that allows them to “freeze” example inputs that trip up their systems, packaging them up as a unit test to be re-run later to ensure that error does not persist or re-emerge through future iterations. However, if managed in an entirely manual process, these tests can quickly decline in usefulness if the expected interaction pattern between source and agent changes.

Functional Testing is often considered to be an end-to-end test of your system’s functionality. With LLMs, this can be reasonably complex, as these systems are designed to handle a range of different inputs. A chat agent may encounter users with varying tones and moods, or an AI conducting phone calls might encounter varying degrees of audio quality.

Testing models end-to-end is frequently done, in the early stages of development, manually. Automated testing, though, is the only way to ensure a reasonable degree of coverage. At Auril AI, we strongly believe that in order to perform end-to-end testing of an LLM, one must test with LLMs. Though the notion of an AI chatting with an AI may seem circular or redundant, this process of automatically generating and evaluating simulated end-to-end interactions with an AI designed to represent various user profiles, and having those test cases monitored by humans, is an amazing way to multiply the efficacy of the limited humans you have available to deploy in service of testing your system.

Integration Testing involves ensuring that your systems interact appropriately with one another, and with external systems.

With LLMs, Integration Tests must ensure that multiple parts of an AI-driven pipeline work together as expected. As many AI Agents are tasked with generating JSON as part of an API, tests which validate the form and content of those JSON objects as being valid and sensible are vital to the process of Integration testing when it comes to LLMs.

Furthermore, increasingly complex AI systems may involve complex Agentic architectures, wherein AI agents must interact with other AI agents in order to accomplish complex tasks. In those situations, Integration Testing is vital to ensure that the system of agents performs as well as any individual component.

These testing need to change drastically based on the type of GenAI application you’re building, and the techniques you’re deploying in service of that project. For example, if you’re deploying a RAG application, you might want to deploy tests which perform token classification in order to ensure information being generated by your agent is in fact coming from your sources, and that it’s being done in an appropriate way.


At Auril AI, we continuously track and adapt to the evolving landscape of tools and techniques for testing AI systems. We understand that bringing AI to market is about more than building impressive models—it’s about creating systems that perform reliably, day in and day out, in production. This is why our approach to testing is dynamic, thorough, and focused on the specific needs of LLM-driven architectures.

If you are working on AI solutions and need a partner to help you ensure your system’s reliability, let’s connect. Our expertise in testing AI systems can help you transform your demos into dependable, production-ready tools that your customers can trust.