August 15, 2024

Gen AI 101: Testing and Monitoring (Part 4)

By Lawrence Liu

The hype around generative AI has shifted the industry narrative overnight from the big data era of “every company is a data company” to the new belief that “every company is an AI company.” Despite the excitement, deploying enterprise generative AI applications to production has come along at a less-than-meteoric pace.

One main reason for the slow rollout to production is the corporate risk posed by unexpected LLM behaviors.

For a few early adopters, LLM hallucinations have already created unfavorable publicity for the companies behind these products. For example, Pak’nSave, a New Zealand grocery chain, released an AI recipe companion that suggested a recipe for chlorine gas to a user. Even though this example involved some user foul play, it is not a bold statement to say that a customer-facing AI chatbot can be a hazardous product to release into the wild without thorough testing and monitoring.

This blog provides an overview of applying software engineering best practices to build a test validation and monitoring suite for a non-deterministic generative AI application.

Why is Testing and Monitoring Necessary?

In traditional machine learning, data pipelines feeding into the model have queries written with idempotency in mind, and data validation checks are performed before and after inference to confirm an expected output. 

Generative AI, on the other hand, is an entirely different animal. From an ML practitioner’s perspective, an LLM is treated as a black box that accepts natural language prompts as input and returns responses in various formats, including natural language, code, and JSON. The response of an LLM can vary between runs despite being given the exact same prompt.

While generative AI may sometimes feel like black magic, it is important to recognize that it is still software. Well-built software applications require a comprehensive testing suite to validate that the underlying code operates as expected. 

Establishing a test validation suite is imperative for generative AI applications to enable rapid iterative development, and implementing a monitoring suite is equally important for gaining visibility into an application’s performance in production.

What Are the Components of a Testing and Monitoring Suite?

Attribution

A generative AI application is constructed from modular components that transfer, process, and retrieve data: the original user query, the processing performed before prompting the LLM (including retrieving relevant information from the data store), re-ranking of the retrieved information, and post-processing of the LLM output.

Logging needs to occur at each of these components, with details including what data was fed to each step and how it was transformed. Logging is not a new concept introduced by generative AI applications; it is simply an industry standard for all software applications. 

Without logging, it would be difficult to determine which component of the application contributed to unexpected behavior. Certain aspects of a test suite will also rely on these logs to validate expected behavior.
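
As a minimal sketch of what such logging could look like (the component names, fields, and helper function here are hypothetical, not a prescribed schema):

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag_app")

def log_step(trace_id: str, component: str, inputs, outputs) -> None:
    """Emit one structured log record per pipeline component."""
    logger.info(json.dumps({
        "trace_id": trace_id,    # ties all steps of a single request together
        "component": component,  # e.g. "retrieval", "rerank", "llm_call"
        "timestamp": time.time(),
        "inputs": inputs,
        "outputs": outputs,
    }))

# Hypothetical usage inside a single request flow
trace_id = str(uuid.uuid4())
log_step(trace_id, "user_query", {"query": "How do I reset my password?"}, None)
log_step(trace_id, "retrieval", {"query": "reset password", "top_k": 5}, {"chunk_ids": ["doc-12", "doc-48"]})
log_step(trace_id, "llm_call", {"prompt_tokens": 812}, {"response_tokens": 164})
```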

Validating the Retrieval Mechanism

When developing a RAG application, the retrieval mechanism, designed to return the Top K most relevant pieces of context to help answer a user query, plays a major role in the AI application’s performance. “K” refers to the number of relevant chunks pulled. Including too many chunks in the prompt context can lead to hallucinations. This is especially prevalent with smaller LLMs, which are prone to interpreting multiple chunks as information from a single source rather than as distinct documents that may be unrelated.

A retrieval mechanism that fails to retrieve the most relevant chunks of information can cause LLMs to produce a sub-par response or completely hallucinate. In this context, the old chestnut “Garbage In, Garbage Out” has never been more true. To validate the retrieval mechanism, build a test set from a curated dataset containing user prompts and their corresponding known most relevant chunks. It is important here to understand what the Top K chunk limit is for your LLM and adjust the test set and retrieval mechanism appropriately. Test the retrieval mechanism by inputting the prompts and checking if the retrieved chunks match the expected ones. 
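
As a sketch of what such a test could look like, assuming a hypothetical application-specific retrieve function that returns chunk IDs and a hand-curated test set, a simple recall-at-K check might be written as:

```python
# Hypothetical test set: each entry maps a user prompt to the chunk IDs a
# human reviewer marked as the most relevant context for that prompt.
TEST_SET = [
    {"prompt": "What is the parental leave policy?", "expected_chunks": {"hr-policy-07"}},
    {"prompt": "How do I rotate my API keys?", "expected_chunks": {"security-guide-03", "security-guide-04"}},
]

def recall_at_k(retrieve, k: int = 5) -> float:
    """Fraction of expected chunks that appear in the top-K retrieved results."""
    hits, total = 0, 0
    for case in TEST_SET:
        retrieved = set(retrieve(case["prompt"], k=k))  # retrieve() is application-specific
        hits += len(case["expected_chunks"] & retrieved)
        total += len(case["expected_chunks"])
    return hits / total

# Example: fail the test run if recall drops below an agreed-upon threshold.
# assert recall_at_k(my_retriever) >= 0.9
```

The same harness can be rerun against different retrieval strategies (approximate nearest neighbor, exact nearest neighbor, hybrid search) to compare them on equal footing.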

Retrieval mechanisms are built-in features of search engine and vector database offerings. Testing the retrieval mechanism is not meant to validate some in-house custom-built logic but to learn and validate the best retrieval mechanism for a given scenario. For example, suppose a vector database’s default retrieval strategy (an approximate nearest neighbor algorithm) performed poorly against the test set. Using the same test set, developers can test other retrieval strategies, such as hybrid search and exact nearest neighbor, to find the strategy that best balances accuracy and speed.

Validating the Prompt and LLM Output

Building a test suite to validate the accuracy of an LLM is an overlooked portion of generative AI that is currently playing catchup to the rest of the field. LLMs themselves are non-deterministic black boxes, collections of weights whose responses are only fairly consistent from run to run.

It is certainly possible to construct a test data set of prompts and expected model outputs. However, given the natural language aspect of LLMs, comparing the actual output to the expected output is not as simple as matching the exact criteria. After all, the beauty of language is that words can convey the same message in many different ways, all of which can be accurate. 

One approach to handling the nuances of language is to use embedding models to determine the semantic similarity between the actual and expected model outputs by comparing their vectors. A similar but more sophisticated approach is BERTScore, which leverages a contextual embedding model to produce precision, recall, and F1 values when comparing a reference text and a candidate text. BERTScore correlates well with human judgment but can perform poorly when the comparison requires understanding idiomatic expressions or cultural references.
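
As a rough sketch of both approaches (assuming the sentence-transformers and bert-score packages; the model choice and any thresholds are placeholders to tune on a labeled sample):

```python
from sentence_transformers import SentenceTransformer, util
from bert_score import score

expected = "Refunds are issued within 5 business days of the return being received."
actual = "Once we receive your return, you should see the refund within five business days."

# Embedding-based semantic similarity: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
embeddings = model.encode([expected, actual], convert_to_tensor=True)
cosine = util.cos_sim(embeddings[0], embeddings[1]).item()

# BERTScore: precision, recall, and F1 computed from contextual token embeddings.
P, R, F1 = score([actual], [expected], lang="en")

print(f"cosine similarity: {cosine:.3f}, BERTScore F1: {F1.item():.3f}")
# A test might assert that both scores stay above thresholds tuned on labeled examples.
```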

A novel validation method that is gaining popularity involves employing an LLM as a judge to evaluate the responses of another LLM. The LLMs serving as judges should be stronger models compared to the models being judged. This concept is analogous to the role of a teacher who ideally possesses greater knowledge than the student whose work they are grading. 

AI validation platforms such as Galileo and Arize use LLMs under the hood to judge model outputs against metrics including the following (a minimal judging sketch follows the list):

  • Context adherence: A metric that indicates whether the response is based on the relevant context provided in the prompt or hallucinated by the LLM.

  • Completeness: A metric that indicates whether the response addresses all parts of the question.
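
As a minimal sketch of the judging pattern (illustrated with the OpenAI Python client; the prompt wording, model name, and 1-5 scale are assumptions, not how Galileo or Arize implement their metrics):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Context provided to the assistant:
{context}

User question: {question}
Assistant answer: {answer}

Rate the answer from 1 to 5 on:
- context_adherence: is every claim supported by the provided context?
- completeness: does the answer address all parts of the question?
Respond with JSON: {{"context_adherence": <int>, "completeness": <int>}}"""

def judge(context: str, question: str, answer: str) -> dict:
    """Ask a stronger model to score another model's answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # the judge should be stronger than the model being judged
        response_format={"type": "json_object"},  # request structured JSON output
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
    )
    return json.loads(response.choices[0].message.content)
```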

It is important to monitor and evaluate LLM responses in production. However, using LLMs as judges can end up costing more than the core generative AI application itself, especially if every model response is evaluated. Stronger foundation models are better evaluators, but they are also more expensive and can increase application latency.

To decrease long-term validation costs, teams can consider judging only a sample of model responses in production, fine-tuning a smaller model for the specific evaluation task, or using a weaker judge with a carefully tuned validation prompt. A tuned prompt with GPT-3.5 can deliver judging performance comparable to GPT-4 at one-tenth of the cost and three times the speed.
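
The sampling idea in particular is simple to apply: gate the judge call behind a sampling rate so only a fraction of production traffic incurs the evaluation cost. A minimal sketch (the 10% rate is an arbitrary placeholder, and judge() refers to the earlier sketch):

```python
import random

EVAL_SAMPLE_RATE = 0.10  # judge roughly 10% of production responses

def maybe_judge(context: str, question: str, answer: str) -> dict | None:
    """Run the (relatively expensive) LLM judge on only a sample of responses."""
    if random.random() < EVAL_SAMPLE_RATE:
        return judge(context, question, answer)  # judge() from the sketch above
    return None
```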

In addition to monitoring model output accuracy, it is equally important to monitor metrics that gauge sentiment—ranging from positive to neutral to negative—and toxicity. Implementing monitoring for such metrics will help product teams understand how undesirable model responses are being elicited and build guardrails to prevent improper behavior in production.

Validating the Data Engineering Strategy

There is no one-size-fits-all approach to chunking unstructured data. Chunks that are too large introduce unnecessary length to the prompt, and chunks that are too small may not provide enough context. Different chunking strategies are discussed in our data engineering blog, but that post does not cover how to evaluate them.

To test a chunking strategy, load the data store with the vectors created from the chunking strategy, feed a test dataset of user prompts to the application, and then compare the model output to the retrieved information in the prompt that was provided to the model. 

Using the eye test, you can guesstimate whether the chunking strategy performed adequately based on whether each chunk provided enough context, but this is a slow and manual process. To speed things up, just as an LLM-as-a-judge can evaluate output accuracy with metrics such as completeness and context adherence, it can also evaluate the chunking strategy with metrics including the following (a rough heuristic sketch follows the list):

  • Chunk Attribution: A metric that counts how many of the retrieved chunks were actually used in constructing the answer. This metric can be used to decide whether more or fewer chunks are needed to provide relevant context.

  • Chunk Utilization: A metric that calculates the percentage of a chunk’s information that was used in constructing the answer. Low chunk utilization indicates that smaller chunk sizes would perform just as well, while high chunk utilization is a signal to increase the chunk sizes in the data store.
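
Platforms typically compute these metrics with an LLM judge, but a rough, purely heuristic approximation based on token overlap (not how Galileo or Arize calculate them) can illustrate the idea:

```python
def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def chunk_attribution(chunks: list[str], answer: str, min_overlap: int = 3) -> int:
    """Count retrieved chunks that share at least `min_overlap` tokens with the answer."""
    answer_tokens = _tokens(answer)
    return sum(1 for chunk in chunks if len(_tokens(chunk) & answer_tokens) >= min_overlap)

def chunk_utilization(chunk: str, answer: str) -> float:
    """Fraction of a chunk's tokens that also appear in the answer."""
    chunk_tokens = _tokens(chunk)
    if not chunk_tokens:
        return 0.0
    return len(chunk_tokens & _tokens(answer)) / len(chunk_tokens)
```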

Validation Suite Product Offerings

There are many open-source and SaaS offerings that provide generative AI validation capabilities and tooling.

For prompt-specific validation, PromptLayer is a SaaS platform that allows quick iteration for prompt engineering by providing a framework to evaluate and manage prompts decoupled from the main application codebase. Promptfoo, an open-source offering, provides a framework that allows developers to use built-in, LLM-graded, or custom metrics to evaluate prompts and LLM performance.

For logging and monitoring a generative AI application more holistically, Phoenix, an open-source offering by Arize, allows developers to evaluate and troubleshoot an application locally in a notebook. However, Phoenix is only meant to be used as a tool in development. For applications in production, Arize has built LLM monitoring and observability functionality into its SaaS platform, which provides evaluation metrics on how the separate systems in the application are performing as well as troubleshooting capabilities. Arize may be the first choice for Gen AI monitoring in production for the many enterprises that have already adopted Arize for traditional ML monitoring use cases.
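
As a minimal example of spinning up Phoenix locally in a notebook (assuming the arize-phoenix package is installed; instrumenting the application to send traces to Phoenix is omitted here):

```python
import phoenix as px

# Launch the local Phoenix UI; traces sent to it can then be explored in the browser.
session = px.launch_app()
print(session.url)
```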

Galileo is a SaaS offering designed to evaluate RAG generative AI applications by using LLM-as-a-judge to produce many of the metrics discussed above, including but not limited to completeness, correctness, context adherence, context relevance, chunk attribution, and chunk utilization.

Conclusion

Testing and monitoring are essential for a generative AI application, enabling swift iterative development and ensuring confidence in application performance in production. The different components of a generative AI application need to be tested and monitored both individually and in tandem with other components. 

Given the natural language and non-deterministic behavior of generative AI, frameworks and metrics designed to automate the testing and monitoring of AI applications have emerged to tune applications toward expected behavior. Monitoring ensures that this behavior remains consistent in production and provides the insight needed to iterate when it does not.

This blog concludes our 101 series on building generative AI applications, which covers the considerations that enterprises need to make when building production-ready applications. 

If you missed the other blogs in the series, definitely check them out!

If your business is interested in an in-depth look into how it can best leverage Gen AI, we strongly recommend signing up for one of our free generative AI workshops. In this 90-minute session, your team can virtually meet with one of our Principal Data and AI Architects to start turning your AI ideas into a plan of action.
