Evaluating our RAG-based Chatbot and Iteratively Improving its Reliability

by Eskild Eriksen, Data Engineer

Experimenting with different LLMs and Prompts and Avoiding Regression


In our ongoing efforts to improve the user experience for one of our client's web apps, we're experimenting with how to evaluate a chatbot's reliability and accuracy. We're doing this with confident-ai.com's open-source DeepEval framework, along with its underlying RAGAS evaluation metrics. The aim is to assess the chatbot's responses to common questions across different 7B-parameter open-source models - Llama2, Llama3, Gemma, and Mistral - combined with different prompts tailored to the user's role. We're also interested in comparing different embedding models and vector stores; however, those are out of scope for this blog post.

Many of us working with LLMs are familiar with the ways these models are benchmarked, such as the Massive Multitask Language Understanding (MMLU) or TruthfulQA test suites. These benchmarks are useful for understanding a model's general capabilities, but they don't necessarily reflect how a model performs on your data or in your use case. This is where the DeepEval and RAGAS evaluation metrics come in: they allow us to evaluate these models in real-world scenarios and compare different combinations of LLMs, prompts, embedding models, and vector stores.

This evaluation framework not only enables us to compare different chatbot configurations against one another but also allows us to treat these question-and-answer pairs as unit tests. This process is crucial for enhancing the chatbot's reliability, allowing us to pinpoint areas for improvement and ensure users receive the most accurate and relevant answers.

The software development industry has a long history of using tests to ensure software reliability, and we are now in the early stages of applying this concept to LLM-based applications.

Benchmarking Methodology

We wanted to design a systematic benchmarking methodology to evaluate the reliability of our chatbot for this experiment, but it was also crucial that we build a framework that could be reused to evaluate the chatbot in production.

One of the primary sources of reliable data for our chatbot is the YouTube videos our team has created for training and FAQ purposes. We transcribed these videos using OpenAI's Whisper model and were very happy with the results. We then worked with the application's subject matter experts (SMEs) to compile a list of common questions and their preferred answers, categorized by user role. The SMEs also helped us write three prompts for each of the two user roles we're targeting.
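As a rough sketch of that ingestion step, the snippet below shows how a Whisper transcript might be split into overlapping passages before embedding. The model size, file name, and chunking parameters are illustrative assumptions, not our production settings.

```python
# Transcribing a video with Whisper might look like this (commented out
# because it needs the `openai-whisper` package and a local video file):
#
#   import whisper
#   transcript = whisper.load_model("base").transcribe("faq_video.mp4")["text"]

def chunk_transcript(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split a transcript into overlapping word windows for the vector store."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side.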

At present, we don't have direct access to the help desk data, which would help us prioritize the questions most frequently asked by users. This is something we're working on and will be included in future iterations of the chatbot.

With this groundwork established, we initialized our chatbot using various combinations of LLMs (deployed via Ollama), coupled with role-specific prompts derived from the collected questions and answers. We prompted each chatbot instance to answer the questions three times to gauge the consistency and accuracy of its responses. These interactions were recorded in a purpose-built database, along with the configuration of the initialized chatbot and the documents it relied on to answer each question. Following data collection, each chatbot's response was evaluated against the expected answer using the DeepEval framework, specifically leveraging the RAGAS metrics: contextual precision and contextual recall to evaluate the chatbot's retriever, and faithfulness and answer relevancy to evaluate its generated response. We then saved the results in a separate database table for further analysis.
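The collection loop can be sketched in a few lines. The stubbed `ask` callable stands in for the real Ollama-backed chatbot, and the table schema is a simplified assumption, not our actual database layout.

```python
import itertools
import sqlite3

def run_benchmark(ask, models, prompts, questions, repeats=3):
    """Ask every (model, prompt, question) combination `repeats` times and
    record each answer together with the configuration that produced it.
    `ask(model, prompt, question)` is a stand-in for the real chatbot call."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE responses "
               "(model TEXT, prompt TEXT, question TEXT, run INTEGER, answer TEXT)")
    for model, prompt, question in itertools.product(models, prompts, questions):
        for run in range(repeats):
            db.execute("INSERT INTO responses VALUES (?, ?, ?, ?, ?)",
                       (model, prompt, question, run, ask(model, prompt, question)))
    db.commit()
    return db
```

With three prompts, 14 questions, and three repeats, this yields 126 recorded responses per LLM.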

To capture these evaluation results, DeepEval relies on a separate LLM to judge the chatbot's responses. While it might be possible to use one of the large open-source LLMs for this purpose, we ran into performance issues and opted to use gpt-3.5-turbo instead. As DeepEval points out, "evaluation requires high levels of reasoning," and not all LLMs are capable of this.


Results

Each LLM responded to 14 different questions, once for each of the three role-specific prompts, repeated three times, resulting in 126 responses per LLM.

We found that the chatbot's performance varied significantly across the different questions, leading to a lot of variability when calculating the RAGAS scores. We also found that the chatbot's performance varied substantially across the different LLMs, with some performing better than others. We found less variability across the different prompts, but still some.
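To quantify that variability, per-response scores can be aggregated into a mean and standard deviation per model and metric. A minimal sketch, with the record layout as an assumption:

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_scores(records):
    """Aggregate (model, metric, score) rows into (mean, stdev) per pair.
    A large stdev flags a model whose RAGAS scores swing between questions."""
    grouped = defaultdict(list)
    for model, metric, score in records:
        grouped[(model, metric)].append(score)
    return {key: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
            for key, vals in grouped.items()}
```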

Comparing the different LLMs, we found that Mistral performed best at generating faithful and relevant responses, while results were more of a mixed bag for retriever performance (contextual precision and recall).

RAGAS Scores

Llama3 was released during the course of this experiment and was included at the last minute :)

Of the 14 questions, six were for the first user role, and eight were for the second user role. We didn't find any significant differences in the chatbot's performance across the different user roles, but we did find that the chatbot's performance varied significantly across the different questions.

RAGAS Scores


While the results are less conclusive than we would have liked, they provide some insights into the chatbot's reliability. As we dove into each question, we identified some areas for improvement. For instance, we found that the chatbot performed well when the answer to the question came from a single YouTube video but struggled when the answer required piecing together information from multiple videos.

While it's clear that the right prompt can make a big difference, slightly tweaking the prompts didn't significantly impact the chatbot's responses. Ultimately, our prompts were likely too similar to one another; if we were to continue this experiment, we would need to generate more diverse prompts.

At the outset, we considered comparing the results to a chatbot with no vector store (i.e., no context) and a basic prompt to establish a baseline. However, we decided against this as we felt it would not provide any meaningful insights given it would undoubtedly perform poorly. Perhaps this is something to revisit in future iterations for no other reason than to demonstrate the value of the vector store and the prompts.

Iteratively Improving the Chatbot

Again, the results are not as conclusive as we would have liked. Still, they provide some insights into the chatbot's reliability across different LLMs and prompts, especially when we dive into the individual questions. That said, the real value of this evaluation framework is in its ability to improve the chatbot's reliability iteratively and avoid regression.
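Treating the question-and-answer pairs as unit tests can be as simple as asserting that no metric falls below a threshold between releases. A sketch, with the metric names and thresholds as illustrative assumptions:

```python
def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics whose score dropped below their regression threshold.
    In CI, assert the returned list is empty before shipping a new chatbot."""
    return [metric for metric, score in scores.items()
            if score < thresholds.get(metric, 0.0)]
```

Run against a stored baseline after every change to the model, prompts, or retriever, this fails the build on regression rather than discovering it in production.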

DeepEval has a great graphic in its documentation that illustrates this concept well:

Iteratively Improving the Chatbot

Software developers have long been accustomed to writing tests to ensure the reliability of their software, and for good reason: these tests are crucial for guaranteeing new features don't break existing functionality while confirming the new code does what it says it does. As an industry, we are in the early stages of applying these concepts to LLM-based applications, and productOps is excited to be at the forefront of this movement.


This exercise has been a great learning experience for our team and has provided us with valuable insights into the reliability of our chatbot. We have identified areas for improvement and have a clear path forward for iteratively improving the chatbot's reliability and accuracy. We are excited to continue this work and to see the chatbot's reliability improve.

How are you evaluating the reliability of your chatbot? We'd love to hear from you.

If you're interested in learning more about how we're using LLMs to solve real-world problems, or if you're interested in working with us to build a chatbot for your application, please reach out to us!
