Background: Why You Shouldn’t Rely on Gut Feeling in RAG
Building a RAG bot with LangChain or LlamaIndex isn’t hard. The real challenge is answering the question: “Is it actually performing well?”
When I first started, I tested manually: I would ask about 20 questions and review the results myself. This approach is labor-intensive and subjective. Sometimes the bot answers smoothly, but a slight change to the prompt can make it start “hallucinating” immediately.
RAGAS (RAG Assessment) was created to solve this problem. Instead of vague statements like “The bot works fine,” you can confidently say: “Faithfulness is 0.92/1.0.” This framework helps quantify RAG quality through concrete metrics.
RAG systems usually fail at two stages: Retrieval (finding data) and Generation (synthesizing the answer). RAGAS provides a set of metrics to scrutinize each of these links.
Installing the RAGAS Library
You’ll need a clean Python environment. RAGAS works very well with LangChain, but it can also be used on its own, regardless of which framework you build with.
pip install ragas datasets langchain-openai
RAGAS uses a powerful LLM (an OpenAI model such as GPT-4o by default) as a “judge” for scoring. To get started, configure your OpenAI API key, or point it at Ollama if you want to run locally and save costs.
import os
os.environ["OPENAI_API_KEY"] = "sk-your-key"
3 “Golden” Metrics for System Optimization
Based on practical project implementation experience, here are the 3 most important metrics you need to monitor closely.
1. Faithfulness
This metric checks whether the answer is actually based on the input data (context). It helps prevent the AI from fabricating information not found in the documents.
- How it works: RAGAS extracts the main points from the answer and cross-references them with the context. Any point without evidence results in a direct point deduction.
- Example: If the context says revenue is 10 billion, but the bot answers 12 billion, the Faithfulness score will drop close to 0.
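The claim-checking idea can be sketched with a toy scorer. Simple substring matching stands in for the LLM judge RAGAS actually uses, and `toy_faithfulness` is an illustrative name, not a RAGAS API:

```python
# Toy sketch of the faithfulness idea: score = supported claims / total claims.
# Real RAGAS uses an LLM judge to verify each extracted claim; substring
# matching is only a stand-in here.
def toy_faithfulness(claims: list[str], context: str) -> float:
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if claim.lower() in context.lower())
    return supported / len(claims)

context = "Annual revenue was 10 billion."
print(toy_faithfulness(["revenue was 10 billion"], context))  # 1.0
print(toy_faithfulness(["revenue was 12 billion"], context))  # 0.0
```

Every claim with no evidence in the context drags the ratio down, which is exactly why the 12-billion answer scores near zero.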
2. Answer Relevance
Sometimes the AI provides a factually correct answer but goes… off-topic. Answer Relevance (exposed as answer_relevancy in the library) measures how well the answer addresses the original question.
- How it works: RAGAS reverse-generates questions from the current answer. It then calculates the similarity between the generated questions and the user’s original question.
- Note: If the bot is too wordy or rambles, this score will be low even if the information is correct.
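A minimal sketch of the reverse-question idea, using bag-of-words cosine similarity in place of the real embeddings RAGAS uses (`toy_answer_relevance` is illustrative, not the library’s implementation):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def toy_answer_relevance(original_question: str, generated_questions: list[str]) -> float:
    """Average similarity between the original question and the questions
    reverse-generated from the answer (real RAGAS compares embeddings)."""
    orig = Counter(original_question.lower().split())
    sims = [cosine(orig, Counter(q.lower().split())) for q in generated_questions]
    return sum(sims) / len(sims)

q = "What is the capital of France?"
print(toy_answer_relevance(q, ["Which city is the capital of France?"]))  # ≈ 0.77
```

If the answer rambles, the questions regenerated from it drift away from the original one, and the average similarity (and thus the score) falls.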
3. Context Precision
This metric is specifically for evaluating the Retrieval component. It checks whether the retrieved text chunks contain the necessary information and if the priority order is correct.
Suppose you retrieve the Top 5 relevant documents. If the most important information sits at position 5 instead of position 1, the Context Precision score will drop significantly.
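The rank sensitivity can be illustrated with a toy precision@k calculation, averaging precision at each rank that holds a relevant chunk (`toy_context_precision` is an illustration of the idea, not the exact RAGAS implementation):

```python
# Toy sketch of context precision's rank sensitivity.
# relevance_flags[i] is True if the chunk at rank i+1 is relevant.
def toy_context_precision(relevance_flags: list[bool]) -> float:
    hits, total = 0, 0.0
    for rank, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision@rank at this relevant chunk
    return total / hits if hits else 0.0

print(toy_context_precision([True, False, False, False, False]))  # 1.0
print(toy_context_precision([False, False, False, False, True]))  # 0.2
```

The same single relevant chunk scores 1.0 at rank 1 but only 0.2 at rank 5, which is the penalty the metric is designed to apply.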
Practical Implementation
To run a test, you need to prepare a dataset consisting of: Question, AI Answer, Retrieved Context, and Ground Truth.
Below is a sample code snippet to quickly run a test suite:
from datasets import Dataset
from ragas import evaluate
# The library spells the relevance metric `answer_relevancy`.
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Each row pairs a question with the bot's answer, the retrieved chunks,
# and a reference answer (ground truth).
data_samples = {
    'question': ['What is the capital of France?', 'Who invented the light bulb?'],
    'answer': ['Paris.', 'Thomas Edison developed the commercial light bulb.'],
    'contexts': [
        ['Paris is the capital of France.'],
        ['Edison was not the first but optimized the incandescent light bulb.']
    ],
    'ground_truth': ['Paris', 'Thomas Edison']
}

dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(score.to_pandas())
In practice, I often integrate this step into the CI/CD pipeline on GitHub Actions. If the average score after a Prompt update drops below 0.8, the system automatically blocks the code merge.
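The gate itself can be a few lines of Python in the pipeline. This sketch assumes the per-metric averages have already been collected from an evaluation run; the function name and threshold are illustrative, not a RAGAS or GitHub Actions API:

```python
# Hypothetical CI gate: return exit code 1 to block the merge when the
# average metric score falls below the threshold.
THRESHOLD = 0.8

def gate(scores: dict[str, float]) -> int:
    avg = sum(scores.values()) / len(scores)
    print(f"average score: {avg:.3f}")
    return 0 if avg >= THRESHOLD else 1

# Scores as they might come out of a previous evaluate() run.
print(gate({"faithfulness": 0.92, "answer_relevancy": 0.85, "context_precision": 0.78}))  # 0 (passes)
```

In a workflow step, the returned code would be passed to sys.exit so a failing average stops the job and blocks the merge.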
Hard-Earned Tips When Using RAGAS
After several projects, I’ve gathered some experience to avoid wasting money and to get accurate results:
- Curate your test set: Don’t evaluate the entire database of thousands of questions. Select about 30-50 representative questions (Golden Dataset) to run periodically.
- Prioritize GPT-4o: While GPT-3.5 is cheaper, its ability to score Faithfulness is quite poor. Using a weak model as a judge will lead to seriously biased results.
- Control costs: Scoring 100 samples with GPT-4o can cost around $1-2. Consider the frequency of testing to optimize your budget.
- Handling low scores: If Context Precision is low, review your Chunking or Embedding techniques. If Faithfulness is low, tighten your System Prompt.
What gets measured gets improved. Instead of saying “I think the bot is fine,” provide your clients with a report featuring clear metric charts. Good luck optimizing your bot!

