A Guide to Automated LLM Unit Testing with DeepEval

Artificial Intelligence tutorial - IT technology blog

2 AM, the staging server flagged an error. The chatbot was giving wrong answers — not a crash, not an exception, just plain nonsense. The response looked technically fine, but the content was completely hallucinated. No logs caught it. No tests failed.

That was when I started seriously looking for a way to test LLMs the same way you test regular code. DeepEval is what I found.

Up and Running in 5 Minutes

No need to wade through lengthy docs. Install and run immediately:

pip install deepeval
export OPENAI_API_KEY="sk-..."  # Needed by the default judge model
deepeval login  # Optional: view results on the Confident AI dashboard

Create a test_chatbot.py file:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What is Redis used for?",
        actual_output="Redis is an in-memory data store used for caching, session management, and pub/sub."
    )
    assert_test(test_case, [metric])

Run it:

deepeval test run test_chatbot.py

Done. A passing test means the answer is relevant to the question. A failing test means the AI is going off-topic — exactly the kind of bug that no unit test could catch before.

How DeepEval Works

Instead of asserting output using string matching (which is brittle with LLMs), DeepEval uses a separate LLM, a judge model, to evaluate quality against specific criteria. Each metric calls the judge model at least once or twice, often more, and a test takes roughly 2-5 seconds.
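To see what that means for a whole suite, here's a back-of-the-envelope estimate. It's only a sketch: `estimate_suite` is a hypothetical helper, the defaults are the rough per-test figures mentioned above (treat them as lower bounds), and it assumes tests run one after another with no parallelism.

```python
def estimate_suite(num_tests, calls_per_test=(1, 2), seconds_per_test=(2.0, 5.0)):
    """Return (min, max) judge calls and sequential wall-clock seconds.

    The defaults are rough figures, not measurements of any particular setup.
    """
    calls = (num_tests * calls_per_test[0], num_tests * calls_per_test[1])
    seconds = (num_tests * seconds_per_test[0], num_tests * seconds_per_test[1])
    return {"judge_calls": calls, "seconds": seconds}


# A hypothetical 50-test suite
estimate = estimate_suite(num_tests=50)
print(estimate)  # {'judge_calls': (50, 100), 'seconds': (100.0, 250.0)}
```

Even a modest suite can take a few minutes and a hundred or so judge calls, which is why it pays to be selective about what runs when.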

Each test consists of three components:

  • LLMTestCase — the input/output data for a single test case
  • Metric — the evaluation criterion (relevancy, faithfulness, hallucination…)
  • Threshold — the pass/fail cutoff (0.0 to 1.0)
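Here's how those three pieces fit together, as a toy model in plain Python. This is not DeepEval's implementation: `ToyTestCase`, `ToyMetric`, and `fake_judge` are invented names, and the judge is a hard-coded stub standing in for the LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTestCase:
    input: str
    actual_output: str


@dataclass
class ToyMetric:
    name: str
    threshold: float                       # pass/fail cutoff, 0.0 to 1.0
    judge: Callable[[ToyTestCase], float]  # stand-in for the judge LLM call

    def measure(self, case: ToyTestCase) -> float:
        return self.judge(case)

    def passes(self, case: ToyTestCase) -> bool:
        # Assumed rule for this sketch: a score at or above the threshold passes
        return self.measure(case) >= self.threshold


def fake_judge(case: ToyTestCase) -> float:
    # Hard-coded stub; a real judge would be an LLM scoring the answer
    return 0.9 if "redis" in case.actual_output.lower() else 0.1


metric = ToyMetric(name="relevancy", threshold=0.7, judge=fake_judge)
on_topic = ToyTestCase("What is Redis used for?", "Redis is an in-memory data store used for caching.")
off_topic = ToyTestCase("What is Redis used for?", "Paris is the capital of France.")

print(metric.passes(on_topic))   # True
print(metric.passes(off_topic))  # False
```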

The Most Important Metrics

AnswerRelevancyMetric — does the answer relate to the question? Use this when testing general-purpose chatbots.

from deepeval.metrics import AnswerRelevancyMetric
metric = AnswerRelevancyMetric(threshold=0.7)

FaithfulnessMetric — the most critical metric for RAG pipelines. It checks whether the answer is faithful to the provided context, catching exactly the kind of hallucination that AnswerRelevancy misses.

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Does Redis support persistent storage?",
    actual_output="Redis does not persist any data to disk.",  # Wrong!
    retrieval_context=[
        "Redis supports RDB snapshots and AOF logging for persistent storage."
    ]
)
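metric = FaithfulnessMetric(threshold=0.8)
metric.measure(test_case)  # Calls the judge model
print(metric.score, metric.is_successful())  # Should print a low score and False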

The test case above will fail because the actual_output contradicts the context — exactly the hallucination I ran into at 2 AM.

ContextualRelevancyMetric — evaluates retrieval quality, not generation. When FaithfulnessMetric fails, this metric helps you pinpoint whether the problem lies in the retriever or the generator.
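The triage logic can be sketched in a few lines. This is just one way to read the two scores, not something DeepEval does for you: `diagnose_rag_failure` is a hypothetical helper, and the 0.7 relevancy cutoff is an assumption (0.8 for faithfulness matches the example above).

```python
def diagnose_rag_failure(contextual_relevancy, faithfulness,
                         relevancy_threshold=0.7, faithfulness_threshold=0.8):
    """Guess which half of a RAG pipeline to inspect first."""
    if contextual_relevancy < relevancy_threshold:
        return "retriever"   # the retrieved context itself is off-topic
    if faithfulness < faithfulness_threshold:
        return "generator"   # context was fine, but the answer strayed from it
    return "none"


print(diagnose_rag_failure(contextual_relevancy=0.35, faithfulness=0.40))  # retriever
print(diagnose_rag_failure(contextual_relevancy=0.90, faithfulness=0.40))  # generator
```

Checking the retriever first is a deliberate choice here: if the context is irrelevant, a low faithfulness score says little about the generator.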

GEval — a custom metric defined in natural language. The most powerful option, but also the most token-intensive:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Is the answer technically accurate?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)

Integrating into a Real pytest Workflow

DeepEval integrates natively with pytest — no new API to learn:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Assume this is your function that calls the LLM
from your_app import get_rag_response

@pytest.mark.parametrize("question,context,expected_topics", [
    (
        "How is Docker different from a VM?",
        ["Docker uses container isolation, shares the kernel with the host OS, and is lighter than a VM."],
        ["container", "kernel"]
    ),
    (
        "What is Kubernetes used for?",
        ["Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containers."],
        ["orchestration", "deploy"]
    ),
])
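def test_rag_quality(question, context, expected_topics):
    response = get_rag_response(question, context)

    # Cheap deterministic pre-check before spending judge tokens:
    # at least one expected topic should appear in the answer
    assert any(topic in response.lower() for topic in expected_topics)
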
    test_case = LLMTestCase(
        input=question,
        actual_output=response,
        retrieval_context=context
    )
    
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8)
    ])

Run it just like regular pytest:

deepeval test run test_rag_quality.py -v
# or
pytest test_rag_quality.py  # Plain pytest also works, since assert_test raises on failure

Dataset-based Evaluation

Have more than 10 test cases? EvaluationDataset keeps things organized:

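from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Assume this is your function that calls the LLM
from your_app import get_response

# If your DeepEval version rejects test_cases=, create the dataset empty
# and add cases with dataset.add_test_case(...)
dataset = EvaluationDataset(test_cases=[
    LLMTestCase(
        input="How does Git rebase differ from merge?",
        actual_output=get_response("How does Git rebase differ from merge?"),
        expected_output="Rebase rewrites commit history, while merge creates a new merge commit."
    ),
    LLMTestCase(
        input="How does SSH key-based authentication work?",
        actual_output=get_response("How does SSH key-based authentication work?"),
        expected_output="Uses a public/private key pair; the server verifies using the public key."
    )
])

# Runs every metric against every test case
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])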

CI/CD Integration

Here’s the workflow I run in production. It triggers on every PR that modifies a prompt or LLM code:

# .github/workflows/llm-eval.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'app/llm/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install deepeval anthropic
      
      - name: Run LLM evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # DeepEval uses GPT as the default judge
        run: deepeval test run tests/llm/

The result: every prompt-changing PR has to pass a quality gate. No more shipping to production only to discover the chatbot is talking nonsense.

Practical Tips

1. Choose a Judge Model That Fits Your Budget

DeepEval defaults to an OpenAI model as the judge: GPT-4o in the version I used, though the default can change between releases. That’s expensive. You can override it:

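from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models import GPTModel

# Use GPT-4o-mini, far cheaper per token than GPT-4o
# (check current pricing for the exact ratio)
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=GPTModel(model="gpt-4o-mini")
)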

Or use Claude as the judge if your app runs on OpenAI, so the judge isn't grading output from its own model family:

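import anthropic
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.models.base_model import DeepEvalBaseLLM

class ClaudeJudge(DeepEvalBaseLLM):
    MODEL = "claude-haiku-4-5-20251001"

    def __init__(self):
        self.client = anthropic.Anthropic()
        self.async_client = anthropic.AsyncAnthropic()

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        message = self.client.messages.create(
            model=self.MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    async def a_generate(self, prompt: str) -> str:
        # Use the async client so concurrent evaluations don't block the event loop
        message = await self.async_client.messages.create(
            model=self.MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    def get_model_name(self):
        return "claude-haiku"

# Pass an instance wherever a metric accepts `model`
metric = AnswerRelevancyMetric(threshold=0.7, model=ClaudeJudge())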

2. Don’t Run Every Metric Every Time

Each metric means at least 1-2 LLM calls, which costs money and time. Tier your tests:

  • Fast unit tests (run on every commit): only AnswerRelevancyMetric
  • Integration tests (run on every PR): add FaithfulnessMetric
  • Full eval (run weekly or before a release): all metrics + custom GEval
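One way to wire up the tiers above is a small lookup keyed by an environment variable. Treat this as a sketch: `EVAL_TIER`, `METRICS_BY_TIER`, and `metric_names_for` are names I'm assuming for illustration, and the function returns metric names only, so a real suite would map them to metric instances.

```python
import os

# Tier -> metric names, mirroring the three tiers described above
METRICS_BY_TIER = {
    "commit": ["AnswerRelevancyMetric"],
    "pr": ["AnswerRelevancyMetric", "FaithfulnessMetric"],
    "full": [
        "AnswerRelevancyMetric",
        "FaithfulnessMetric",
        "ContextualRelevancyMetric",
        "GEval",
    ],
}


def metric_names_for(tier=None):
    """Pick the metric list for a tier, defaulting to the cheapest one."""
    # EVAL_TIER is a hypothetical variable your CI would set
    tier = tier or os.environ.get("EVAL_TIER", "commit")
    if tier not in METRICS_BY_TIER:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(METRICS_BY_TIER)}")
    return METRICS_BY_TIER[tier]


print(metric_names_for("pr"))  # ['AnswerRelevancyMetric', 'FaithfulnessMetric']
```

Defaulting to the cheapest tier means a forgotten variable costs you coverage rather than money; flip the default if you'd rather fail the other way.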

3. Store Test Cases in YAML/JSON Files

Avoid hardcoding test data in Python:

import json

from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Option 1: pull a dataset stored on Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="production-eval-set")

# Option 2: load from a local JSON file
# (a list of objects whose keys match LLMTestCase fields)
with open("test_cases.json") as f:
    cases = json.load(f)

dataset = EvaluationDataset(test_cases=[
    LLMTestCase(**case) for case in cases
])
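Since `LLMTestCase(**case)` passes every JSON key straight through as a keyword argument, a typo in the file only surfaces when the test case is built. A small validation pass can catch that earlier. This is a sketch under assumptions: `load_case_kwargs` is a hypothetical helper, and the required and optional key sets are my guess at a sensible file schema based on the fields used in this post, not a DeepEval rule.

```python
import json

# Assumed schema for this sketch; adjust to the fields you actually use
REQUIRED_KEYS = {"input", "actual_output"}
OPTIONAL_KEYS = {"expected_output", "retrieval_context"}


def load_case_kwargs(json_text):
    """Parse a JSON array of cases and reject missing or misspelled keys."""
    cases = json.loads(json_text)
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - case.keys()
        unknown = case.keys() - REQUIRED_KEYS - OPTIONAL_KEYS
        if missing or unknown:
            raise ValueError(
                f"case {index}: missing={sorted(missing)} unknown={sorted(unknown)}"
            )
    return cases


SAMPLE = """
[
  {
    "input": "Does Redis support persistent storage?",
    "actual_output": "Yes, via RDB snapshots and AOF logging.",
    "retrieval_context": ["Redis supports RDB snapshots and AOF logging for persistent storage."]
  }
]
"""

cases = load_case_kwargs(SAMPLE)
print(len(cases), sorted(cases[0]))
```

Each returned dict can then be unpacked into `LLMTestCase(**case)` as before.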

4. Tune Thresholds for Your Domain

0.7 is a safe starting point, but it’s not a magic number. A better approach: run an evaluation on your golden dataset first and look at the score distribution. Then set your threshold around the 10th–15th percentile from the bottom — not a round number.
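Picking that percentile takes only a couple of lines with the standard library. The scores below are made up for illustration, and `threshold_from_scores` is a hypothetical helper; in practice you'd feed it the scores from a run over your own golden dataset.

```python
import statistics


def threshold_from_scores(scores, percentile=10):
    """Return the score at the given percentile (1-99), rounded to 2 places."""
    # n=100 yields 99 cut points; index p-1 is the p-th percentile.
    # "inclusive" interpolates within the observed min and max.
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    return round(cut_points[percentile - 1], 2)


# Made-up scores from a hypothetical golden-set run
golden_scores = [0.62, 0.71, 0.74, 0.78, 0.80, 0.83, 0.85, 0.88, 0.90, 0.93]

print(threshold_from_scores(golden_scores, percentile=10))
print(threshold_from_scores(golden_scores, percentile=15))
```

With only ten scores the percentile is heavily interpolated, so treat the result as a starting point and recompute it as the golden set grows.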

One more thing: deep technical questions tend to score lower than simple ones even when the answer is correct. Judge models also have biases — don’t treat the score as absolute truth.

After a few weeks with DeepEval, I no longer stress every time I change a system prompt. Make the change, run the tests, see how the scores shift. If FaithfulnessMetric drops below the threshold, I know immediately that the new prompt is causing more hallucinations — before it ever ships anywhere.
