A Guide to Using DeepEval to Automate Unit Testing for LLM Applications - ITFROMZERO

It's 2 a.m. and the staging server is raising alerts. The chatbot is answering wrong: no crash, no exception, it is simply talking nonsense. The responses look fine technically, but the content is pure hallucination. No log caught it. No test failed.

That was when I started looking seriously for a way to test LLMs the way we test ordinary code. DeepEval is what I found.

Up and running in 5 minutes

No need to read long documentation. Install it and try it right away:

pip install deepeval
deepeval login  # Optional: lets you view results on the Confident AI dashboard

Create a file named test_chatbot.py:

from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    metric = AnswerRelevancyMetric(threshold=0.7)
    test_case = LLMTestCase(
        input="What is Redis used for?",
        actual_output="Redis is an in-memory data store used for caching, session management, and pub/sub."
    )
    assert_test(test_case, [metric])

Run it:

deepeval test run test_chatbot.py

Done. A passing test means the answer is relevant to the question. A failing test means the AI is answering off-topic, which is exactly the kind of failure conventional unit tests could not catch.

How DeepEval works

Instead of asserting on output with string matching (which is brittle for LLMs), DeepEval uses another LLM, the judge model, to score quality against specific criteria. In my runs, each test calls the judge model once or twice and takes about 2-5 seconds; some metrics make more calls, so treat those numbers as a rough guide.

Each test has three parts:

LLMTestCase: the input/output data for one test case
Metric: the evaluation criterion (relevancy, faithfulness, hallucination…)
Threshold: the pass/fail cutoff (0.0 to 1.0)
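
To make those three parts concrete, here is a minimal offline sketch of the judge pattern. It is not DeepEval's implementation: Case, run_metric, and keyword_judge are hypothetical names, and the keyword-overlap judge is only a stand-in for a real LLM call so that the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """Toy stand-in for LLMTestCase: one input/output pair."""
    input: str
    actual_output: str


def run_metric(case: Case, judge: Callable[[str, str], float], threshold: float) -> bool:
    """Ask the judge for a score in [0, 1] and compare it with the threshold."""
    score = judge(case.input, case.actual_output)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge returned an out-of-range score: {score}")
    return score >= threshold


def keyword_judge(question: str, answer: str) -> float:
    """Placeholder judge: fraction of question words that reappear in the answer.

    A real judge would be an LLM call; this only keeps the sketch runnable offline.
    """
    q_words = {w.strip("?.,").lower() for w in question.split()}
    q_words.discard("")
    a_words = {w.strip("?.,").lower() for w in answer.split()}
    return len(q_words & a_words) / len(q_words) if q_words else 0.0


on_topic = Case("What is Redis used for?", "Redis is used for caching and pub/sub.")
off_topic = Case("What is Redis used for?", "Bananas are rich in potassium.")

print(run_metric(on_topic, keyword_judge, threshold=0.7))
print(run_metric(off_topic, keyword_judge, threshold=0.7))
```

The point of the shape is that the test never compares strings directly: it only compares a score with a threshold, so swapping the judge changes the quality of the verdict but not the test code.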

The most important metrics

AnswerRelevancyMetric: is the answer relevant to the question? Use it when testing a general-purpose chatbot.

from deepeval.metrics import AnswerRelevancyMetric
metric = AnswerRelevancyMetric(threshold=0.7)

FaithfulnessMetric: the most important one for a RAG pipeline. It checks whether the answer stays faithful to the provided context, and it catches exactly the kind of hallucination that AnswerRelevancyMetric misses.

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Does Redis support persistent storage?",
    actual_output="Redis never writes data to disk.",  # Wrong!
    retrieval_context=[
        "Redis supports RDB snapshots and AOF logging for persistent storage."
    ]
)
metric = FaithfulnessMetric(threshold=0.8)

The test case above fails against this metric because actual_output contradicts the context, which is exactly the kind of hallucination I ran into at 2 a.m.

ContextualRelevancyMetric: checks the quality of retrieval, not generation. If FaithfulnessMetric fails, this metric helps you tell whether the problem lies in the retriever or in the generator.
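
One way to read the two scores together is sketched below. This is my own triage heuristic, not something DeepEval provides: locate_rag_failure and its default thresholds are hypothetical, and the scores are assumed to have been collected from the two metrics beforehand.

```python
def locate_rag_failure(
    faithfulness: float,
    contextual_relevancy: float,
    faithfulness_threshold: float = 0.8,
    relevancy_threshold: float = 0.7,
) -> str:
    """Return 'ok', 'retriever', or 'generator' from two metric scores."""
    if contextual_relevancy < relevancy_threshold:
        # The retrieved chunks are off-topic, so look at retrieval first:
        # the generator never had good material to stay faithful to.
        return "retriever"
    if faithfulness < faithfulness_threshold:
        # The context was relevant, yet the answer contradicts or ignores it.
        return "generator"
    return "ok"


print(locate_rag_failure(faithfulness=0.35, contextual_relevancy=0.90))
print(locate_rag_failure(faithfulness=0.40, contextual_relevancy=0.20))
```

Checking retrieval first is a deliberate choice: a low faithfulness score is hard to interpret when the context itself was poor.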

GEval: a custom metric defined in natural language. It is the most flexible option, but also the most token-hungry:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Is the answer technically accurate?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7
)

Integrating into a real pytest workflow

DeepEval integrates natively with pytest, so there is no new API to learn:

import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Assume this is your own function that calls the LLM (your_app is a placeholder)
from your_app import get_rag_response

@pytest.mark.parametrize("question,context,expected_topics", [
    (
        "How does Docker differ from a VM?",
        ["Docker uses container isolation and shares the kernel with the host OS, so it is lighter than a VM."],
        ["container", "kernel"]
    ),
    (
        "What is Kubernetes used for?",
        ["Kubernetes is a container orchestration platform that automatically deploys, scales, and manages containers."],
        ["orchestration", "deploy"]
    ),
])
def test_rag_quality(question, context, expected_topics):
    response = get_rag_response(question, context)
    
    test_case = LLMTestCase(
        input=question,
        actual_output=response,
        retrieval_context=context
    )
    
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8)
    ])

Run it like any other pytest suite:

deepeval test run test_rag_quality.py -v
# or
pytest test_rag_quality.py  # Plain pytest, if you want to keep pytest's output format; assert_test raises AssertionError on failure

Dataset-based evaluation

More than 10 test cases? EvaluationDataset keeps them manageable:

from deepeval import evaluate
from deepeval.dataset import EvaluationDataset
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Hypothetical: your own function that calls the LLM
from your_app import get_response

dataset = EvaluationDataset(test_cases=[
    LLMTestCase(
        input="How does Git rebase differ from merge?",
        actual_output=get_response("How does Git rebase differ from merge?"),
        expected_output="Rebase rewrites commit history; merge creates a new merge commit."
    ),
    LLMTestCase(
        input="How does SSH key-based authentication work?",
        actual_output=get_response("How does SSH key-based authentication work?"),
        expected_output="It uses a public/private key pair; the server verifies with the public key."
    )
])

# Run every metric against every test case in the dataset
evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric(threshold=0.7)])

CI/CD integration

This is the workflow I currently run for production. It triggers whenever a PR changes a prompt or the LLM code:

# .github/workflows/llm-eval.yml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'app/llm/**'

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      
      - name: Install dependencies
        run: pip install deepeval anthropic
      
      - name: Run LLM evaluations
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}  # DeepEval's default judge is an OpenAI GPT model
        run: deepeval test run tests/llm/

The result: every PR that changes a prompt has to pass the quality gate. No more deploying to production and only then discovering that the chatbot is talking nonsense.

Practical tips

1. Pick a judge model that fits your budget

By default DeepEval uses an OpenAI GPT model as the judge (GPT-4o at the time of writing). That gets expensive. You can override it:

from deepeval.models import GPTModel

# GPT-4o-mini is much cheaper per token than GPT-4o (well over 10x at list prices)
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model=GPTModel(model="gpt-4o-mini")
)

Or use Claude as the judge if your app runs on OpenAI, to avoid a conflict of interest (a model grading its own output):

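from deepeval.models.base_model import DeepEvalBaseLLM
import anthropic

class ClaudeJudge(DeepEvalBaseLLM):
    """Custom judge that routes DeepEval's evaluation prompts to Claude."""

    def __init__(self, model_name: str = "claude-haiku-4-5-20251001"):
        self.model_name = model_name
        # Both clients read ANTHROPIC_API_KEY from the environment
        self.client = anthropic.Anthropic()
        self.async_client = anthropic.AsyncAnthropic()

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        message = self.client.messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    async def a_generate(self, prompt: str) -> str:
        # Use the async client so concurrent evaluations do not block the event loop
        message = await self.async_client.messages.create(
            model=self.model_name,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return message.content[0].text

    def get_model_name(self):
        return "claude-haiku"

# Usage: pass an instance wherever a metric accepts a model
# metric = AnswerRelevancyMetric(threshold=0.7, model=ClaudeJudge())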

2. Don't run every metric all the time

Each metric means at least one or two LLM calls, which costs money and time. Split your tests into tiers:

Fast unit tests (every commit): AnswerRelevancyMetric only
Integration tests (every PR): add FaithfulnessMetric
Full eval (weekly or before a release): all metrics plus custom GEval
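
A small sketch of how the tiers could be wired up. Everything here is an assumption for illustration: the EVAL_TIER environment variable, the tier names, and the metrics_for helper are hypothetical, and the metrics are listed by name only so the snippet runs without DeepEval installed.

```python
import os

# Metric names per tier, mirroring the split described above
TIERS = {
    "commit": ["AnswerRelevancyMetric"],
    "pr": ["AnswerRelevancyMetric", "FaithfulnessMetric"],
    "full": [
        "AnswerRelevancyMetric",
        "FaithfulnessMetric",
        "ContextualRelevancyMetric",
        "GEval",
    ],
}


def metrics_for(tier=None):
    """Return the metric names for a tier, defaulting to EVAL_TIER or 'commit'."""
    tier = tier or os.environ.get("EVAL_TIER", "commit")
    if tier not in TIERS:
        raise ValueError(f"unknown tier {tier!r}; expected one of {sorted(TIERS)}")
    return list(TIERS[tier])


print(metrics_for("pr"))
```

In a real suite the names would map to metric instances, and the CI job would set the tier (for example, a PR workflow exporting EVAL_TIER=pr).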

3. Store test cases in YAML/JSON files

Avoid hardcoding test data in Python:

import json

from deepeval.dataset import EvaluationDataset
from deepeval.test_case import LLMTestCase

# Option 1: pull a dataset stored on Confident AI
dataset = EvaluationDataset()
dataset.pull(alias="production-eval-set")

# Option 2: load from a local JSON file whose keys match LLMTestCase fields
with open("test_cases.json", encoding="utf-8") as f:
    cases = json.load(f)

dataset = EvaluationDataset(test_cases=[
    LLMTestCase(**case) for case in cases
])
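
Because each JSON object is unpacked straight into LLMTestCase, a misspelled key only surfaces as an error at construction time. The sketch below shows one possible file shape and a pre-check. The sample data, validate_cases, and the allowed-key set are my own assumptions, limited to the fields used in this article; LLMTestCase itself accepts more.

```python
import json

REQUIRED_KEYS = {"input", "actual_output"}
ALLOWED_KEYS = REQUIRED_KEYS | {"expected_output", "retrieval_context"}


def validate_cases(raw: str) -> list:
    """Parse a JSON array of test cases and reject missing or unknown keys."""
    cases = json.loads(raw)
    if not isinstance(cases, list):
        raise ValueError("top-level JSON value must be a list of test cases")
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        unknown = set(case) - ALLOWED_KEYS
        if missing or unknown:
            raise ValueError(
                f"case {index}: missing={sorted(missing)} unknown={sorted(unknown)}"
            )
    return cases


# Hypothetical contents of test_cases.json
sample = """
[
  {
    "input": "Does Redis support persistent storage?",
    "actual_output": "Yes, through RDB snapshots and AOF logging.",
    "retrieval_context": ["Redis supports RDB snapshots and AOF logging."]
  },
  {
    "input": "What is Kubernetes used for?",
    "actual_output": "It orchestrates containers.",
    "expected_output": "Container orchestration: deploy, scale, manage."
  }
]
"""

cases = validate_cases(sample)
print(len(cases))
```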

4. Tune the threshold to your domain

0.7 is a safe starting point, but it is not a magic number. A more practical approach: run the evaluation on a golden dataset first and look at the score distribution. Then set the threshold at around the 10th to 15th percentile, counting from the bottom, rather than at a round number.
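
A minimal sketch of that calculation, using the nearest-rank percentile. The scores are made-up numbers standing in for a golden-set run, and percentile_threshold is a hypothetical helper, not part of DeepEval.

```python
import math


def percentile_threshold(scores, pct=10):
    """Return the nearest-rank percentile of scores (pct in 0-100)."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores)
    # Nearest-rank: the smallest value with at least pct% of scores at or below it
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical metric scores from one run over a 20-case golden dataset
golden_scores = [
    0.91, 0.84, 0.62, 0.77, 0.95, 0.88, 0.70, 0.81, 0.93, 0.58,
    0.86, 0.79, 0.90, 0.74, 0.83, 0.89, 0.67, 0.92, 0.80, 0.85,
]

print(percentile_threshold(golden_scores, pct=10))
print(percentile_threshold(golden_scores, pct=15))
```

With a threshold chosen this way, roughly the bottom tenth of known-good answers would fail, which gives a sense of how noisy the gate will be before it is turned on in CI.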

One more note: deep technical questions often score lower than simple ones even when the answer is correct. The judge model has its own biases too, so don't treat the score as absolute truth.

After a few weeks with DeepEval, I no longer worry every time I change the system prompt. I make the change, run the tests, and see how the scores move. If FaithfulnessMetric drops below the threshold, I know right away that the new prompt is making the model hallucinate more, before it gets deployed anywhere.