Using Vision LLMs (Moondream, LLaVA) to Automatically Analyze UI Error Screenshots: A Hands-On Guide - ITFROMZERO

I once lost nearly two hours debugging a UI bug because the tester sent a screenshot with the description “it's broken”: no stack trace, no logs, just an image. That was when I started looking for a way to automate error-screenshot analysis with a Vision LLM.

Vision LLMs are no longer just for telling cats from dogs. You describe what you want in natural language (“list the UI errors in this image, return JSON”) and the model returns a response in that form. Moondream 2, for example, weighs only ~2GB yet can read error messages, HTTP status codes, and even the disabled/enabled state of individual buttons in an image.

Comparing approaches to analyzing UI error screenshots

There are three common approaches. I'll be blunt about the pros and cons of each so you can choose for yourself:

Approach 1: Traditional OCR (Tesseract, EasyOCR)

Use OCR to extract text from the image, then parse that text. It is fast, runs fully offline, and costs nothing in API fees.

Pros: Lightweight, free, no powerful GPU needed
Cons: Reads plain text only, with no understanding of context. It cannot tell “this is a disabled button” from “this is an error message” from “this is an ordinary label”. Low-quality images or unusual fonts break it quickly.
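
To make that limitation concrete, here is a minimal sketch of what post-OCR parsing tends to look like. The `ocr_text` string is a made-up stand-in for Tesseract or EasyOCR output (the OCR call itself is omitted), and `extract_status_codes` is a hypothetical helper, not part of either library.

```python
import re

# Hypothetical OCR output: plain text lines with no layout or role information.
ocr_text = """Checkout
Order #404 confirmed items: 3
Error 403: Forbidden
Submit"""

# Matches standalone three-digit numbers in the HTTP status range 100-599.
STATUS_RE = re.compile(r"\b([1-5]\d{2})\b")

def extract_status_codes(text: str) -> list[int]:
    """Return every number that looks like an HTTP status code."""
    return [int(m) for m in STATUS_RE.findall(text)]

codes = extract_status_codes(ocr_text)
print(codes)  # the order number 404 is picked up alongside the real 403
```

The regex has no way of knowing that 404 is an order number and 403 is the actual error, which is exactly the kind of context OCR text alone does not carry.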

Approach 2: Cloud Vision API (Google Vision, AWS Rekognition, Azure Computer Vision)

Call a cloud provider's API to analyze the image. Results are better than plain OCR: these services detect text, objects, and labels.

Pros: High accuracy, no model setup
Cons: Billed per call (Google Vision is ~$1.5 per 1,000 images); images must be uploaded to the cloud, which is a problem for sensitive data; still does not understand UI context well

Approach 3: Vision LLM (Moondream, LLaVA, GPT-4 Vision)

Use a large language model that can understand images. You describe the request in English and the model returns a structured analysis.

Pros: Understands UI context very well and can distinguish error messages, button states, form validation, and modal dialogs. Can return structured JSON. Can run locally (Moondream, LLaVA).
Cons: Larger models need more RAM/VRAM. Slower than plain OCR when run locally on a CPU.

When should you choose a Vision LLM?

If you only need to extract plain text from good-quality images, OCR is enough. But for screenshots of UI errors, you need more than that:

Classify the error: validation error, network error, permission error, crash
Identify the failing component: which form field, which button, which API endpoint
Extract the error code, HTTP status, and stack trace if they appear in the image
Describe the context: which screen the user is on and, as far as the screenshot shows, what they were doing when the error occurred

This is a context-understanding problem, not just a text-reading one, and Vision LLMs handle it far better.
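
One way to pin those four requirements down is a small schema that every model response gets normalized into. This is a sketch under assumptions: the field names and the `UiErrorReport` / `normalize_report` names are illustrative choices, not something any of the models defines.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# The four error classes listed above; anything else collapses to "unknown".
ERROR_TYPES = {"validation", "network", "permission", "crash"}

@dataclass
class UiErrorReport:
    error_type: str = "unknown"
    component: Optional[str] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    context: Optional[str] = None

def normalize_report(raw: dict) -> UiErrorReport:
    """Coerce a loosely structured model response into a UiErrorReport."""
    error_type = str(raw.get("error_type") or "").strip().lower()
    if error_type not in ERROR_TYPES:
        error_type = "unknown"

    def clean(key: str) -> Optional[str]:
        # Treat missing values and empty strings the same way: as None.
        value = raw.get(key)
        if value is None:
            return None
        text = str(value).strip()
        return text or None

    return UiErrorReport(
        error_type=error_type,
        component=clean("component"),
        error_code=clean("error_code"),
        error_message=clean("error_message"),
        context=clean("context"),
    )

report = normalize_report({"error_type": "Network ", "error_code": 502, "component": ""})
print(asdict(report))
```

Normalizing early means downstream code (reports, ticket fields) never has to guess whether a key is missing, empty, or spelled with different casing.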

Which model to choose: Moondream vs LLaVA vs GPT-4 Vision?

Model	Runs locally	RAM needed	Accuracy	Speed (CPU)
Moondream 2	✅	~2GB	Fairly good	Fastest
LLaVA 7B	✅	~8GB	Good	Medium
LLaVA 13B	✅	~16GB	Very good	Slow
GPT-4 Vision	❌ (API)	N/A	Best	Fast (API)

For production environments with sensitive data, Moondream 2 is a good starting point: it is compact, runs on an ordinary CPU, and is good enough for UI screenshots. Choose LLaVA 7B if you have a GPU or need higher accuracy.
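
The table can be turned into a rough selection rule. The sketch below uses the approximate RAM figures from the table; `pick_local_model` and the 2 GB `headroom_gb` safety margin are assumptions made for illustration, not measured requirements.

```python
# Approximate RAM needs in GB, taken from the comparison table, largest first.
LOCAL_MODELS = [
    ("LLaVA 13B", 16),
    ("LLaVA 7B", 8),
    ("Moondream 2", 2),
]

def pick_local_model(available_ram_gb: float, headroom_gb: float = 2.0):
    """Return the largest local model that fits, or None if nothing fits.

    headroom_gb is an assumed margin left for the OS and other processes.
    """
    usable = available_ram_gb - headroom_gb
    for name, needed_gb in LOCAL_MODELS:
        if usable >= needed_gb:
            return name
    return None

for ram in (3, 8, 10, 32):
    print(ram, "GB ->", pick_local_model(ram))
```

Treat the result as a starting point only; actual memory use depends on quantization, image size, and the runtime.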

Implementation guide: Analyzing UI error screenshots with Moondream

Step 1: Set up the environment

# Create a virtualenv
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install transformers torch pillow einops

If you don't have a GPU, install CPU-only torch first (before the command above) so pip does not pull the larger default build:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

Step 2: A basic error-screenshot analysis script

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import json

# Load the model (the first run downloads ~2GB)
model_id = "vikhyatk/moondream2"
revision = "2025-01-09"  # Pin a specific revision to avoid breaking changes

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision
)

def analyze_error_screenshot(image_path: str) -> dict:
    """Analyze a UI error screenshot and return a structured dict."""
    image = Image.open(image_path).convert("RGB")
    enc_image = model.encode_image(image)

    # English prompts usually give better results with Vision LLMs
    prompt = """Analyze this UI error screenshot. Extract:
1. Error type (validation/network/permission/crash/other)
2. Error message text (exact if visible)
3. HTTP status code or error code if visible
4. Which UI component has the error (button/form field/page/modal)
5. Any stack trace or technical details visible

Respond in JSON format:
{"error_type": "", "error_message": "", "error_code": "", "component": "", "details": ""}"""

    result = model.answer_question(enc_image, prompt, tokenizer)

    # Parse JSON from the response
    try:
        # Locate the JSON in the response (the model sometimes adds extra text)
        start = result.find("{")
        end = result.rfind("}") + 1
        if start != -1 and end > start:
            return json.loads(result[start:end])
    except json.JSONDecodeError:
        pass

    # Fallback: return the raw text if JSON parsing fails
    return {"raw_analysis": result}


# Test
if __name__ == "__main__":
    result = analyze_error_screenshot("error_screenshot.png")
    print(json.dumps(result, indent=2, ensure_ascii=False))
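
The first-brace-to-last-brace slice in the script works when the model wraps a single JSON object in prose, but it falls back to raw text if the trailing chatter itself contains braces. A slightly more tolerant sketch uses `json.JSONDecoder.raw_decode` from the standard library, which parses one JSON value starting at a given index and ignores whatever follows. The helper name `extract_first_json_object` is an assumption for illustration.

```python
import json

def extract_first_json_object(text: str):
    """Return the first decodable JSON object found in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and does not care about trailing characters.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON here; try the next opening brace.
        start = text.find("{", start + 1)
    return None

noisy = 'Sure! {"error_type": "network", "error_code": "502"} Hope that helps {not json}'
print(extract_first_json_object(noisy))
```

If nothing decodes, the function returns None, so the caller can still fall back to the raw text the way the script above does.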

Step 3: Integrate with Ollama to use LLaVA (optional)

If you already have Ollama running, using LLaVA through its local API is much simpler than loading the model by hand:

# Pull the LLaVA model
ollama pull llava:7b

import requests
import base64
import json
from pathlib import Path

def analyze_with_ollama(image_path: str, model: str = "llava:7b") -> dict:
    """Analyze an error screenshot through the Ollama API."""
    # Encode the image as base64
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")

    prompt = """Look at this UI error screenshot carefully.
Extract the following information and return ONLY valid JSON:
{
  "error_type": "validation|network|permission|crash|unknown",
  "error_message": "exact error text visible in the image",
  "error_code": "HTTP status or error code if visible, else null",
  "affected_component": "which UI element has the error",
  "suggested_cause": "brief technical cause based on what you see"
}"""

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "images": [image_data],
            "stream": False,
            "format": "json"  # Ask Ollama to constrain the output to JSON
        },
        timeout=60
    )
    response.raise_for_status()
    raw = response.json()["response"]  # The generated text, still a string

    # Decode the JSON string so the function returns a dict as annotated
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"raw_analysis": raw}

# Usage
result = analyze_with_ollama("ui_error.png")
print(result)
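
Before sending a batch of images, it can help to confirm that Ollama is reachable and the model has actually been pulled. The sketch below queries Ollama's `/api/tags` endpoint with the standard library only. The helper names are hypothetical, and the response shape (a `models` list whose items have a `name` field) reflects my understanding of the Ollama API, so verify it against your installed version.

```python
import json
import urllib.request

def fetch_ollama_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from Ollama's /api/tags."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)

def is_model_pulled(tags: dict, model: str) -> bool:
    """Check whether `model` (for example "llava:7b") appears in a tags response."""
    names = {entry.get("name") for entry in tags.get("models", [])}
    return model in names

# Offline example with a hand-written response in the assumed shape
example_tags = {"models": [{"name": "llava:7b"}, {"name": "moondream:latest"}]}
print(is_model_pulled(example_tags, "llava:7b"))
```

In real use you would call `is_model_pulled(fetch_ollama_tags(), "llava:7b")` once at startup and stop early with a clear message instead of failing on the first image.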

Step 4: Build a batch pipeline for error screenshots

Testers sending a whole folder of 20 to 30 screenshots at once is normal. The script below processes them all in one pass and writes a JSON report:

import json
from pathlib import Path
from datetime import datetime

# Assumes analyze_with_ollama from Step 3 is defined in (or imported into) this module

def process_error_screenshots_folder(folder_path: str, output_file: str = "bug_report.json"):
    """Process every error screenshot in a folder and write a JSON report."""
    folder = Path(folder_path)
    image_extensions = {".png", ".jpg", ".jpeg", ".webp"}
    results = []

    # Sort so the report order is stable between runs
    screenshots = sorted(f for f in folder.iterdir() if f.suffix.lower() in image_extensions)
    print(f"Found {len(screenshots)} images to analyze...")

    for idx, img_path in enumerate(screenshots, 1):
        print(f"[{idx}/{len(screenshots)}] Analyzing: {img_path.name}")
        try:
            analysis = analyze_with_ollama(str(img_path))
            # The analysis may come back as a raw JSON string; decode it if so
            if isinstance(analysis, str):
                analysis = json.loads(analysis)

            results.append({
                "file": img_path.name,
                "analyzed_at": datetime.now().isoformat(),
                **analysis
            })
        except Exception as e:
            # Record the failure and keep going with the remaining images
            results.append({
                "file": img_path.name,
                "error": f"Analysis failed: {e}"
            })

    # Write the report, creating the output directory if it does not exist yet
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print(f"\nDone! Report saved to: {output_file}")

    # Quick stats (failed analyses have no error_type and count as "unknown")
    error_types = {}
    for r in results:
        et = r.get("error_type", "unknown")
        error_types[et] = error_types.get(et, 0) + 1

    print("\nError type breakdown:")
    for et, count in sorted(error_types.items(), key=lambda x: -x[1]):
        print(f"  {et}: {count}")

# Run
process_error_screenshots_folder("./screenshots", "./bug_reports/sprint_42.json")
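
When a folder holds 20 to 30 screenshots, several of them often show the same error. A possible follow-up step is to group report entries that share type, code, and message, so one ticket covers all of them. This is a sketch: `group_duplicates` is a hypothetical helper, and the sample entries are made up in the shape the batch script writes.

```python
from collections import defaultdict

def group_duplicates(results: list) -> dict:
    """Group report entries that share error type, code, and message."""
    groups = defaultdict(list)
    for entry in results:
        key = (
            str(entry.get("error_type") or "unknown"),
            str(entry.get("error_code") or ""),
            # Case and surrounding whitespace vary between model runs.
            str(entry.get("error_message") or "").strip().lower(),
        )
        groups[key].append(entry.get("file", "?"))
    return dict(groups)

# Made-up sample entries, including one failed analysis
sample = [
    {"file": "a.png", "error_type": "permission", "error_code": "403", "error_message": "Forbidden"},
    {"file": "b.png", "error_type": "permission", "error_code": 403, "error_message": " forbidden "},
    {"file": "c.png", "error": "Analysis failed: timeout"},
]
groups = group_duplicates(sample)
for key, files in groups.items():
    print(key, files)
```

Exact-match grouping is deliberately simple; model output that paraphrases the same message differently would still land in separate groups.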

Practical tips from experience

After about two months of running this for real, processing a few hundred screenshots from testers, I collected a few things the docs never say outright:

Prompt in English for better results: Vision LLMs are trained mostly on English data. In my runs, English prompts produced about 20 to 30% more accurate output, even when the screenshot showed a Vietnamese UI.
Always ask for JSON output: Don't let the model return free text and then parse it by hand. With Ollama, use "format": "json". With Moondream, embed the JSON structure in the prompt.
Resize images before sending: A 4K retina screenshot does not help the model understand any better, but it makes processing 3 to 4 times slower. Resizing to 1280px wide is enough: in a test with 50 images, total time dropped from ~8 minutes to ~2 minutes.
Cache the model in memory: If you process many images, load the model once and reuse it. Each Moondream reload adds 4 to 5 seconds of startup, which adds up quickly.

# Resize the image before analysis
from PIL import Image

def preprocess_screenshot(image_path: str, max_width: int = 1280) -> Image.Image:
    img = Image.open(image_path).convert("RGB")
    if img.width > max_width:
        ratio = max_width / img.width
        new_size = (max_width, int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
    return img
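
The "load once, reuse" tip from the list above can be sketched with `functools.lru_cache`. To keep the example runnable without downloading anything, `_load_model_from_disk` is a placeholder that only simulates a slow load; in a real script it would wrap the `from_pretrained` call from Step 2. Both function names are assumptions.

```python
from functools import lru_cache
import time

def _load_model_from_disk(model_id: str) -> dict:
    """Placeholder for the real, slow load (for example from_pretrained)."""
    time.sleep(0.01)  # stand-in for the multi-second startup cost
    return {"model_id": model_id, "loaded_at": time.monotonic()}

@lru_cache(maxsize=1)
def get_model(model_id: str = "vikhyatk/moondream2") -> dict:
    """Load the model on the first call and return the cached object afterwards."""
    return _load_model_from_disk(model_id)

first = get_model()
second = get_model()
print(first is second)            # the same object is reused
print(get_model.cache_info())     # shows one miss followed by one hit
```

One caveat: `lru_cache` keys on the arguments exactly as passed, so call `get_model()` the same way everywhere. Mixing `get_model()` and `get_model("vikhyatk/moondream2")` creates two cache keys and, with `maxsize=1`, triggers a reload.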

Integrating into a real workflow

The best question from teammates after the demo was: “Where can we plug this in?” I have tried a few places, and they work well:

Jira/Linear webhook: When a tester uploads an image to a ticket, automatically run the analysis and fill in the “Error Type” and “Error Code” fields
Slack bot: A tester posts an image in the #bugs channel and the bot replies with a summary of the analysis
CI/CD pipeline: After each E2E test run, if there are failure screenshots, automatically analyze them and attach the results to the test report
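Playwright/Cypress: Hook into the test-failure callback (onTestFailed or your runner's equivalent, such as onTestEnd in a Playwright reporter) and call the Vision LLM to describe the failure in the report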
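
Whichever integration you pick, the analysis dict usually has to become a short line of text for a ticket field or a bot reply. A minimal sketch, assuming the keys used in the prompts earlier (`error_code`, `error_message`, `component` or `affected_component`, `error_type`); the function name `format_bug_summary` is hypothetical.

```python
def format_bug_summary(analysis: dict) -> str:
    """Turn an analysis dict into a one-line summary for a ticket or chat reply."""
    code = analysis.get("error_code")
    message = analysis.get("error_message")
    # Headline: code and message when present, otherwise a neutral placeholder.
    headline = " ".join(str(part) for part in (code, message) if part)
    parts = [headline or "Unrecognized error"]

    # The Moondream prompt uses "component", the Ollama prompt "affected_component".
    component = analysis.get("affected_component") or analysis.get("component")
    if component:
        parts.append(f"component: {component}")

    error_type = analysis.get("error_type")
    if error_type and error_type != "unknown":
        parts.append(f"type: {error_type}")

    return " | ".join(parts)

summary = format_bug_summary({
    "error_code": "HTTP 403",
    "error_message": "Forbidden",
    "affected_component": "Submit button",
    "error_type": "permission",
})
print(summary)
```

Keeping the formatting in one function means the Slack bot, the Jira webhook, and the CI report all show the same wording.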

What I like most about this approach is not the model's accuracy but the way it changes how the team communicates. Instead of “it's broken”, you get: “HTTP 403 Forbidden at endpoint /api/orders, on the checkout screen when clicking the Submit button”. That alone saves at least 30 minutes per sprint, counting only the time spent asking for more information.

A next step if you want to go further: combine a Vision LLM with RAG to automatically suggest fixes based on similar errors handled before. But that is a story for another post.