Using Vision LLMs (Moondream, LLaVA) for Automated UI Error Screenshot Analysis — A Practical Guide – ITFROMZERO

I once spent nearly 2 hours debugging a UI error because a tester sent a screenshot with the description “it’s broken” — no stack trace, no logs, just an image. That’s when I started looking into automating UI error analysis with Vision LLMs.

Vision LLMs are no longer just for identifying cats and dogs. You can describe your requirements in natural language, for example “list all UI errors in this image, return JSON”, and the model returns a structured answer you can parse. Moondream 2, for example, is only ~2GB but can read error messages, HTTP status codes, and even the disabled/enabled state of individual buttons in a screenshot.

Comparing Approaches for UI Error Screenshot Analysis

There are 3 common approaches — I’ll lay out the pros and cons of each so you can decide for yourself:

Approach 1: Traditional OCR (Tesseract, EasyOCR)

Use OCR to extract text from images, then parse that text. Fast, fully offline, no API costs.

Pros: Lightweight, free, no powerful GPU required
Cons: Only reads plain text, no context understanding. Can’t distinguish between “this button is disabled”, “this is an error message”, or “this is a regular label”. Low-quality images or unusual fonts will cause failures.
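
To make that limitation concrete, here is a minimal sketch of the "parse that text" half of this approach. It assumes the OCR step has already produced a plain string; the sample text, the keyword list, and the function name parse_ocr_text are all hypothetical.

```python
import re

# Matches standalone three-digit codes in the 4xx/5xx range
HTTP_STATUS_RE = re.compile(r"\b([45]\d{2})\b")
# Hypothetical keyword list; a real project would tune this to its own UI copy
ERROR_KEYWORDS = ("error", "failed", "forbidden", "denied", "invalid")


def parse_ocr_text(ocr_text: str) -> dict:
    """Pull status codes and error-looking lines out of raw OCR text."""
    status_codes = HTTP_STATUS_RE.findall(ocr_text)
    error_lines = [
        line.strip()
        for line in ocr_text.splitlines()
        if any(keyword in line.lower() for keyword in ERROR_KEYWORDS)
    ]
    return {"status_codes": status_codes, "error_lines": error_lines}


# Sample OCR output (made up for illustration)
sample = "Checkout\nSubmit\n403 Forbidden\nYou do not have access to this order"
parsed = parse_ocr_text(sample)
print(parsed)
```

The text comes out fine, but nothing in the result says whether "Submit" is a disabled button or which element the 403 belongs to. That missing context is exactly what the other approaches try to recover.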

Approach 2: Cloud Vision APIs (Google Vision, AWS Rekognition, Azure Computer Vision)

Call cloud provider APIs to analyze images. Better results than plain OCR — can detect text, objects, and labels.

Pros: High accuracy, no model setup required
Cons: Costs money per call (Google Vision ~$1.50/1000 images). Images must be sent to the cloud, which is a concern with sensitive data. Still poor at understanding UI context.

Approach 3: Vision LLMs (Moondream, LLaVA, GPT-4 Vision)

Use large language models with image understanding capability. Describe your requirements in English, and the model returns structured analysis.

Pros: Excellent UI context understanding — can distinguish error messages, button states, form validation, modal dialogs. Can return structured JSON. Can run locally (Moondream, LLaVA).
Cons: Larger models require more RAM/VRAM. Slower than plain OCR when running locally on CPU.

When Should You Choose a Vision LLM?

If you just need to extract plain text from good-quality images, OCR is sufficient. But for UI error screenshots, you need more:

Distinguish error types: validation error, network error, permission error, crash
Identify the failing component: which form field, which button, which API endpoint
Extract error codes, HTTP statuses, stack traces if visible in the image
Describe context: which screen the user was on, what they were doing before the error occurred

This is a context-understanding problem, not just text reading — and Vision LLMs are significantly better at it.
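
One way to pin down what "more" means is to write the target record as a data structure before touching any model. The sketch below is an assumption of mine, not a fixed schema: the class name UIErrorReport, its field names, and the normalization rules are hypothetical and only mirror the requirements listed above.

```python
from dataclasses import dataclass, asdict
from typing import Optional

# Error categories taken from the list above, plus a catch-all
ERROR_TYPES = {"validation", "network", "permission", "crash", "unknown"}


@dataclass
class UIErrorReport:
    """Hypothetical record for one analyzed screenshot."""
    error_type: str = "unknown"
    component: Optional[str] = None
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    context: Optional[str] = None

    @classmethod
    def from_model_output(cls, data: dict) -> "UIErrorReport":
        """Build a report from a loosely structured dict, tolerating gaps."""
        error_type = str(data.get("error_type") or "unknown").strip().lower()
        if error_type not in ERROR_TYPES:
            error_type = "unknown"

        def clean(key: str) -> Optional[str]:
            # Treat missing, null, and blank values the same way
            value = data.get(key)
            if value is None:
                return None
            text = str(value).strip()
            return text or None

        return cls(
            error_type=error_type,
            component=clean("component"),
            error_code=clean("error_code"),
            error_message=clean("error_message"),
            context=clean("context"),
        )


report = UIErrorReport.from_model_output(
    {"error_type": "Permission", "error_code": 403, "component": "Submit button"}
)
print(asdict(report))
```

Normalizing into a fixed set of categories keeps downstream reports consistent even when a model invents its own labels.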

Which Model to Choose: Moondream vs LLaVA vs GPT-4 Vision?

Model	Runs Locally	RAM Required	Accuracy	Speed (CPU)
Moondream 2	✅	~2GB	Pretty good	Fastest
LLaVA 7B	✅	~8GB	Good	Moderate
LLaVA 13B	✅	~16GB	Very good	Slow
GPT-4 Vision	❌ (API only)	N/A	Best	Fast (API)

For production environments with sensitive data: Moondream 2 is a great starting point — compact, runs on a standard CPU, and good enough for UI screenshots. Go with LLaVA 7B if you have a GPU or need higher accuracy.

Implementation Guide: Analyzing UI Error Screenshots with Moondream

Step 1: Set Up the Environment

# Create a virtualenv
python -m venv venv
source venv/bin/activate  # Linux/Mac
# venv\Scripts\activate   # Windows

# Install dependencies
pip install transformers torch pillow einops

If you don’t have a GPU, install the CPU-only build of torch before the other dependencies to keep things lighter; pip will then reuse it instead of downloading the default build:

pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

Step 2: Basic UI Error Screenshot Analysis Script

from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import json

# Load model (first run will download ~2GB)
model_id = "vikhyatk/moondream2"
revision = "2025-01-09"  # Pin to a specific revision to avoid breaking changes

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    revision=revision
)

def analyze_error_screenshot(image_path: str) -> dict:
    """Analyze a UI error screenshot and return a structured dict."""
    image = Image.open(image_path).convert("RGB")
    enc_image = model.encode_image(image)

    # English prompts tend to produce better results with Vision LLMs
    prompt = """Analyze this UI error screenshot. Extract:
1. Error type (validation/network/permission/crash/other)
2. Error message text (exact if visible)
3. HTTP status code or error code if visible
4. Which UI component has the error (button/form field/page/modal)
5. Any stack trace or technical details visible

Respond in JSON format:
{"error_type": "", "error_message": "", "error_code": "", "component": "", "details": ""}"""

    result = model.answer_question(enc_image, prompt, tokenizer)

    # Parse JSON from the response
    try:
        # Find JSON in the response (model sometimes adds extra text)
        start = result.find("{")
        end = result.rfind("}") + 1
        if start != -1 and end > start:
            return json.loads(result[start:end])
    except json.JSONDecodeError:
        pass

    # Fallback: return raw text if JSON parsing fails
    return {"raw_analysis": result}


# Test
if __name__ == "__main__":
    result = analyze_error_screenshot("error_screenshot.png")
    print(json.dumps(result, indent=2, ensure_ascii=False))
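
The first-brace-to-last-brace slicing in the script can fail when the model adds text containing braces after the JSON. A slightly more defensive variant, sketched here with only the standard library, uses json.JSONDecoder.raw_decode to parse the first valid object and ignore the rest. The function name extract_first_json_object and the sample reply are hypothetical.

```python
import json


def extract_first_json_object(text: str):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    position = text.find("{")
    while position != -1:
        try:
            # raw_decode parses one JSON value starting at `position`
            # and ignores whatever follows it
            value, _ = decoder.raw_decode(text, position)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # No valid object starts at this brace; try the next one
        position = text.find("{", position + 1)
    return None


reply = 'Sure! {"error_type": "network", "error_code": "502"} Let me know {if} that helps.'
print(extract_first_json_object(reply))
```

When this returns None, falling back to the raw text as the script already does is still a sensible default.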

Step 3: Integrating with Ollama to Use LLaVA (Optional)

If you already have Ollama running, using LLaVA through its local API is even simpler than loading the model manually:

# Pull the LLaVA model
ollama pull llava:7b

import requests
import base64
import json
from pathlib import Path

def analyze_with_ollama(image_path: str, model: str = "llava:7b") -> dict:
    """Analyze an error screenshot via the Ollama API."""
    # Encode the image as base64
    image_data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")

    prompt = """Look at this UI error screenshot carefully.
Extract the following information and return ONLY valid JSON:
{
  "error_type": "validation|network|permission|crash|unknown",
  "error_message": "exact error text visible in the image",
  "error_code": "HTTP status or error code if visible, else null",
  "affected_component": "which UI element has the error",
  "suggested_cause": "brief technical cause based on what you see"
}"""

    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "images": [image_data],
            "stream": False,
            "format": "json"  # Force Ollama to output JSON
        },
        timeout=60
    )
    response.raise_for_status()
    # The "response" field is a JSON string; parse it so the function returns a dict
    return json.loads(response.json()["response"])

# Usage
result = analyze_with_ollama("ui_error.png")
print(result)

Step 4: Building a Batch Processing Pipeline for Error Screenshots

Testers sending an entire folder of 20–30 images at once is completely normal. The script below processes all of them in one pass and exports a JSON report:

import json
from pathlib import Path
from datetime import datetime

def process_error_screenshots_folder(folder_path: str, output_file: str = "bug_report.json"):
    """Process all error screenshots in a folder and export a JSON report."""
    folder = Path(folder_path)
    image_extensions = {".png", ".jpg", ".jpeg", ".webp"}
    results = []

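    # Sort for a stable processing order and skip anything that is not a regular file
    screenshots = sorted(
        f for f in folder.iterdir()
        if f.is_file() and f.suffix.lower() in image_extensions
    )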
    print(f"Found {len(screenshots)} images to analyze...")

    for idx, img_path in enumerate(screenshots, 1):
        print(f"[{idx}/{len(screenshots)}] Analyzing: {img_path.name}")
        try:
            analysis = analyze_with_ollama(str(img_path))
            if isinstance(analysis, str):
                analysis = json.loads(analysis)

            results.append({
                "file": img_path.name,
                "analyzed_at": datetime.now().isoformat(),
                **analysis
            })
        except Exception as e:
            results.append({
                "file": img_path.name,
                "error": f"Analysis failed: {e}"
            })

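    # Export report (create the output directory first if it doesn't exist)
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)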

    print(f"\nDone! Report saved to: {output_file}")

    # Quick summary
    error_types = {}
    for r in results:
        et = r.get("error_type", "unknown")
        error_types[et] = error_types.get(et, 0) + 1

    print("\nError type breakdown:")
    for et, count in sorted(error_types.items(), key=lambda x: -x[1]):
        print(f"  {et}: {count} errors")

# Run
process_error_screenshots_folder("./screenshots", "./bug_reports/sprint_42.json")
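
One thing to watch in the quick summary: screenshots whose analysis failed have no error_type, so they get counted as "unknown" alongside real unknowns. A possible refinement, sketched below, keeps failures in their own bucket. The function name summarize_results is hypothetical, and it assumes failed entries are the ones carrying an "error" key, as in the script above.

```python
from collections import Counter


def summarize_results(results: list) -> dict:
    """Count error types, keeping failed analyses in their own bucket."""
    failed = [r for r in results if "error" in r]
    analyzed = [r for r in results if "error" not in r]
    by_type = Counter(r.get("error_type") or "unknown" for r in analyzed)
    return {
        "analyzed": len(analyzed),
        "failed": len(failed),
        "by_type": dict(by_type.most_common()),
    }


# Entries shaped like the ones the batch script appends (values made up)
sample_results = [
    {"file": "a.png", "error_type": "network"},
    {"file": "b.png", "error_type": "network"},
    {"file": "c.png", "error_type": "validation"},
    {"file": "d.png", "error": "Analysis failed: timeout"},
]
summary = summarize_results(sample_results)
print(summary)
```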

Practical Tips from Real-World Experience

After about 2 months running this in production — processing a few hundred screenshots from testers — I’ve accumulated a few things that nobody in the docs tells you straight:

English prompts yield better results: Vision LLMs are primarily trained on English data. In my tests, English prompts produced roughly 20–30% more accurate output, even when the UI in the screenshot was in another language.
Always request JSON output: Don’t let the model return free-form text that you then have to parse manually. With Ollama, use "format": "json". With Moondream, embed the JSON structure directly in the prompt.
Resize images before sending: A 4K retina display screenshot doesn’t help the model understand the content any better, but it triples or quadruples processing time. Resizing to 1280px width is sufficient. Tested across 50 images, processing time dropped from ~8 minutes to ~2 minutes.
Cache the model in memory: If you’re processing many images, load the model once and reuse it. Each Moondream reload adds 4–5 seconds of startup overhead, which adds up fast at scale.

# Resize image before analysis
from PIL import Image

def preprocess_screenshot(image_path: str, max_width: int = 1280) -> Image.Image:
    img = Image.open(image_path).convert("RGB")
    if img.width > max_width:
        ratio = max_width / img.width
        new_size = (max_width, int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)
    return img
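
For the "cache the model in memory" tip, a small sketch of one way to do it with functools.lru_cache. The loader here is a hypothetical stand-in that returns a placeholder object; in a real script it would wrap the actual model loading call.

```python
from functools import lru_cache


def load_model_from_disk(model_id: str):
    """Hypothetical stand-in for the slow step (e.g. a from_pretrained call)."""
    print(f"Loading {model_id}...")
    return {"model_id": model_id}  # placeholder object instead of real weights


@lru_cache(maxsize=1)
def get_model(model_id: str = "vikhyatk/moondream2"):
    """Load the model on first use, then return the same instance."""
    return load_model_from_disk(model_id)


first = get_model()
second = get_model()  # served from the cache, no second load
print(first is second)
```

Calling get_model() inside the analysis function then costs nothing after the first call, which matters in a long-running bot or webhook handler.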

Integrating Into a Real Workflow

The best question from teammates after the demo was: “Where can we plug this in?” I’ve tried a few places and they all work well:

Jira/Linear webhooks: When a tester uploads an image to a ticket, automatically trigger analysis and populate the “Error Type” and “Error Code” fields
Slack bot: Testers send a screenshot to the #bugs channel, the bot automatically replies with a summarized analysis
CI/CD pipeline: After each E2E test run, if there are screenshot failures, automatically analyze them and attach the results to the test report
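Playwright/Cypress: Hook into the framework’s test-failure event (for example, onTestEnd in a custom Playwright reporter, or the after:screenshot event in Cypress) and call the Vision LLM to auto-describe the failure in the report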

What I appreciate most about this approach isn’t the model’s accuracy — it’s that it changes the communication format across the team entirely. Instead of “it’s broken”, you receive: “HTTP 403 Forbidden at endpoint /api/orders, occurring on the checkout screen when clicking the Submit button”. That alone saves at least 30 minutes per sprint, just counting the back-and-forth asking for more information.
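
To get from the analysis dict to a sentence like that, a small formatter is usually enough. This is a rough sketch: the function name format_bug_summary is hypothetical, and it assumes the key names used in the prompts earlier in this post.

```python
def format_bug_summary(analysis: dict) -> str:
    """Turn an analysis dict into a one-line summary for a ticket or chat reply."""
    parts = []
    code = analysis.get("error_code")
    message = analysis.get("error_message")
    # The two prompts in this post use different key names for the component
    component = analysis.get("affected_component") or analysis.get("component")

    if code and message:
        parts.append(f"{code}: {message}")
    elif code or message:
        parts.append(str(code or message))
    else:
        parts.append("Unrecognized error")

    if component:
        parts.append(f"on {component}")
    if analysis.get("suggested_cause"):
        parts.append(f"(possible cause: {analysis['suggested_cause']})")
    return " ".join(parts)


example = {
    "error_type": "permission",
    "error_message": "Forbidden",
    "error_code": "HTTP 403",
    "affected_component": "Submit button on the checkout screen",
    "suggested_cause": None,
}
print(format_bug_summary(example))
```

The same string can go into a Slack reply, a ticket comment, or a test report, so the formatting logic lives in one place.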

The next step if you want to go further: combine Vision LLMs with RAG to automatically suggest fixes based on similar errors that have been resolved before — but that’s a story for another post.