Automatically Generate Python Unit Tests with AI: A Practical Guide to Pynguin and CodiumAI for Higher Coverage

Artificial Intelligence tutorial - IT technology blog

Do It in 5 Minutes: Install and Run Pynguin

Say you have a Python file with a few calculation functions but no tests yet. Instead of sitting there wondering “what test cases do I need?”, let AI do that work first.

Install Pynguin:

pip install pynguin

Create a calculator.py file with some sample functions:

def add(a: int, b: int) -> int:
    return a + b

def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def is_palindrome(s: str) -> bool:
    s = s.lower().strip()
    return s == s[::-1]

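Run Pynguin to generate tests automatically. Pynguin executes the code under test, so it refuses to start unless the PYNGUIN_DANGER_AWARE environment variable is set; run it only on code you trust, ideally inside a container or virtual machine:

export PYNGUIN_DANGER_AWARE=1
pynguin \
  --project-path . \
  --module-name calculator \
  --output-path tests/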

After a few dozen seconds, Pynguin writes tests/test_calculator.py. The listing below is a tidied-up illustration of the result; actual test names, variable names, and input values differ from run to run:

import pytest
from calculator import add, divide, is_palindrome

def test_add_0():
    assert add(0, 0) == 0

def test_add_1():
    assert add(1, 2) == 3

def test_divide_0():
    assert divide(4.0, 2.0) == pytest.approx(2.0)

def test_divide_raises():
    with pytest.raises(ValueError):
        divide(1.0, 0.0)

def test_is_palindrome_0():
    assert is_palindrome("racecar") is True

def test_is_palindrome_1():
    assert is_palindrome("hello") is False

Run the tests and check coverage (the --cov options come from the pytest-cov plugin, installed with pip install pytest-cov):

pytest tests/ -v --cov=calculator --cov-report=term-missing

Done. For a small module of pure functions like this one, that takes you from zero tests to high coverage, often in the 80–90% range, in just a few minutes.

How Does Pynguin Work?

Pynguin uses a search-based software testing (SBST) algorithm — specifically a genetic algorithm that “evolves” test cases across multiple generations to maximize branch coverage. It doesn’t guess randomly; it optimizes toward a clear goal: covering as many code branches as possible.

What I find impressive in practice is that Pynguin is quite good at discovering edge cases — especially exception branches. It automatically tries passing None, negative numbers, empty strings, extremely large floats… to see how the function handles them. Several hidden type-checking bugs I’d encountered in legacy code were caught by Pynguin on the very first run.

Important: give your code type hints. Pynguin relies on type annotations to decide what kind of input values to generate. It can still run on unannotated functions, but without a: int or -> float annotations to guide it, the results are usually very limited.
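A rough illustration of why annotations matter: a generator can only pick sensible inputs if it knows the parameter types. The sketch below reads them with the standard library's typing.get_type_hints. The SAMPLE_VALUES table and the candidate_inputs helper are hypothetical and do not reflect how Pynguin is implemented internally.

```python
import inspect
from typing import get_type_hints

# Hypothetical pool of interesting values per type (edge cases first).
SAMPLE_VALUES = {
    int: [0, -1, 2**31],
    float: [0.0, -1.5, 1e308],
    str: ["", "racecar", " Aa "],
}

def divide(a: float, b: float) -> float:
    if b == 0:
        raise ValueError("Cannot divide by zero")
    return a / b

def untyped_divide(a, b):
    return a / b

def candidate_inputs(func) -> dict[str, list]:
    """Map each parameter to sample values chosen from its annotation."""
    hints = get_type_hints(func)
    params = inspect.signature(func).parameters
    # Parameters without an annotation get an empty pool: nothing to go on.
    return {name: SAMPLE_VALUES.get(hints.get(name), []) for name in params}

print(candidate_inputs(divide))          # float pools for a and b
print(candidate_inputs(untyped_divide))  # empty pools for a and b
```

With annotations, the toy generator has concrete values to try, including 0.0 for the divisor; without them it has nothing to start from.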

CodiumAI: When AI Understands Logic Instead of Just Measuring Coverage

Pynguin excels at structural coverage, but its tests are fairly “mechanical” — names like test_add_0, values that look meaningless like add(1234567, -9876543). This is where CodiumAI (now rebranded as Qodo) has the edge.

CodiumAI uses an LLM to understand a function’s business logic, generating semantically meaningful test cases — test names that describe behavior, scenarios that cover more realistic situations.

Install CodiumAI via VS Code

Find the “Qodo Gen” extension (formerly CodiumAI) in the VS Code Marketplace, install it, and sign up for a free account. Once installed, open a Python file and click on any function name — a “Generate Tests” button will appear just above it. Click it, and the AI analyzes the function and displays suggested test cases in a panel on the right.

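For the same divide() function above, CodiumAI generates tests along these lines (the exact output varies from run to run):

import pytest
from calculator import divide

class TestDivide: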
    def test_divide_positive_numbers(self):
        """Divide two positive numbers — result is accurate"""
        assert divide(10.0, 2.0) == 5.0

    def test_divide_negative_dividend(self):
        """Negative dividend still computes correctly"""
        assert divide(-10.0, 2.0) == -5.0

    def test_divide_by_zero_raises_value_error(self):
        """Dividing by zero must raise ValueError with the correct message"""
        with pytest.raises(ValueError, match="Cannot divide by zero"):
            divide(5.0, 0.0)

    def test_divide_result_precision(self):
        """Float result maintains correct precision"""
        result = divide(7.0, 2.0)
        assert result == pytest.approx(3.5)

Same function, similar scenarios, but with names that clearly describe each case, docstrings, and even validation of the error message. Drop these into a code review and your teammates can read them without further explanation.

Combining Both: A Practical Strategy

In my experience, the key skill is not choosing between Pynguin and CodiumAI but using each at the right stage:

  • Pynguin: use it when you need rapid coverage for a legacy codebase with no tests. A single run often gives you a baseline in the 60–80% range, depending on how much of the code is pure logic.
  • CodiumAI: use it for new features, when you need meaningful, reviewable tests you’re proud to push to CI/CD.

Step-by-Step Practical Workflow

# Step 1: Pynguin scans the entire module
pynguin \
  --project-path . \
  --module-name myapp.utils \
  --output-path tests/auto/ \
  --maximum-search-time 120

# Step 2: Check current coverage
pytest tests/ --cov=myapp --cov-report=html

# Step 3: Open htmlcov/index.html and see which branches are still red
# Step 4: Use CodiumAI to write targeted tests for those specific branches
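Step 3 can also be scripted. coverage.py can write a JSON report (with pytest-cov, pass --cov-report=json), and each file entry in that report includes a missing_lines list. The helper below ranks files by their number of uncovered lines. The function names are illustrative, and the inline sample report is made up to show the shape of the data rather than taken from a real run.

```python
import json
from pathlib import Path

def rank_uncovered(report: dict) -> list[tuple[str, int]]:
    """Return (file, number of missing lines), worst-covered file first."""
    counts = [
        (path, len(info.get("missing_lines", [])))
        for path, info in report.get("files", {}).items()
    ]
    return sorted(counts, key=lambda item: item[1], reverse=True)

def load_report(path: str = "coverage.json") -> dict:
    """Read the JSON report written by coverage.py."""
    return json.loads(Path(path).read_text())

# Made-up sample in the shape of a coverage.py JSON report.
sample_report = {
    "files": {
        "myapp/utils.py": {"missing_lines": [12, 13, 40]},
        "myapp/models.py": {"missing_lines": []},
        "myapp/services/payment.py": {"missing_lines": [7, 8, 9, 21, 22]},
    }
}

for path, missing in rank_uncovered(sample_report):
    print(f"{missing:3d} uncovered lines  {path}")
```

On a real project you would call rank_uncovered(load_report()) and start Step 4 with the files at the top of the list.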

Customizing Pynguin for Complex Code

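By default, Pynguin searches for up to 600 seconds per module; --maximum-search-time changes that budget. For modules with many branches, select the DYNAMOSA algorithm explicitly (recent Pynguin releases may already use it by default, so check pynguin --help for your version) and fix the seed so that runs are reproducible: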

pynguin \
  --project-path . \
  --module-name myapp.services.payment \
  --output-path tests/auto/ \
  --maximum-search-time 300 \
  --algorithm DYNAMOSA \
  --seed 42

DYNAMOSA (Dynamic Many-Objective Sorting Algorithm) performs better when a module has many nested conditional branches.

Practical Tips You Can’t Afford to Skip

1. Add type hints before running Pynguin. In practice this is a prerequisite for useful results. For legacy codebases without annotations, MonkeyType can record the types observed at runtime and write them back as annotations; afterwards, run mypy to check that the generated annotations are consistent.

2. Review carefully before committing AI-generated tests. Pynguin doesn’t understand business logic — it only records the current output of a function. If the function has a bug, the test will assert against that bug. I once had a case where all tests passed but the discount calculation logic was wrong; Pynguin still generated assert discount == 10 because that was the actual output at runtime.
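The pitfall is easy to reproduce. In this sketch the apply_discount function and its intended rule (10% off orders of 100 or more) are hypothetical, and the bug is deliberate. A test that merely records the current output passes, while a test written from the specification fails and exposes the defect.

```python
def apply_discount(total: float) -> float:
    """Intended rule (assumed spec): 10% off when total >= 100."""
    if total > 100:          # deliberate bug: the spec says >=
        return total * 0.9
    return total

def recorded_behavior_test() -> bool:
    """What a generator that only observes output would assert."""
    return apply_discount(100.0) == 100.0   # pins the buggy result

def specification_test() -> bool:
    """What a reviewer who knows the rule would assert."""
    return apply_discount(100.0) == 90.0

print("recorded-behavior test passes:", recorded_behavior_test())
print("specification test passes:", specification_test())
```

A green run of the generated suite therefore says "the code still does what it did", not "the code does what it should".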

3. Measure coverage before and after to see the impact clearly:

# Before
pytest --cov=myapp --cov-report=term | grep TOTAL
# TOTAL  342  289  15%

# After adding AI-generated tests
pytest --cov=myapp --cov-report=term | grep TOTAL
# TOTAL  342   62  82%
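The percentages in that output follow directly from the statement and miss counts: coverage = (statements - missed) / statements. A quick check of the two TOTAL lines above (the helper name is illustrative):

```python
def coverage_percent(statements: int, missed: int) -> float:
    """Share of statements executed by the test suite, in percent."""
    if statements == 0:
        return 100.0  # nothing to cover
    return 100.0 * (statements - missed) / statements

before = coverage_percent(342, 289)   # 53 of 342 statements executed
after = coverage_percent(342, 62)     # 280 of 342 statements executed
print(f"before: {before:.0f}%  after: {after:.0f}%  gain: {after - before:.0f} points")
```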

4. Mock external dependencies first. Both Pynguin and CodiumAI generate poor tests for functions that call databases or make HTTP requests. Separate pure logic from I/O, or use mocks:

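from unittest.mock import patch, MagicMock

# Assumption: fetch_user lives in myapp/services.py and calls requests.get
from myapp.services import fetch_user

def test_fetch_user_with_mock():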
    with patch("myapp.services.requests.get") as mock_get:
        mock_get.return_value = MagicMock(
            status_code=200,
            json=lambda: {"id": 1, "name": "Test User"}
        )
        result = fetch_user(1)
        assert result["name"] == "Test User"
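The other option mentioned above, separating pure logic from I/O, might look like the following sketch. The function names and the payload shape are hypothetical. The point is that format_user needs no mock at all, so a test generator can exercise it directly, while the thin load_user wrapper receives the HTTP call as an injected parameter.

```python
from typing import Callable

def format_user(payload: dict) -> dict:
    """Pure logic: normalize a raw API payload. No network access needed."""
    name = str(payload.get("name", "")).strip()
    return {"id": int(payload["id"]), "name": name or "Unknown"}

def load_user(user_id: int, http_get: Callable[[str], dict]) -> dict:
    """Thin I/O wrapper: http_get is injected, so tests can pass a fake."""
    raw = http_get(f"/users/{user_id}")
    return format_user(raw)

def fake_get(url: str) -> dict:
    """Stand-in for a real HTTP client, used only in tests."""
    return {"id": "1", "name": "  Test User "}

print(format_user({"id": 2, "name": ""}))
print(load_user(1, fake_get))
```

With this split, the generated tests concentrate on format_user, and only a small hand-written test is needed for the wrapper.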

5. CodiumAI generates better tests when functions have docstrings. When a function clearly documents its behavior and expected cases, the LLM understands the context and produces far more accurate, scenario-specific tests compared to undocumented functions.
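As an illustration, here is the earlier is_palindrome function with the kind of docstring that gives an LLM something to work with. The docstring wording is only an example; it describes what the code already does, including the fact that inner spaces and punctuation are not ignored.

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards.

    Comparison is case-insensitive and ignores leading/trailing
    whitespace. Inner spaces and punctuation are NOT ignored.

    Examples:
        >>> is_palindrome("Racecar")
        True
        >>> is_palindrome("  level ")
        True
        >>> is_palindrome("nurses run")
        False
        >>> is_palindrome("")
        True
    """
    s = s.lower().strip()
    return s == s[::-1]

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # the docstring examples double as executable checks
```

The examples in the docstring state the expected behavior explicitly, so generated tests can be reviewed against them instead of against whatever the function currently returns.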

The real strength of both tools isn’t to replace writing tests — it’s to help you get started quickly, cover the basics, and surface edge cases you hadn’t thought of. For critical business logic, you should still write and review tests by hand.
