What is Cython? A Guide to Compiling Python to C for 10-150x Speed Gains – ITFROMZERO

Python is slow — this is a truth most developers know but tend to ignore, until they run into a genuinely heavy computation problem.

I once wrote a script to process 100K records and calculate similarity scores between text strings. It took 8 minutes to finish. The client needed results in 30 seconds. After porting the computation to Cython, the same script completed in 19 seconds. Same logic, same output — just compiled to C.

Comparing Popular Python Speed-Up Approaches

Python has several different paths to better performance. Knowing which one fits which type of problem will save you from wasting time optimizing in the wrong place.

1. Better Algorithms + Python Built-ins

This is the first step to take before considering anything else. Built-ins like map(), filter(), sum(), and sorted() run their loops in C, and list comprehensions use specialized bytecode, so both are often faster than an equivalent hand-written for loop. The Python interpreter still executes every per-element operation, though, so the overhead shrinks rather than disappears.
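As a small illustration of the "better algorithm first" idea, here is a sketch with two hypothetical helper functions (not from any library) that compute the same result. The only change is swapping a list for a set, which turns a repeated linear scan into a hash lookup.

```python
def common_items_slow(left, right):
    """O(len(left) * len(right)): 'in' on a list scans the list every time."""
    return [x for x in left if x in right]


def common_items_fast(left, right):
    """O(len(left) + len(right)): membership in a set is a hash lookup."""
    right_set = set(right)
    return [x for x in left if x in right_set]


left = list(range(0, 2_000, 2))   # even numbers
right = list(range(0, 2_000, 3))  # multiples of 3

# Both versions must agree before any timing comparison is meaningful
assert common_items_slow(left, right) == common_items_fast(left, right)
print(common_items_fast(left, right)[:5])  # [0, 6, 12, 18, 24]
```

No compiler is involved here, which is why this kind of change is worth trying before reaching for Cython.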

2. NumPy / Pandas Vectorization

Instead of looping over each element, you perform operations on entire arrays at once. NumPy operations run in C under the hood, making them much faster than pure Python. This approach works well when your problem can be expressed as matrix or array operations.
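A minimal sketch of what "operating on the entire array at once" looks like, assuming NumPy is installed. The element-by-element loop and the single np.dot call compute the same sum of squares; only the second keeps the loop inside compiled code.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Element-by-element: the Python interpreter handles every iteration
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized: one call, and the loop runs inside NumPy's compiled code
vector_total = float(np.dot(values, values))

assert np.isclose(loop_total, vector_total)
print(vector_total)  # 332833500.0
```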

3. Cython — Compiling Python to C

Cython is a superset of Python: you write code that looks almost like regular Python, add some type annotations, and Cython compiles it into a C extension. The result is a module whose typed code paths run at close to C speed, yet it can still be imported from Python like any other module.

4. ctypes / Manual C Extensions

Use this when you already have a C library or want to write C yourself and call it from Python. This approach requires knowledge of C and manual memory management — a much higher barrier to entry than Cython.
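To give a feel for the ctypes route, here is a minimal sketch that calls strlen from the C standard library. It assumes a POSIX system (Linux or macOS), where ctypes.CDLL(None) exposes the symbols already loaded into the Python process; on Windows the library would have to be loaded differently.

```python
import ctypes

# Assumption: POSIX. CDLL(None) gives access to the C library that the
# running Python process has already loaded.
libc = ctypes.CDLL(None)

# The C signature has to be declared by hand: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"cython")
print(length)  # 6
```

Nothing checks the declared signature against the real one. A wrong argtypes entry can crash the interpreter instead of raising an exception, which is part of the higher barrier mentioned above.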

Pros and Cons of Each Approach

Optimized Pure Python

Pros: No extra tools needed, easy to debug, simple code
Cons: Still bottlenecked by the Python interpreter and the GIL
Real-world speedup: 2–5x over naive implementations

NumPy Vectorization

Pros: Easy to use when the problem fits, rich ecosystem
Cons: Complex logic with many branches is very hard to express as array operations
Real-world speedup: 10–100x for numeric operations

Cython

Pros: Keeps familiar Python syntax, impressive speedups for any kind of logic, can call C libraries directly
Cons: Requires an extra compile step, build environment setup, slightly harder to debug than pure Python
Real-world speedup: 10–150x depending on the level of type annotation

ctypes / Manual C Extensions

Pros: Absolute control, maximum performance
Cons: Requires C knowledge, prone to memory leaks, time-consuming to write
Real-world speedup: Comparable to Cython in most practical cases

When Should You Choose Cython?

Cython is a good fit when most of the following conditions apply:

You’ve profiled your code and identified the bottleneck as a specific CPU-bound function
That function has complex logic — deeply nested loops, many conditional branches — that’s hard to vectorize with NumPy
You want the speedup without learning C from scratch
You need to integrate with an existing C/C++ library

For purely numeric tasks like matrix multiplication or convolution, NumPy or PyTorch/JAX are better choices — these libraries are deeply optimized at the hardware level, and Cython can’t compete there. Cython shines most when the processing logic is complex, with many conditional branches that NumPy can’t express cleanly.
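To make "complex logic that is hard to vectorize" concrete, here is a hypothetical example: a plain Levenshtein edit-distance function, profiled with cProfile from the standard library. It is an illustration only, not the similarity code from the script described earlier. A function like this, with nested loops and a branch per cell, is the kind of hotspot worth porting once the profiler confirms it dominates the run time.

```python
import cProfile
import io
import pstats


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row; loop-heavy and CPU-bound."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def score_all(pairs):
    return [edit_distance(a, b) for a, b in pairs]


pairs = [("kitten", "sitting"), ("flaw", "lawn")] * 500

profiler = cProfile.Profile()
profiler.enable()
scores = score_all(pairs)
profiler.disable()

# Collect the five most expensive entries by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

If edit_distance sits at the top of the report, it is a candidate for a .pyx port. If the time is spread across I/O or many small functions, Cython is unlikely to help much.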

Step-by-Step Guide to Using Cython

Step 1: Installation

pip install cython numpy

# A C compiler is required:
# Ubuntu/Debian:
sudo apt install gcc python3-dev

# macOS:
xcode-select --install

# Windows: install Visual Studio Build Tools (select "Desktop development with C++")

Step 2: Write a Cython Module (.pyx)

A concrete example: a function that calculates the sum of squares of a number sequence — the kind of computation where pure Python is very slow on large data. Create a file called fast_math.pyx:

# fast_math.pyx

def sum_squares_python(numbers):
    """Pure Python version — for baseline comparison"""
    total = 0
    for x in numbers:
        total += x * x
    return total


def sum_squares_cython(list numbers):
    """Cython with type annotations — eliminates dynamic typing overhead"""
    cdef double total = 0.0
    cdef double x
    cdef int i
    cdef int n = len(numbers)

    for i in range(n):
        x = numbers[i]
        total += x * x

    return total


def sum_squares_array(double[:] arr):
    """Typed memoryview — direct access to numpy array memory"""
    cdef double total = 0.0
    cdef int i
    cdef int n = arr.shape[0]

    for i in range(n):
        total += arr[i] * arr[i]

    return total

Two key concepts to understand:

cdef double x: declares a C-typed variable, so arithmetic on it compiles to plain C operations instead of calls into Python's object machinery
double[:] arr: a typed memoryview, which reads a NumPy array's memory directly without going through the Python object layer
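Typed memoryviews build on Python's buffer protocol, the same mechanism behind the built-in memoryview type. The sketch below uses only the standard library to show the idea: a view onto a block of C doubles with no copy involved. It illustrates the concept, not Cython's own implementation.

```python
from array import array

# A contiguous block of C doubles owned by a Python object
values = array("d", [1.0, 2.0, 3.0])

# memoryview exposes that block through the buffer protocol without copying
view = memoryview(values)
print(view.format, view.itemsize, view.shape)  # d 8 (3,)

# Writes through the view land in the original buffer
view[0] = 10.0
print(values[0])  # 10.0
```

Because the mechanism is the buffer protocol and not anything NumPy-specific, a double[:] parameter also accepts other writable buffers of doubles, such as array.array("d", ...).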

Step 3: Create setup.py to Compile

# setup.py
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    name="fast_math",
    ext_modules=cythonize(
        "fast_math.pyx",
        compiler_directives={
            "language_level": "3",
            "boundscheck": False,   # Disable bounds checking — faster but requires valid indices
            "wraparound": False,    # Disable negative indexing (-1, -2...)
        }
    ),
    include_dirs=[np.get_include()],
)

Step 4: Compile

python setup.py build_ext --inplace

Once compiled, your directory will contain a new file: fast_math.cpython-3XX-linux-gnu.so (Linux) or a .pyd file (Windows). The exact tag in the name depends on your Python version and platform. The C extension is now ready, and you can import it like any other Python module.
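If you are unsure what the compiled file should be called on your machine, the standard library can tell you. This short sketch prints the suffix your interpreter expects for extension modules; fast_math is the module name from this guide.

```python
import importlib.machinery
import sysconfig

# Suffix the build step appends to "fast_math" for this interpreter
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")
print(f"expected file: fast_math{ext_suffix}")

# Every suffix the import system accepts for extension modules
print(importlib.machinery.EXTENSION_SUFFIXES)
```

A file built for a different Python version or platform carries a different suffix and will not be picked up by import.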

Step 5: Benchmark to See the Difference

# benchmark.py
import time
import numpy as np
import fast_math

data = list(range(1_000_000))
arr = np.array(data, dtype=np.float64)

def measure(label, fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return elapsed, result

t_py, r1 = measure("Python", fast_math.sum_squares_python, data)
t_cy, r2 = measure("Cython list", fast_math.sum_squares_cython, data)
t_arr, r3 = measure("Cython array", fast_math.sum_squares_array, arr)

print(f"Pure Python:      {t_py:.4f}s")
print(f"Cython (list):    {t_cy:.4f}s  →  {t_py/t_cy:.1f}x faster")
print(f"Cython (ndarray): {t_arr:.4f}s  →  {t_py/t_arr:.1f}x faster")

Actual results (Python 3.11, Intel i7):

Pure Python:      0.2847s
Cython (list):    0.0312s  →  9.1x faster
Cython (ndarray): 0.0018s  →  158.2x faster

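Same logic and the same result (up to floating-point rounding, since the Cython versions accumulate in a C double), but 158x faster using a typed memoryview. The 100K-record script from the introduction saw the same kind of gain at a smaller scale: from 8 minutes down to 19 seconds, roughly 25x.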
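A single timed run can be skewed by whatever else the machine is doing. As a more robust alternative, here is a sketch of a helper built on the standard timeit module that repeats the measurement and keeps the best run. The name best_of is hypothetical, and the pure Python function is repeated so the snippet runs on its own.

```python
import timeit


def best_of(fn, *args, repeat=5, number=3):
    """Return the fastest per-call time in seconds across several runs."""
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(timings) / number


def sum_squares_python(numbers):
    total = 0
    for x in numbers:
        total += x * x
    return total


data = list(range(100_000))
print(f"Pure Python: {best_of(sum_squares_python, data):.4f}s per call")
```

Taking the minimum instead of the mean filters out runs that were slowed by unrelated background activity. The same helper can wrap the compiled functions once fast_math is built.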

Bonus: Use Annotations to Find Optimization Opportunities

Cython includes a very useful tool: it can generate an HTML report showing where your code still has Python overhead (darker yellow = slower = needs more type annotations):

cython -a fast_math.pyx
# Open fast_math.html in a browser to view the report

Things to Keep in Mind Before Using Cython

Profile first, optimize second: Use cProfile or line_profiler to pinpoint the actual bottleneck. Optimizing the wrong place wastes time and delivers nothing.
Only effective for CPU-bound code: I/O-bound operations like file reads, API calls, and database queries will not get faster with Cython.
Build environment at deploy time: The server needs a C compiler, or you must pre-build wheel files. The cibuildwheel tool automates building for multiple platforms.
Keep the pure Python version: Always maintain the pure Python code for debugging and logic testing — the Cython version is just a compiled copy, not the only source of truth.

Cython is not a magic fix for every performance problem. But for compute-heavy tasks with complex logic, it’s the most practical tool for speeding things up — no need to rewrite your codebase, no need to learn a new language.