The Nightmare Named MemoryError
Back when I first started writing Python scripts to parse logs, I had the habit of storing everything in a list for convenience. Everything went smoothly until one day a system log file spiked to over 5GB. The result: the script ran for a few seconds and then dropped dead with a brief but haunting message: MemoryError.
The problem was that I tried to load the entire file content into RAM to process it all at once. For small datasets, this is extremely fast. But when facing server logs with tens of millions of lines, no amount of RAM is enough. That’s when I realized the value of processing data in a “rolling” fashion through Generators and Iterators.
Why Does Your RAM “Evaporate” So Quickly?
In reality, a Python list stores every one of its elements in memory simultaneously. Let’s try creating a list of 10 million integers:
# RAM-heavy approach
my_list = [i for i in range(10000000)] # Occupies about 80MB-400MB of RAM depending on the system
Python has to ask the operating system for enough memory to hold all 10 million integer objects, plus the list's internal array of pointers to them, all at once. If the server is already running many other services, this sudden memory grab can easily cause a system freeze. This is the "Eager Evaluation" mechanism—everything is computed and kept ready immediately, even before it's needed.
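You can watch the eager allocation happen with the standard-library tracemalloc module. A minimal sketch; the exact peak varies by Python version and platform, and 1 million elements are used here just to keep the demo quick:

```python
import tracemalloc

# Track every allocation Python makes from this point on
tracemalloc.start()
data = [i for i in range(1_000_000)]  # eager: all 1M int objects exist at once
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Peak includes the list's pointer array plus every int object it references
print(f"peak allocation: {peak / 1024 / 1024:.1f} MB")
```

Run it and you'll see tens of megabytes spent on what is conceptually just "the numbers 0 to 999,999".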
Iterator – The “Answer on Demand” Mechanism
Instead of bringing a whole basket of oranges home before eating, an Iterator is like going to the garden and picking exactly one fruit every time you want to eat. You don’t need a huge warehouse, and you don’t worry about oranges rotting from sitting too long.
An object is an Iterator if it implements the iterator protocol: the two methods __iter__() and __next__(). Each call to next() computes and returns the next value, and the object's internal state is preserved so it knows where to resume next time.
class MyCounter:
    def __init__(self, low, high):
        self.current = low
        self.high = high

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.high:
            raise StopIteration
        self.current += 1
        return self.current - 1

# Usage
counter = MyCounter(1, 10000000)
for num in counter:
    # Process num here, RAM usage remains extremely low
    pass
Even if you count to 1 billion, the script still only consumes a tiny amount of RAM to store the self.current variable. Your server will breathe a sigh of relief.
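The for loop is just sugar over this protocol. You can drive the same class by hand with the built-in next() and watch StopIteration do its job (the class is repeated here so the snippet runs on its own):

```python
class MyCounter:
    """Same iterator as above, reproduced so this snippet is self-contained."""
    def __init__(self, low, high):
        self.current = low
        self.high = high

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.high:
            raise StopIteration
        self.current += 1
        return self.current - 1

counter = MyCounter(1, 3)
print(next(counter))  # 1
print(next(counter))  # 2
print(next(counter))  # 3
try:
    next(counter)     # nothing left: the protocol signals the end
except StopIteration:
    print("exhausted")
```

A for loop does exactly this under the hood: it calls next() repeatedly and treats StopIteration as the signal to stop.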
Generator: The Secret Weapon with the yield Keyword
Writing an entire class just to make an Iterator is a bit cumbersome. Python provides a much more concise way: Generator. Instead of using return to exit a function, you just use yield.
When encountering yield, the function pauses and “freezes” its entire state right there. When called again, it “thaws” and runs from the line of code immediately following the yield. Extremely clever!
def my_generator(n):
    i = 0
    while i < n:
        yield i
        i += 1

# Only takes a few KB of RAM no matter how large n is
gen = my_generator(10000000)
Real-world Comparison of “Lightness”
Let’s use the sys module to check the actual object size. The result will surprise you:
import sys
n = 1000000
list_data = [i for i in range(n)]
gen_data = (i for i in range(n))
print(f"List consumes: {sys.getsizeof(list_data)} bytes (~8MB)")
print(f"Generator consumes: {sys.getsizeof(gen_data)} bytes (112 bytes)")
The list consumes over 70,000 times more memory than the Generator. If the data reaches billions of records, a Generator is the boundary between a smooth-running script and an OOM (Out Of Memory) server crash in the middle of the night.
Practical Application: Processing a 10GB Log File
In DevOps work, I frequently have to filter logs. Using f.readlines() makes life hard for yourself when the file is several GB, because it reads every line into one giant list up front. The right way is to leverage the fact that a Python file object is itself an Iterator.
def filter_error_logs(file_path):
    with open(file_path, "r") as f:
        for line in f:  # Read line by line, don't load the entire file into RAM
            if "ERROR" in line:
                yield line.strip()

# Process data pipeline
logs = filter_error_logs("huge_production.log")
for error in logs:
    # Send this error to Telegram or save to DB
    print(f"Error detected: {error}")
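You can verify for yourself that a file object really is its own iterator. This sketch writes a small throwaway log to a temp directory first so it runs anywhere (the name demo.log is just for illustration):

```python
import os
import tempfile

# Write a tiny throwaway log so the snippet is self-contained
path = os.path.join(tempfile.gettempdir(), "demo.log")
with open(path, "w") as f:
    f.write("INFO start\nERROR boom\nINFO done\n")

with open(path) as f:
    print(iter(f) is f)       # True: a file object is its own iterator
    print(next(f).strip())    # next() pulls exactly one line from disk
```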
Data is only read and processed when actually needed. This is called Lazy Evaluation—a type of laziness that is extremely beneficial for performance.
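Lazy evaluation is easy to demonstrate without a giant file. In this sketch a generator stands in for the file and announces every "read", so you can see that nothing happens until a value is actually requested (noisy_lines and filter_errors are illustrative names, not part of any library):

```python
def noisy_lines():
    # Stand-in for a file: each "read" announces itself so laziness is visible
    for line in ["INFO ok", "ERROR disk full", "INFO ok"]:
        print(f"reading: {line}")
        yield line

def filter_errors(lines):
    for line in lines:
        if "ERROR" in line:
            yield line

pipeline = filter_errors(noisy_lines())
print("pipeline built, nothing read yet")  # no "reading:" lines printed so far

first_error = next(pipeline)  # only NOW are lines pulled, one by one
print(f"got: {first_error}")
```

Building the pipeline prints nothing; the "reading:" messages only appear once next() demands a value, and they stop as soon as the first match is found.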
Generator Expression: Writing Code that is Both Concise and Elegant
If you’re used to List Comprehensions like [x for x in data], you just need to replace the square brackets with parentheses (x for x in data) to get a Generator Expression immediately.
# Calculate the sum of squares of 10 million numbers without consuming RAM
total = sum(x*x for x in range(10000000))
The sum() function pulls each squared value from the generator one at a time and adds it to the running total immediately. No intermediate list is ever created, so no resources are wasted.
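And sum() is not special here: any built-in that accepts an iterable, such as max() or any(), consumes a generator expression the same way, one value at a time. A small sketch; note that any() even stops early at the first match:

```python
nums = range(10_000_000)  # range is itself lazy, so this costs almost nothing

# Each consumer pulls squared values one at a time; no intermediate list exists
total = sum(x * x for x in nums)
largest = max(x * x for x in nums)
has_big = any(x * x > 10**12 for x in nums)  # short-circuits at the first hit

print(total, largest, has_big)
```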
Chaining Generators (Pipeline Pattern)
My favorite technique when handling ETL is chaining multiple generators. It’s like a production line, where each stage only processes one item at a time.
def get_lines(file_obj):
    yield from file_obj

def clean_lines(lines):
    for line in lines:
        yield line.strip()

def find_critical(lines):
    for line in lines:
        if "CRITICAL" in line:
            yield line

# Connect the stages
with open("app.log") as f:
    critical_issues = find_critical(clean_lines(get_lines(f)))
    for issue in critical_issues:
        print(f"Critical: {issue}")
This way of writing is extremely transparent. You can easily maintain or add processing steps without worrying about impacting RAM.
Closing Thoughts from Real-world Experience
After waking up at 2 AM many times to restart servers, I’ve derived a rule: Whenever working with a data stream of unknown size, use a Generator.
- Use List when: Data is small, you need random access (like getting the 10th element then going back to the 2nd), or you need to sort multiple times.
- Use Generator/Iterator when: Processing large data, reading files, streaming data from a database, or you only need to iterate through the data once.
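The "iterate once" caveat is worth seeing concretely: a generator is empty after its first pass, while a list can be walked again and again.

```python
gen = (x for x in range(3))
lst = [x for x in range(3)]

print(list(gen))  # first pass drains the generator
print(list(gen))  # second pass: nothing left, you get []

print(list(lst))  # a list is re-iterable...
print(list(lst))  # ...as many times as you like
```

If your code needs a second pass over the data, either materialize it into a list once or rebuild the generator.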
Memory optimization isn’t anything too mystical. Sometimes it just starts with changing a pair of [] to (). Hope these insights help you avoid MemoryError in your upcoming projects.

