Processing CSV and Excel Files with Python pandas: Comparing Approaches and Practical Implementation Guide

Python tutorial - IT technology blog

It started with a simple automation script: read a CSV exported from a monitoring system, filter servers with CPU > 80%, send alerts. Around 50 lines initially, using the built-in csv module, and it worked. Then requirements kept growing — process Excel files from the finance team, merge data from multiple sources, export monthly reports. By the time the script had ballooned to 2,000 lines and I was still using plain csv, I realized I’d been going down the wrong path.

It took nearly three days to refactor. After that, I stopped picking libraries out of habit. This article cuts straight to the practical question — not “what is pandas” but when to use what and how to use it correctly.

Comparing the Three Main Approaches

Python has three main libraries for this problem. Each fits a different situation:

Approach 1: Built-in csv Module (stdlib)

import csv

with open('servers.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if float(row['cpu_usage']) > 80:
            print(row['hostname'])

No installation needed, runs in any environment. But each row is a plain Python dict — if you want to calculate the average CPU across 10,000 rows, you have to write the loop yourself.
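To make that gap concrete, here is the "write the loop yourself" version of an average-CPU calculation with the csv module. The function name is mine; the file layout (a cpu_usage column) follows the servers.csv example above:

```python
import csv

def average_cpu(path: str) -> float:
    """Mean of the cpu_usage column, computed by hand with the stdlib csv module."""
    total = 0.0
    count = 0
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            total += float(row['cpu_usage'])
            count += 1
    return total / count if count else 0.0
```

With pandas the same thing is a one-liner: df['cpu_usage'].mean().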

Approach 2: pandas read_csv / read_excel

import pandas as pd

df = pd.read_csv('servers.csv')
high_cpu = df[df['cpu_usage'] > 80]
print(high_cpu[['hostname', 'cpu_usage']])

Same result in 3 lines instead of 6, and you can group, merge, and aggregate right after. This is the approach I use for most tabular data tasks.

Approach 3: Pure openpyxl / xlrd (Excel-only)

from openpyxl import load_workbook

wb = load_workbook('report.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
    hostname, cpu = row[0], row[1]
    print(hostname, cpu)

When you need to manipulate Excel formatting — cell colors, merged cells, charts — openpyxl is mandatory. pandas doesn’t preserve styles when reading Excel.

Trade-off Analysis

Criteria                  | csv module          | pandas                  | openpyxl
--------------------------|---------------------|-------------------------|----------------------
Installation              | None required       | pip install pandas      | pip install openpyxl
Large CSV (>1M rows)      | Good (streaming)    | Manageable (use chunks) | Not applicable
Computation & aggregation | Must write manually | Very powerful           | Weak
Preserve Excel formatting | No                  | No                      | Yes
Merging multiple files    | Manual              | pd.concat / merge       | Manual
Memory footprint          | Low                 | Higher                  | Medium

The key thing to remember about pandas: it loads the entire file into RAM. A 100MB CSV file actually consumes around 300–500MB — 3 to 5 times its size on disk. Under 200MB? Not worth worrying about. Multi-gigabyte files require chunking, or falling back to the csv module to stream line by line.
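You don’t have to guess at the 3–5x figure: memory_usage(deep=True) reports the actual in-RAM size, including string payloads. A small sketch (the column names are illustrative; object/string columns are the main source of the blow-up):

```python
import pandas as pd

df = pd.DataFrame({
    'hostname': [f'srv-{i:04d}' for i in range(10_000)],   # object dtype: Python strings, expensive
    'cpu_usage': [float(i % 100) for i in range(10_000)],  # float64: a flat 8 bytes per row
})

# deep=True counts the actual string contents, not just the pointers
mem_bytes = df.memory_usage(deep=True).sum()
print(f"{mem_bytes / 1024:.0f} KiB in RAM")
```

Run this on a sample of your real data before deciding whether you need chunking.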

Which Approach Should You Choose?

I’ve distilled the decision into four scenarios:

  • Just need to read row by row, simple processing, potentially very large files → use the csv module
  • Need filtering, computation, merging, report exports — files under 500MB → use pandas
  • Need to read/write Excel while preserving styles, colors, and charts → use openpyxl
  • Need both computation and polished Excel output → pandas handles the data, openpyxl handles the styling
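The fourth scenario in practice: pandas writes the data, then openpyxl reopens the file to add the formatting pandas cannot. A minimal sketch; the filename, sheet name, and the red fill on CPU > 80 are illustrative choices:

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

df = pd.DataFrame({
    'hostname': ['web-01', 'db-01', 'cache-01'],
    'cpu_usage': [45.0, 92.5, 83.1],
})

# Step 1: pandas handles the data
df.to_excel('cpu_report.xlsx', sheet_name='Servers', index=False)

# Step 2: openpyxl handles the styling
wb = load_workbook('cpu_report.xlsx')
ws = wb['Servers']
red = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
for row in ws.iter_rows(min_row=2):          # min_row=2 skips the header row
    if row[1].value and row[1].value > 80:   # column B = cpu_usage
        row[1].fill = red
wb.save('cpu_report.xlsx')
```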

Looking back at that 2,000-line script: rewriting it with pandas from scratch brought it down to around 400–500 lines. The moment your data has more than 3 columns and requires any kind of computation, starting with pandas upfront is far cheaper than refactoring later.

Implementation Guide with pandas

Installation

pip install pandas openpyxl xlrd
# openpyxl: required for .xlsx
# xlrd: required for .xls (legacy format)

Reading and Inspecting Data

import pandas as pd

# Read CSV
df = pd.read_csv('servers.csv', encoding='utf-8')

# Read Excel (specific sheet)
df_excel = pd.read_excel('report.xlsx', sheet_name='January')

# Quick inspection
print(df.shape)        # (row count, column count)
print(df.dtypes)       # data type of each column
print(df.head(5))      # first 5 rows
print(df.isnull().sum()) # count null values per column

Filtering and Processing Data

# Filter rows
high_cpu = df[df['cpu_usage'] > 80]

# Multiple conditions
alert_servers = df[(df['cpu_usage'] > 80) & (df['status'] == 'running')]

# Select specific columns
result = df[['hostname', 'cpu_usage', 'memory_usage']]

# Add a computed column
df['cpu_category'] = pd.cut(
    df['cpu_usage'],
    bins=[0, 50, 80, 100],
    labels=['normal', 'warning', 'critical']
)

# Group by and aggregate
summary = df.groupby('datacenter').agg({
    'cpu_usage': ['mean', 'max'],
    'hostname': 'count'
}).round(2)
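One gotcha with that groupby: passing a dict with multiple aggregations produces two-level column labels like ('cpu_usage', 'mean'), which are awkward to filter on or export. The flattening step I usually apply right after, sketched on a tiny inline DataFrame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'datacenter': ['dc1', 'dc1', 'dc2'],
    'cpu_usage': [50.0, 90.0, 70.0],
    'hostname': ['web-01', 'db-01', 'web-02'],
})

summary = df.groupby('datacenter').agg({
    'cpu_usage': ['mean', 'max'],
    'hostname': 'count',
}).round(2)

# Join the two column levels into flat names: cpu_usage_mean, cpu_usage_max, ...
summary.columns = ['_'.join(col) for col in summary.columns]
summary = summary.reset_index()  # turn the group key back into a regular column
print(summary.columns.tolist())  # ['datacenter', 'cpu_usage_mean', 'cpu_usage_max', 'hostname_count']
```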

Data Cleaning — The Step That’s Often Skipped

# Drop rows with null values in critical columns
df = df.dropna(subset=['hostname', 'cpu_usage'])

# Replace nulls with a default value
df['memory_usage'] = df['memory_usage'].fillna(0)

# Normalize data types
df['cpu_usage'] = pd.to_numeric(df['cpu_usage'], errors='coerce')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Remove duplicate rows
df = df.drop_duplicates(subset=['hostname'])

Merging Multiple Files

import glob

# Merge all CSVs in a directory
all_files = glob.glob('/data/logs/*.csv')
df_all = pd.concat([pd.read_csv(f) for f in all_files], ignore_index=True)

# Join two DataFrames on a key (like SQL JOIN)
df_merged = pd.merge(
    df_servers,
    df_owners,
    on='hostname',
    how='left'  # left, right, inner, outer
)

Exporting to CSV and Excel

# Write CSV
df.to_csv('output.csv', index=False, encoding='utf-8-sig')  # utf-8-sig for correct encoding in Excel

# Write Excel — multiple sheets
with pd.ExcelWriter('report.xlsx', engine='openpyxl') as writer:
    df_alert.to_excel(writer, sheet_name='Critical', index=False)
    summary.to_excel(writer, sheet_name='Summary')  # groupby result: keep the index; index=False raises on MultiIndex columns
    df_all.to_excel(writer, sheet_name='All Servers', index=False)

Handling Large Files: Chunking

Multi-gigabyte log files can’t be loaded entirely into RAM. pandas supports reading in chunks:

chunk_size = 50000  # 50k rows at a time
results = []

for chunk in pd.read_csv('large_log.csv', chunksize=chunk_size):
    # Process each chunk
    filtered = chunk[chunk['level'] == 'ERROR']
    results.append(filtered)

# Combine results
df_errors = pd.concat(results, ignore_index=True)

This keeps memory usage stable — instead of loading 2GB into RAM, you only need around 200MB at any given time.
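The same pattern works for aggregations: instead of collecting filtered rows, keep running totals across chunks so nothing accumulates. A sketch under the same assumptions (a large CSV with a numeric column; the function name and tiny chunk size are mine):

```python
import pandas as pd

def mean_of_column(path: str, col: str, chunk_size: int = 50_000) -> float:
    """Global mean of one CSV column, computed chunk by chunk with running totals."""
    total = 0.0
    count = 0
    # usecols restricts parsing to the one column we need, saving further memory
    for chunk in pd.read_csv(path, chunksize=chunk_size, usecols=[col]):
        total += chunk[col].sum()
        count += len(chunk)
    return total / count if count else 0.0
```

Memory stays proportional to one chunk regardless of how many rows the file holds.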

Patterns I Use Frequently in Automation

After refactoring that 2,000-line script, I distilled two helpers I reuse the most:

import pandas as pd
from pathlib import Path
from datetime import datetime

def load_csv_safe(filepath: str, required_cols: list) -> pd.DataFrame:
    """Read a CSV and validate that required columns are present."""
    df = pd.read_csv(filepath)
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

def export_timestamped(df: pd.DataFrame, prefix: str, output_dir: str = ".") -> str:
    """Export a CSV with a timestamp in the filename."""
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{prefix}_{ts}.csv"
    filepath = Path(output_dir) / filename
    df.to_csv(filepath, index=False, encoding='utf-8-sig')
    return str(filepath)

# Usage
df = load_csv_safe('servers.csv', ['hostname', 'cpu_usage'])
output = export_timestamped(df[df['cpu_usage'] > 80], 'high_cpu_alert')
print(f"Exported: {output}")

The old script used to die halfway through with a bare KeyError at float(row['cpu_usage']) whenever someone pasted the wrong column header into the CSV. With load_csv_safe(), that error is caught at the very first step, with a message that clearly identifies which column is missing. In a production environment, that kind of fail-fast behavior saves more debugging time than any other pattern.
