Processing CSV and Excel Files with Python pandas: Comparing Approaches and Practical Implementation Guide

Python tutorial - IT technology blog

It started with a simple automation script: read a CSV exported from a monitoring system, filter servers with CPU > 80%, send alerts. Around 50 lines initially, using the built-in csv module, and it worked. Then requirements kept growing — process Excel files from the finance team, merge data from multiple sources, export monthly reports. By the time the script had ballooned to 2,000 lines and I was still using plain csv, I realized I’d been going down the wrong path.

It took nearly three days to refactor. After that, I stopped picking libraries out of habit. This article cuts straight to the practical question — not “what is pandas” but when to use what and how to use it correctly.

Comparing the Three Main Approaches

Python has three main libraries for this problem. Each fits a different situation:

Approach 1: Built-in csv Module (stdlib)

import csv

with open('servers.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if float(row['cpu_usage']) > 80:
            print(row['hostname'])

No installation needed, runs in any environment. But each row is a plain Python dict — if you want to calculate the average CPU across 10,000 rows, you have to write the loop yourself.
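To make that gap concrete, here is the "write the loop yourself" version of an average-CPU calculation with the csv module. The function name is mine; the file layout (a cpu_usage column) follows the servers.csv example above:

```python
import csv

def average_cpu(path: str) -> float:
    """Mean of the cpu_usage column, computed by hand with the stdlib csv module."""
    total = 0.0
    count = 0
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            total += float(row['cpu_usage'])
            count += 1
    return total / count if count else 0.0
```

With pandas the same thing is a one-liner: df['cpu_usage'].mean().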

Approach 2: pandas read_csv / read_excel

import pandas as pd

df = pd.read_csv('servers.csv')
high_cpu = df[df['cpu_usage'] > 80]
print(high_cpu[['hostname', 'cpu_usage']])

Same result in 3 lines instead of 6, and you can group, merge, and aggregate right after. This is the approach I use for most tabular data tasks.

Approach 3: Pure openpyxl / xlrd (Excel-only)

from openpyxl import load_workbook

wb = load_workbook('report.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
    hostname, cpu = row[0], row[1]
    print(hostname, cpu)

When you need to manipulate Excel formatting — cell colors, merged cells, charts — openpyxl is mandatory. pandas doesn’t preserve styles when reading Excel.

Trade-off Analysis

Criteria                  | csv module          | pandas                  | openpyxl
--------------------------|---------------------|-------------------------|----------------------
Installation              | None required       | pip install pandas      | pip install openpyxl
Large CSV (>1M rows)      | Good (streaming)    | Manageable (use chunks) | Not applicable
Computation & aggregation | Must write manually | Very powerful           | Weak
Preserve Excel formatting | No                  | No                      | Yes
Merging multiple files    | Manual              | pd.concat / merge       | Manual
Memory footprint          | Low                 | Higher                  | Medium

The key thing to remember about pandas: it loads the entire file into RAM. A 100MB CSV file actually consumes around 300–500MB — 3 to 5 times its size on disk. Under 200MB? Not worth worrying about. Multi-gigabyte files require chunking, or falling back to the csv module to stream line by line.
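You don’t have to guess at the 3–5x figure: memory_usage(deep=True) reports the actual in-RAM size, including string payloads. A small sketch (the column names are illustrative; object/string columns are the main source of the blow-up):

```python
import pandas as pd

df = pd.DataFrame({
    'hostname': [f'srv-{i:04d}' for i in range(10_000)],   # object dtype: Python strings, expensive
    'cpu_usage': [float(i % 100) for i in range(10_000)],  # float64: a flat 8 bytes per row
})

# deep=True counts the actual string contents, not just the pointers
mem_bytes = df.memory_usage(deep=True).sum()
print(f"{mem_bytes / 1024:.0f} KiB in RAM")
```

Run this on a sample of your real data before deciding whether you need chunking.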

Which Approach Should You Choose?

I’ve distilled the decision into four scenarios:

  • Just need to read row by row, simple processing, potentially very large files → use the csv module
  • Need filtering, computation, merging, report exports — files under 500MB → use pandas
  • Need to read/write Excel while preserving styles, colors, and charts → use openpyxl
  • Need both computation and polished Excel output → pandas handles the data, openpyxl handles the styling
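The fourth scenario in practice: pandas writes the data, then openpyxl reopens the file to add the formatting pandas cannot. A minimal sketch; the filename, sheet name, and the red fill on CPU > 80 are illustrative choices:

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import PatternFill

df = pd.DataFrame({
    'hostname': ['web-01', 'db-01', 'cache-01'],
    'cpu_usage': [45.0, 92.5, 83.1],
})

# Step 1: pandas handles the data
df.to_excel('cpu_report.xlsx', sheet_name='Servers', index=False)

# Step 2: openpyxl handles the styling
wb = load_workbook('cpu_report.xlsx')
ws = wb['Servers']
red = PatternFill(start_color='FFC7CE', end_color='FFC7CE', fill_type='solid')
for row in ws.iter_rows(min_row=2):          # min_row=2 skips the header row
    if row[1].value and row[1].value > 80:   # column B = cpu_usage
        row[1].fill = red
wb.save('cpu_report.xlsx')
```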

Looking back at that 2,000-line script: rewriting it with pandas from scratch brought it down to around 400–500 lines. The moment your data has more than 3 columns and requires any kind of computation, starting with pandas upfront is far cheaper than refactoring later.

Implementation Guide with pandas

Installation

pip install pandas openpyxl xlrd
# openpyxl: required for .xlsx
# xlrd: required for .xls (legacy format)

Reading and Inspecting Data

import pandas as pd

# Read CSV
df = pd.read_csv('servers.csv', encoding='utf-8')

# Read Excel (specific sheet)
df_excel = pd.read_excel('report.xlsx', sheet_name='January')

# Quick inspection
print(df.shape)        # (row count, column count)
print(df.dtypes)       # data type of each column
print(df.head(5))      # first 5 rows
print(df.isnull().sum()) # count null values per column

Filtering and Processing Data

# Filter rows
high_cpu = df[df['cpu_usage'] > 80]

# Multiple conditions
alert_servers = df[(df['cpu_usage'] > 80) & (df['status'] == 'running')]

# Select specific columns
result = df[['hostname', 'cpu_usage', 'memory_usage']]

# Add a computed column
df['cpu_category'] = pd.cut(
    df['cpu_usage'],
    bins=[0, 50, 80, 100],
    labels=['normal', 'warning', 'critical']
)

# Group by and aggregate
summary = df.groupby('datacenter').agg({
    'cpu_usage': ['mean', 'max'],
    'hostname': 'count'
}).round(2)
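One gotcha with that groupby: passing a dict with multiple aggregations produces two-level column labels like ('cpu_usage', 'mean'), which are awkward to filter on or export. The flattening step I usually apply right after, sketched on a tiny inline DataFrame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({
    'datacenter': ['dc1', 'dc1', 'dc2'],
    'cpu_usage': [50.0, 90.0, 70.0],
    'hostname': ['web-01', 'db-01', 'web-02'],
})

summary = df.groupby('datacenter').agg({
    'cpu_usage': ['mean', 'max'],
    'hostname': 'count',
}).round(2)

# Join the two column levels into flat names: cpu_usage_mean, cpu_usage_max, ...
summary.columns = ['_'.join(col) for col in summary.columns]
summary = summary.reset_index()  # turn the group key back into a regular column
print(summary.columns.tolist())  # ['datacenter', 'cpu_usage_mean', 'cpu_usage_max', 'hostname_count']
```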

Data Cleaning — The Step That’s Often Skipped

# Drop rows with null values in critical columns
df = df.dropna(subset=['hostname', 'cpu_usage'])

# Replace nulls with a default value
df['memory_usage'] = df['memory_usage'].fillna(0)

# Normalize data types
df['cpu_usage'] = pd.to_numeric(df['cpu_usage'], errors='coerce')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Remove duplicate rows
df = df.drop_duplicates(subset=['hostname'])

Merging Multiple Files

import glob

# Merge all CSVs in a directory
all_files = glob.glob('/data/logs/*.csv')
df_all = pd.concat([pd.read_csv(f) for f in all_files], ignore_index=True)

# Join two DataFrames on a key (like SQL JOIN)
df_merged = pd.merge(
    df_servers,
    df_owners,
    on='hostname',
    how='left'  # left, right, inner, outer
)

Exporting to CSV and Excel

# Write CSV
df.to_csv('output.csv', index=False, encoding='utf-8-sig')  # utf-8-sig for correct encoding in Excel

# Write Excel — multiple sheets
with pd.ExcelWriter('report.xlsx', engine='openpyxl') as writer:
    df_alert.to_excel(writer, sheet_name='Critical', index=False)
    summary.to_excel(writer, sheet_name='Summary')  # groupby result: keep the index; index=False raises on MultiIndex columns
    df_all.to_excel(writer, sheet_name='All Servers', index=False)

Handling Large Files: Chunking

Multi-gigabyte log files can’t be loaded entirely into RAM. pandas supports reading in chunks:

chunk_size = 50000  # 50k rows at a time
results = []

for chunk in pd.read_csv('large_log.csv', chunksize=chunk_size):
    # Process each chunk
    filtered = chunk[chunk['level'] == 'ERROR']
    results.append(filtered)

# Combine results
df_errors = pd.concat(results, ignore_index=True)

This keeps memory usage stable — instead of loading 2GB into RAM, you only need around 200MB at any given time.
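The same pattern works for aggregations: instead of collecting filtered rows, keep running totals across chunks so nothing accumulates. A sketch under the same assumptions (a large CSV with a numeric column; the function name and tiny chunk size are mine):

```python
import pandas as pd

def mean_of_column(path: str, col: str, chunk_size: int = 50_000) -> float:
    """Global mean of one CSV column, computed chunk by chunk with running totals."""
    total = 0.0
    count = 0
    # usecols restricts parsing to the one column we need, saving further memory
    for chunk in pd.read_csv(path, chunksize=chunk_size, usecols=[col]):
        total += chunk[col].sum()
        count += len(chunk)
    return total / count if count else 0.0
```

Memory stays proportional to one chunk regardless of how many rows the file holds.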

Patterns I Use Frequently in Automation

After refactoring that 2,000-line script, I distilled two helpers I reuse the most:

import pandas as pd
from pathlib import Path
from datetime import datetime

def load_csv_safe(filepath: str, required_cols: list) -> pd.DataFrame:
    """Read a CSV and validate that required columns are present."""
    df = pd.read_csv(filepath)
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

def export_timestamped(df: pd.DataFrame, prefix: str, output_dir: str = ".") -> str:
    """Export a CSV with a timestamp in the filename."""
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{prefix}_{ts}.csv"
    filepath = Path(output_dir) / filename
    df.to_csv(filepath, index=False, encoding='utf-8-sig')
    return str(filepath)

# Usage
df = load_csv_safe('servers.csv', ['hostname', 'cpu_usage'])
output = export_timestamped(df[df['cpu_usage'] > 80], 'high_cpu_alert')
print(f"Exported: {output}")

The old script used to die halfway through with a bare KeyError at float(row['cpu_usage']) whenever someone pasted the wrong column header into the CSV. With load_csv_safe(), that error is caught at the very first step, with a message that clearly identifies which column is missing. In a production environment, that kind of fail-fast behavior saves more debugging time than any other pattern.
