It started with a simple automation script: read a CSV exported from a monitoring system, filter servers with CPU > 80%, send alerts. Around 50 lines initially, using the built-in csv module, and it worked. Then requirements kept growing — process Excel files from the finance team, merge data from multiple sources, export monthly reports. By the time the script had ballooned to 2,000 lines and I was still using plain csv, I realized I’d been going down the wrong path.
It took nearly three days to refactor. After that, I stopped picking libraries out of habit. This article cuts straight to the practical question — not “what is pandas” but when to use what and how to use it correctly.
Comparing the Three Main Approaches
Python has three main libraries for this problem. Each fits a different situation:
Approach 1: Built-in csv Module (stdlib)
```python
import csv

with open('servers.csv', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if float(row['cpu_usage']) > 80:
            print(row['hostname'])
```
No installation needed, runs in any environment. But each row is a plain Python dict — if you want to calculate the average CPU across 10,000 rows, you have to write the loop yourself.
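To make that concrete, here is what computing an average looks like with the csv module — a minimal, self-contained sketch that first generates a tiny sample file (the `servers.csv` layout is the hypothetical one used throughout this article):

```python
import csv

# Create a tiny sample file so the sketch is self-contained
with open('servers.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerows([['hostname', 'cpu_usage'],
                      ['web-01', '85.0'],
                      ['web-02', '45.0']])

# Manual aggregation: with the csv module, every statistic is a hand-written loop
total = 0.0
count = 0
with open('servers.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        total += float(row['cpu_usage'])
        count += 1

average = total / count if count else 0.0
print(f"Average CPU: {average:.1f}%")  # Average CPU: 65.0%
```

With pandas, the same statistic is `df['cpu_usage'].mean()` — one expression instead of a loop plus bookkeeping variables.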
Approach 2: pandas read_csv / read_excel
```python
import pandas as pd

df = pd.read_csv('servers.csv')
high_cpu = df[df['cpu_usage'] > 80]
print(high_cpu[['hostname', 'cpu_usage']])
```
Same result in 3 lines instead of 6, and you can group, merge, and aggregate right after. This is the approach I use for most tabular data tasks.
Approach 3: Pure openpyxl / xlrd (Excel-only)
```python
from openpyxl import load_workbook

wb = load_workbook('report.xlsx')
ws = wb.active
for row in ws.iter_rows(min_row=2, values_only=True):
    hostname, cpu = row[0], row[1]
    print(hostname, cpu)
```
When you need to manipulate Excel formatting — cell colors, merged cells, charts — openpyxl is mandatory. pandas doesn’t preserve styles when reading Excel.
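As a quick sketch of the formatting work only openpyxl can do (the cell contents and colors here are illustrative, not part of the original script):

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

wb = Workbook()
ws = wb.active
ws.append(['hostname', 'cpu_usage'])
ws.append(['web-01', 92])

# Bold header row, red fill on the alert cell, and a merged footer row —
# none of this survives a round-trip through pandas
ws['A1'].font = Font(bold=True)
ws['B1'].font = Font(bold=True)
ws['B2'].fill = PatternFill(start_color='FFC7CE', end_color='FFC7CE',
                            fill_type='solid')
ws.merge_cells('A3:B3')
ws['A3'] = 'Generated by monitoring script'

wb.save('styled_report.xlsx')
```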
Trade-off Analysis
| Criteria | csv module | pandas | openpyxl |
|---|---|---|---|
| Installation | None required | pip install pandas | pip install openpyxl |
| Large CSV (>1M rows) | Good (streaming) | Manageable (use chunks) | Not applicable |
| Computation & aggregation | Must write manually | Very powerful | Weak |
| Preserve Excel formatting | No | No | Yes |
| Merging multiple files | Manual | pd.concat / merge | Manual |
| Memory footprint | Low | Higher | Medium |
The key thing to remember about pandas: it loads the entire file into RAM. A 100MB CSV file actually consumes around 300–500MB — 3 to 5 times its size on disk. Under 200MB? Not worth worrying about. Multi-gigabyte files require chunking, or falling back to the csv module to stream line by line.
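You can measure the real footprint yourself with `memory_usage(deep=True)`, which counts the Python string objects behind each column rather than just pointer sizes — a small sketch with made-up data:

```python
import pandas as pd

# Hypothetical frame: object (string) columns are what inflate RAM usage
df = pd.DataFrame({
    'hostname': [f'web-{i:04d}' for i in range(10_000)],
    'cpu_usage': [50.0] * 10_000,
})

# deep=True measures the actual string objects, not just the pointers
sizes = df.memory_usage(deep=True)
mb = sizes.sum() / 1024 ** 2
print(f"In-memory size: {mb:.2f} MB")
```

On this toy frame the string column alone dwarfs the float column, which is exactly why a text-heavy CSV blows up 3 to 5 times in RAM.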
Which Approach Should You Choose?
I’ve distilled the decision into four scenarios:
- Just need to read row by row, simple processing, potentially very large files → use the csv module
- Need filtering, computation, merging, report exports — files under 500MB → use pandas
- Need to read/write Excel while preserving styles, colors, and charts → use openpyxl
- Need both computation and polished Excel output → pandas handles the data, openpyxl handles the styling
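For that fourth scenario, one workable pattern is to let pandas write the sheet and then re-open the same file with openpyxl for styling — a sketch with hypothetical column names:

```python
import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font

# pandas handles the data...
df = pd.DataFrame({'hostname': ['web-01', 'db-01'],
                   'cpu_usage': [92.0, 41.0]})
df.to_excel('report.xlsx', sheet_name='Servers', index=False,
            engine='openpyxl')

# ...then openpyxl handles the styling on the same file
wb = load_workbook('report.xlsx')
ws = wb['Servers']
for cell in ws[1]:  # bold the header row
    cell.font = Font(bold=True)
wb.save('report.xlsx')
```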
Looking back at that 2,000-line script: rewriting it with pandas from scratch brought it down to around 400–500 lines. The moment your data has more than 3 columns and requires any kind of computation, starting with pandas upfront is far cheaper than refactoring later.
Implementation Guide with pandas
Installation
```shell
pip install pandas openpyxl xlrd
# openpyxl: required for .xlsx
# xlrd: required for .xls (legacy format)
```
Reading and Inspecting Data
```python
import pandas as pd

# Read CSV
df = pd.read_csv('servers.csv', encoding='utf-8')

# Read Excel (specific sheet)
df_excel = pd.read_excel('report.xlsx', sheet_name='January')

# Quick inspection
print(df.shape)           # (row count, column count)
print(df.dtypes)          # data type of each column
print(df.head(5))         # first 5 rows
print(df.isnull().sum())  # count of null values per column
```
Filtering and Processing Data
```python
# Filter rows
high_cpu = df[df['cpu_usage'] > 80]

# Multiple conditions
alert_servers = df[(df['cpu_usage'] > 80) & (df['status'] == 'running')]

# Select specific columns
result = df[['hostname', 'cpu_usage', 'memory_usage']]

# Add a computed column
df['cpu_category'] = pd.cut(
    df['cpu_usage'],
    bins=[0, 50, 80, 100],
    labels=['normal', 'warning', 'critical']
)

# Group by and aggregate
summary = df.groupby('datacenter').agg({
    'cpu_usage': ['mean', 'max'],
    'hostname': 'count'
}).round(2)
```
Data Cleaning — The Step That’s Often Skipped
```python
# Drop rows with null values in critical columns
df = df.dropna(subset=['hostname', 'cpu_usage'])

# Replace nulls with a default value
df['memory_usage'] = df['memory_usage'].fillna(0)

# Normalize data types
df['cpu_usage'] = pd.to_numeric(df['cpu_usage'], errors='coerce')
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Remove duplicate rows
df = df.drop_duplicates(subset=['hostname'])
```
Merging Multiple Files
```python
import glob

# Merge all CSVs in a directory
all_files = glob.glob('/data/logs/*.csv')
df_all = pd.concat([pd.read_csv(f) for f in all_files], ignore_index=True)

# Join two DataFrames on a key (like a SQL JOIN)
df_merged = pd.merge(
    df_servers,
    df_owners,
    on='hostname',
    how='left'  # left, right, inner, outer
)
```
Exporting to CSV and Excel
```python
# Write CSV — utf-8-sig adds a BOM so Excel detects the encoding correctly
df.to_csv('output.csv', index=False, encoding='utf-8-sig')

# Write Excel — multiple sheets
with pd.ExcelWriter('report.xlsx', engine='openpyxl') as writer:
    df_alert.to_excel(writer, sheet_name='Critical', index=False)
    # keep the index here: summary has MultiIndex columns from agg,
    # and pandas refuses index=False for that case
    summary.to_excel(writer, sheet_name='Summary')
    df_all.to_excel(writer, sheet_name='All Servers', index=False)
```
Handling Large Files: Chunking
Multi-gigabyte log files can’t be loaded entirely into RAM. pandas supports reading in chunks:
```python
chunk_size = 50_000  # 50k rows at a time
results = []

for chunk in pd.read_csv('large_log.csv', chunksize=chunk_size):
    # Process each chunk
    filtered = chunk[chunk['level'] == 'ERROR']
    results.append(filtered)

# Combine results
df_errors = pd.concat(results, ignore_index=True)
```
This keeps memory usage stable — instead of loading 2GB into RAM, you only need around 200MB at any given time.
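When even the filtered result is too large to keep, you can go one step further and carry only running totals across chunks, discarding each chunk's rows immediately. A self-contained sketch (the small generated file stands in for a hypothetical multi-gigabyte `large_log.csv`):

```python
import pandas as pd

# Generate a small stand-in for large_log.csv so the sketch runs anywhere
pd.DataFrame({
    'level': ['ERROR', 'INFO', 'ERROR', 'WARN'] * 25,
}).to_csv('large_log.csv', index=False)

# Only the counters survive across chunks; rows are freed after each loop
error_count = 0
total_rows = 0
for chunk in pd.read_csv('large_log.csv', chunksize=10):
    error_count += (chunk['level'] == 'ERROR').sum()
    total_rows += len(chunk)

print(f"{error_count} errors out of {total_rows} rows")
```

Memory usage is now bounded by the chunk size alone, regardless of how many matching rows the file contains.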
Patterns I Use Frequently in Automation
After refactoring that 2,000-line script, I distilled two helpers I reuse the most:
```python
import pandas as pd
from pathlib import Path
from datetime import datetime

def load_csv_safe(filepath: str, required_cols: list) -> pd.DataFrame:
    """Read a CSV and validate that required columns are present."""
    df = pd.read_csv(filepath)
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

def export_timestamped(df: pd.DataFrame, prefix: str, output_dir: str = ".") -> str:
    """Export a CSV with a timestamp in the filename."""
    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"{prefix}_{ts}.csv"
    filepath = Path(output_dir) / filename
    df.to_csv(filepath, index=False, encoding='utf-8-sig')
    return str(filepath)

# Usage
df = load_csv_safe('servers.csv', ['hostname', 'cpu_usage'])
output = export_timestamped(df[df['cpu_usage'] > 80], 'high_cpu_alert')
print(f"Exported: {output}")
```
The script used to crash with a bare KeyError at float(row['cpu_usage']) because someone pasted the wrong column header into the CSV. With load_csv_safe(), that error is caught at the very first step, with a message that clearly identifies which column is missing. In a production environment, that kind of fail-fast behavior saves more debugging time than any other pattern.
