PostgreSQL Backup Done Right: Strategy, Automation, and Real-World Restore Testing

Database tutorial - IT technology blog

Backup “complete” — but restore fails: a story from experience

After several years working with PostgreSQL, I’ve noticed that most people set up their backup and then… forget about it entirely. Run pg_dump every night, logs show success, and that feels like enough. Until the day you actually need to restore — and everything falls apart.

I’ve worked with MySQL, PostgreSQL, and MongoDB. Each has its strengths. But PostgreSQL has the most robust backup ecosystem of the bunch — if you know how to use it properly. This post covers the workflow I actually use in production, including the restore testing part that most people skip.

Context: When to use which backup method

PostgreSQL has 3 main backup methods, and picking the wrong one can cost you hours when an incident hits:

  • pg_dump / pg_dumpall: Logical backup, exports to SQL or custom format. Best for migrating data, backing up individual databases, or selectively restoring just a few tables.
  • pg_basebackup: Full cluster backup at the filesystem level. Much faster, suited for large databases, and required when setting up streaming replication.
  • WAL Archiving + PITR: Point-in-time recovery — restore to any moment in the past. More complex, but this is the last line of defense for critical production systems.

Simple rule of thumb: databases under 10GB are fine with pg_dump. Above 10GB or if you need PITR, invest in pg_basebackup combined with WAL archiving.
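That rule of thumb is simple enough to encode in your tooling. A minimal sketch (the helper name and the hard 10GB cutoff are my own; tune the number to your hardware and I/O):

```shell
#!/bin/bash
# choose_backup_method: hypothetical helper applying the rule of thumb above.
# Feed it a database size in GB (e.g. derived from pg_database_size()).
choose_backup_method() {
    local size_gb=$1
    if [ "$size_gb" -lt 10 ]; then
        echo "pg_dump"
    else
        echo "pg_basebackup + WAL archiving"
    fi
}

choose_backup_method 5    # prints "pg_dump"
choose_backup_method 50   # prints "pg_basebackup + WAL archiving"
```

In practice you would get the size from `SELECT pg_database_size(datname) FROM pg_database;` and divide down to GB before calling this.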

Setup: Preparing directories and access permissions

Before writing any backup script, set up the environment properly — this is the step people rush through, then hit permission errors when cron actually runs:

# Create a clear backup directory structure
sudo mkdir -p /var/backups/postgresql/{daily,weekly,monthly}
sudo chown postgres:postgres /var/backups/postgresql -R
sudo chmod 750 /var/backups/postgresql -R

# Verify the postgres user can write to it
sudo -u postgres touch /var/backups/postgresql/daily/.test && echo "OK" || echo "Permission denied"

The daily/weekly/monthly structure makes rotation much easier later on. Dumping everything into a single directory means you won’t be able to find anything after a few weeks.

To let cron run without a password prompt, use a .pgpass file:

# ~/.pgpass — format: hostname:port:database:username:password
# Write it as the postgres user so ownership is correct from the start
sudo -u postgres bash -c 'echo "localhost:5432:*:postgres:your_password" > /var/lib/postgresql/.pgpass'
sudo chmod 600 /var/lib/postgresql/.pgpass
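One gotcha worth automating a check for: libpq silently ignores a .pgpass file that is group- or world-readable, which then surfaces as a confusing authentication failure only when cron runs. A small sketch (the helper is mine; `stat -c` assumes GNU coreutils):

```shell
#!/bin/bash
# check_pgpass_perms: hypothetical guard. libpq refuses to use .pgpass
# unless the mode is 0600 (owner read/write only).
check_pgpass_perms() {
    local f=$1 mode
    mode=$(stat -c %a "$f" 2>/dev/null) || { echo "missing"; return 1; }
    if [ "$mode" = "600" ]; then
        echo "OK"
    else
        echo "too permissive ($mode)"
        return 1
    fi
}

# Demo against a throwaway file
tmp=$(mktemp)
chmod 644 "$tmp"
check_pgpass_perms "$tmp" || true   # prints "too permissive (644)"
chmod 600 "$tmp"
check_pgpass_perms "$tmp"           # prints "OK"
rm -f "$tmp"
```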

Configuration: Automated backup scripts

Per-database backup with pg_dump

Custom format (-Fc) compresses roughly 60–70% better than plain SQL — a 10GB database typically produces a 3–4GB dump file. It also supports parallel restore with multiple workers, something plain SQL simply can’t do:

#!/bin/bash
# /usr/local/bin/pg_backup_daily.sh

BACKUP_DIR="/var/backups/postgresql/daily"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=7
LOG_FILE="/var/log/pg_backup.log"

echo "[$(date '+%Y-%m-%d %H:%M:%S')] === Daily backup started ===" >> "$LOG_FILE"

# Back up each database, skipping template databases
for DB in $(sudo -u postgres psql -Atc "SELECT datname FROM pg_database WHERE datistemplate = false;"); do
    BACKUP_FILE="$BACKUP_DIR/${DB}_${DATE}.dump"

    sudo -u postgres pg_dump \
        --format=custom \
        --compress=9 \
        --file="$BACKUP_FILE" \
        "$DB" 2>> "$LOG_FILE"

    if [ $? -eq 0 ]; then
        SIZE=$(du -sh "$BACKUP_FILE" | cut -f1)
        echo "[OK] $DB → $BACKUP_FILE ($SIZE)" >> "$LOG_FILE"
    else
        echo "[FAIL] Backup failed for database: $DB" >> "$LOG_FILE"
        # Add alerting here: curl Slack webhook, send email, etc.
    fi
done

# Roles and tablespaces aren't stored inside any single database;
# capture the cluster globals too, or restored databases will have no owners
sudo -u postgres pg_dumpall --globals-only > "$BACKUP_DIR/globals_${DATE}.sql" 2>> "$LOG_FILE"

# Remove backups older than RETENTION_DAYS days
find "$BACKUP_DIR" \( -name "*.dump" -o -name "globals_*.sql" \) -mtime +$RETENTION_DAYS -delete
echo "[INFO] Cleaned up backups older than $RETENTION_DAYS days" >> "$LOG_FILE"

That find ... -delete at the end is just as important as the backup itself. Disk full → cron fails silently → no new backups → incident happens. I’ve seen this exact scenario play out.
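Since disk exhaustion is the classic silent killer here, it is also worth failing loudly before the dump even starts. A hedged sketch (the function and the threshold are my assumptions, not part of the script above):

```shell
#!/bin/bash
# require_free_space: abort early when the target filesystem is low on space.
# Size the threshold to a few times your largest dump file.
require_free_space() {
    local dir=$1 min_mb=$2 avail_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    if [ $((avail_kb / 1024)) -lt "$min_mb" ]; then
        echo "[FAIL] Only $((avail_kb / 1024))MB free in $dir (need ${min_mb}MB)" >&2
        return 1
    fi
}

# Call it at the top of the backup script, before the pg_dump loop:
require_free_space /tmp 1 && echo "enough space, proceeding"
```

If it returns non-zero, fire the same alert you use for a failed dump: a backup that never ran is just as missing as one that failed.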

Scheduling with cron

# Open crontab for the postgres user
sudo -u postgres crontab -e

# Daily backup at 2:00 AM
0 2 * * * /usr/local/bin/pg_backup_daily.sh

# Weekly backup every Sunday at 3:00 AM
0 3 * * 0 /usr/local/bin/pg_backup_weekly.sh

Cluster backup with pg_basebackup (large databases)

Once a database exceeds 10GB, pg_dump starts getting noticeably slow. A 20GB database can take 15–25 minutes depending on I/O, while pg_basebackup copies the data files directly and typically finishes the same database in just 5–8 minutes. The gap grows wider as the database scales:

sudo -u postgres pg_basebackup \
    --pgdata=/var/backups/postgresql/basebackup/$(date +%Y%m%d) \
    --format=tar \
    --compress=9 \
    --progress \
    --checkpoint=fast \
    --verbose

# Verify the result
du -sh /var/backups/postgresql/basebackup/$(date +%Y%m%d)
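Restoring a tar-format basebackup is mostly extraction plus permissions. The paths below are illustrative, and a real point-in-time restore additionally needs the matching archived WAL; this sketch covers only the extraction step:

```shell
#!/bin/bash
# restore_basebackup: extract a tar-format basebackup into an empty data
# directory. PostgreSQL refuses to start on a group/world-accessible datadir,
# hence the chmod 700. WAL replay configuration is out of scope here.
restore_basebackup() {
    local tarball=$1 target=$2
    mkdir -p "$target"
    tar -xf "$tarball" -C "$target"
    chmod 700 "$target"
}

# Demo against a dummy archive (a real one would be base.tar from pg_basebackup)
work=$(mktemp -d)
mkdir "$work/data" && echo "16" > "$work/data/PG_VERSION"
tar -cf "$work/base.tar" -C "$work/data" .
restore_basebackup "$work/base.tar" "$work/restored"
cat "$work/restored/PG_VERSION"   # prints "16"
```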

Verification & Monitoring: The most important part people skip

Regular restore testing — non-negotiable

I learned this the hard way: the first time I needed an emergency production restore, I opened the dump file and discovered the script had been silently failing for three weeks because the disk was full. A backup you never test is a backup you don’t have — you’ll only find out whether it works at the worst possible moment.

#!/bin/bash
# /usr/local/bin/pg_test_restore.sh — run weekly

BACKUP_FILE=$(ls -t /var/backups/postgresql/daily/*.dump 2>/dev/null | head -1)
TEST_DB="restore_test_$(date +%Y%m%d)"

if [ -z "$BACKUP_FILE" ]; then
    echo "[FAIL] No backup file found!" >&2
    exit 1
fi

echo "=== Testing restore: $BACKUP_FILE ==="

# Create a temporary test database
sudo -u postgres createdb "$TEST_DB"

# Restore with 4 parallel jobs
sudo -u postgres pg_restore \
    --dbname="$TEST_DB" \
    --jobs=4 \
    --verbose \
    "$BACKUP_FILE"

if [ $? -eq 0 ]; then
    TABLE_COUNT=$(sudo -u postgres psql -Atc "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='public';" "$TEST_DB")
    echo "[OK] Restore successful — tables in public schema: $TABLE_COUNT"
    STATUS=0
else
    echo "[FAIL] Restore failed! Immediate investigation required." >&2
    STATUS=1
fi

# Clean up the test database, then propagate the result so cron/alerting can see it
sudo -u postgres dropdb "$TEST_DB"
echo "Test database dropped."
exit $STATUS

This script automatically grabs the latest backup, restores it into a temporary database, checks the table count, then tears it down. Run it weekly via cron and log the results. If it fails, alert immediately — don’t wait until production is on fire.
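The "alert immediately" part can be as simple as POSTing to a chat webhook. Here is a sketch of the payload side (the helper name and the SLACK_WEBHOOK_URL variable are my assumptions; any HTTP alerting endpoint works the same way):

```shell
#!/bin/bash
# build_alert_payload: hypothetical helper that formats a failure message
# as the minimal JSON a Slack-style incoming webhook accepts.
build_alert_payload() {
    local db=$1 stage=$2
    printf '{"text":"PostgreSQL %s failed for database %s on %s"}' \
        "$stage" "$db" "${HOSTNAME:-unknown}"
}

build_alert_payload myapp restore-test

# Wire it into the [FAIL] branch, e.g.:
#   curl -s -X POST -H 'Content-Type: application/json' \
#        -d "$(build_alert_payload "$TEST_DB" restore-test)" "$SLACK_WEBHOOK_URL"
```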

Checking backup file integrity

# Check if the backup file is corrupt (no actual restore needed)
sudo -u postgres pg_restore --list /var/backups/postgresql/daily/myapp_latest.dump > /dev/null 2>&1
if [ $? -eq 0 ]; then
    echo "Backup integrity: OK"
else
    echo "WARNING: Backup file may be corrupt!"
fi

# Review recent backup sizes — an unexpected drop in size is worth investigating
ls -lh /var/backups/postgresql/daily/ | tail -7

A simple but effective trick: track your backup file size day over day. If your backup normally produces 2GB but today’s is only 200MB and nothing was migrated or deleted — that’s a signal to investigate immediately. The script may have only captured partial data before timing out.
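That day-over-day check is easy to script. A sketch of the comparison logic (the 50% threshold is my assumption; stable datasets can use a much tighter one):

```shell
#!/bin/bash
# size_drop_alert: flag a suspicious drop between yesterday's and today's
# backup sizes, in bytes (e.g. obtained via `stat -c %s file.dump`).
size_drop_alert() {
    local prev_bytes=$1 curr_bytes=$2
    if [ "$curr_bytes" -lt $((prev_bytes / 2)) ]; then
        echo "ALERT: backup shrank from $prev_bytes to $curr_bytes bytes"
        return 1
    fi
    echo "OK"
}

size_drop_alert 2000000000 1900000000          # normal fluctuation, prints "OK"
size_drop_alert 2000000000 200000000 || true   # 90% drop, prints an ALERT line
```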

Emergency restore procedure

When a real incident hits, your mind can go blank. Keep a runbook ready so you can execute without having to think:

# Step 1: Identify the backup to restore
ls -lt /var/backups/postgresql/daily/ | head -5

# Step 2: Stop the application before restoring (critical!)
systemctl stop myapp

# Step 3: Restore (--clean drops old objects first, --jobs parallelizes the restore)
sudo -u postgres pg_restore \
    --dbname=myapp_production \
    --clean \
    --if-exists \
    --jobs=4 \
    /var/backups/postgresql/daily/myapp_20240301_020000.dump

# Step 4: Quick sanity check before restarting the app
sudo -u postgres psql myapp_production -c "SELECT COUNT(*) FROM users;"
sudo -u postgres psql myapp_production -c "SELECT MAX(created_at) FROM orders;"

# Step 5: Restart the application
systemctl start myapp

The --clean --if-exists flags drop and recreate objects before restoring — preventing conflicts with leftover data. The sanity check in step 4 takes 5 seconds but gives you confidence before flipping the switch back on for your users.

Best practices summary

  • 3-2-1 rule: 3 copies of your backup, 2 different storage types, 1 offsite (S3, Backblaze, etc.)
  • Test restores weekly: Automate it — don’t wait until you actually need it
  • Monitor backup file size: An unexpectedly small file = investigate immediately
  • Use custom format -Fc: Better compression than SQL, supports parallel restore, more flexible
  • Backup before every migration: Always, no exceptions, even for small changes
  • Full logging + alerts on failure: Silent backup failures are the worst-case scenario
  • Clear retention policy: Daily for 7 days, weekly for 4 weeks, monthly for 3 months — balance between storage cost and rollback capability
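The retention bullet above maps directly onto find(1) and the daily/weekly/monthly layout from the setup section. A sketch (directory names match the earlier mkdir; the day counts are the policy knobs):

```shell
#!/bin/bash
# apply_retention: enforce the 7-day / 4-week / 3-month policy with find(1).
# -mtime +N matches files last modified more than N days ago.
apply_retention() {
    local root=$1
    find "$root/daily"   -name "*.dump" -mtime +7  -delete 2>/dev/null
    find "$root/weekly"  -name "*.dump" -mtime +28 -delete 2>/dev/null
    find "$root/monthly" -name "*.dump" -mtime +90 -delete 2>/dev/null
}

# Demo in a throwaway tree (touch -d assumes GNU coreutils)
root=$(mktemp -d)
mkdir -p "$root/daily" "$root/weekly" "$root/monthly"
touch "$root/daily/fresh.dump"
touch -d "10 days ago" "$root/daily/stale.dump"
apply_retention "$root"
ls "$root/daily"   # prints only "fresh.dump"
```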

After years of working with databases, I’ve come to one conclusion: the best backup strategy isn’t the most complex one. It’s the one you’ve actually tested and can trust at 3 AM when everything is going wrong.
