The 2 AM Phone Call and ‘Meaningless’ Numbers
My phone vibrated violently on the table in the middle of the night. Slack was flooded with 504 Gateway Timeout notifications. Checking the application logs, I only saw a long list of Connection pool exhausted errors. I rushed to SSH into the server, typed top, and saw CPU usage spiking to 90%, but I had absolutely no idea which query was hogging resources or how many transactions were hanging.
It felt like driving at high speed through thick fog without headlights. If I had a good enough monitoring system, I would have noticed the sudden spike in connections at 1:45 AM and handled it before the system ‘collapsed.’ This article summarizes those hard-earned lessons, helping you build a sharp ‘pair of eyes’ for PostgreSQL.
Quick Deployment: A Dashboard in 5 Minutes
For quick debugging or testing, Docker Compose is the optimal choice. With just one configuration file, you can spin up the entire monitoring stack without a complicated installation process.
# docker-compose.yml
version: '3.8'
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@your_db_ip:5432/postgres?sslmode=disable"
    ports:
      - "9187:9187"
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
Next, create a prometheus.yml file with a minimal configuration to receive data:
scrape_configs:
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
Run the command docker-compose up -d. At this point, the exporter exposes its metrics, Prometheus scrapes them on each interval (Prometheus pulls; nothing is pushed to it), and they are ready for visualization in Grafana at port 3000.
How it Works: The Role of postgres_exporter
By default, PostgreSQL does not expose its statistics in the time-series format that Prometheus requires. postgres_exporter acts as a translator. It connects to the database, executes SQL commands like SELECT * FROM pg_stat_activity, and then exposes the results as machine-readable metrics on an HTTP endpoint (port 9187, path /metrics) for Prometheus to scrape.
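To make that concrete, here is a query in the spirit of what the exporter runs under the hood. The exact built-in queries vary by exporter version, so treat this as an illustration rather than the exporter's literal SQL:

```sql
-- Count backends per state, similar to the exporter's
-- pg_stat_activity_count metric family
SELECT state, COUNT(*) AS connections
FROM pg_stat_activity
WHERE state IS NOT NULL
GROUP BY state;
```

Each row of a result like this becomes one labeled time series, e.g. a gauge with a `state="idle"` label.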
Setting Up a Dedicated User for Security
Never use the postgres (superuser) account for monitoring in production. You should create a dedicated user with limited permissions to ensure security:
-- Create a dedicated user for monitoring
CREATE USER monitoring WITH PASSWORD 'your_secure_password';
-- Grant access to system statistics tables
GRANT pg_monitor TO monitoring;
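One caveat: the built-in pg_monitor role exists on PostgreSQL 10 and newer; on older versions you would have to grant SELECT on the individual pg_stat_* views instead. A quick sanity check that the new role can read statistics without superuser rights:

```sql
-- Impersonate the monitoring role and probe its access
SET ROLE monitoring;
SELECT COUNT(*) FROM pg_stat_activity;      -- should succeed via pg_monitor
SELECT current_setting('max_connections');  -- settings are readable too
RESET ROLE;
```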
4 ‘Golden’ Metrics You Can’t Ignore
Don’t get overwhelmed by hundreds of charts. Focus on the numbers that truly reflect system health:
- Connections: Compare actual connections with max_connections. If this number hits the 80% threshold, your system is in the danger zone.
- Transaction Throughput: Monitor the Commit/Rollback ratio. If Rollbacks spike (e.g., > 5%), there might be logic errors in the application code or data contention.
- Buffer Cache Hit Ratio: The percentage of data read from RAM. This should stay above 95%. If it drops to 80%, Disk I/O will increase tenfold, noticeably slowing down the DB.
- Database Size & Bloat: Monitor table growth. A 10GB table could contain up to 4GB of ‘junk’ (dead tuples) if the Autovacuum process isn’t working effectively.
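If you want to spot-check these numbers by hand before the dashboard is up, the standard statistics views expose them directly:

```sql
-- Connections vs. the configured limit
SELECT COUNT(*) AS used,
       current_setting('max_connections')::int AS max
FROM pg_stat_activity;

-- Commit/rollback counters and buffer cache hit ratio, per database
SELECT datname,
       xact_commit,
       xact_rollback,
       round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';

-- Dead tuples (bloat candidates) per table
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```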
During operations, you sometimes need to quickly extract SQL data into other formats for reporting. I often use the CSV to JSON converter tool at toolcraft.app. A nice bonus is that it processes everything right in the browser, so you don’t have to worry about leaking sensitive data to a third-party server.
Setting Up Alerts: So You Can Sleep Soundly
A dashboard is only useful when you’re looking at it. But you can’t stare at the screen 24/7. Alerting is the gatekeeper that helps you react before a disaster strikes.
Sample Alert Rule Configuration
Create an alert_rules.yml file to define your warning thresholds:
groups:
  - name: postgres_alerts
    rules:
      - alert: PostgresHighConnections
        # Sum per-database backend counts and match them to the
        # max_connections setting of the same instance
        expr: sum by (instance) (pg_stat_database_numbackends) > on (instance) pg_settings_max_connections * 0.8
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "DB is running out of connections on {{ $labels.instance }}"
          description: "Connection count has exceeded 80% of the limit for 5 consecutive minutes."
      - alert: PostgresDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: fatal
        annotations:
          summary: "PostgreSQL is down!"
          description: "Cannot connect to instance {{ $labels.instance }}. Check immediately!"
When integrated with Slack or Telegram, you’ll receive a message as soon as connections start to rise. Killing ‘idle connections’ at that point is much easier than trying to save a completely frozen DB.
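The Slack integration itself lives in Alertmanager, not Prometheus. A minimal sketch of an alertmanager.yml — the webhook URL and channel name are placeholders you would replace with your own:

```yaml
# alertmanager.yml -- illustrative sketch, not a production config
route:
  receiver: 'slack-oncall'
  group_by: ['alertname', 'instance']
receivers:
  - name: 'slack-oncall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder
        channel: '#db-alerts'                                    # placeholder
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
```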
Real-World Deployment Experience
After troubleshooting numerous incidents for large systems, I’ve gathered a few important notes:
- Leverage community dashboards: Don’t build your dashboard from scratch. Go to Grafana Labs and import Dashboard ID 9628. This is the most trusted standard version used by the community.
- Scraping Interval: Avoid setting it to 1 second, as every metric collection is a query to the DB. A 15-30 second interval is the perfect balance between accuracy and performance.
- Monitor hardware as well: A slow DB is sometimes caused by disk I/O bottlenecks. Install node_exporter to monitor CPU, RAM, and Disk latency in parallel.
- Combine with pg_stat_statements: Prometheus tells you the DB is slow, but pg_stat_statements points exactly to which SQL query is the culprit. Enable this extension to dive deep into the data.
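Following up on that last point, enabling pg_stat_statements takes two steps: preload the library, then create the extension. A sketch — note the column names shown are for PostgreSQL 13+; older versions use total_time/mean_time instead:

```sql
-- 1) In postgresql.conf (requires a server restart):
--      shared_preload_libraries = 'pg_stat_statements'

-- 2) Then, in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- 3) Find the queries burning the most total time
SELECT query,
       calls,
       round(total_exec_time::numeric, 2) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```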
Setting up monitoring might take a few hours initially, but it will save you weeks of digging through logs later. A professional system doesn’t just run well—it must be measurable.
