MongoDB Goes Silent Until It Dies — A Problem I’ve Run Into
The production database was slowing down at 2 AM. No alerts. No warning logs beforehand. I only found out when users reported the app wouldn’t load — and by then, the replica set primary had been re-elected 3 times overnight.
Before I had proper monitoring, my incident response was: SSH into each server, run db.serverStatus() in the mongo shell, copy the numbers into Notepad, and compare them by hand. It took over an hour, and I still wasn’t sure what I was looking at. Now I open the Grafana dashboard and see connection pool usage, replication lag, and slow queries in real time on one screen.
This article focuses on the 3 things MongoDB tends to hide most: Replica Set health, Connection Pool exhaustion, and Slow Operations — the real culprits behind “app is slow for no apparent reason.”
Why MongoDB Is Harder to Monitor Than You Think
MongoDB comes with db.serverStatus() and db.currentOp(), but these are point-in-time snapshots. You have to actively query them, and nothing stores historical data to compare trends over time.
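To make the snapshot problem concrete, here is a minimal Python sketch that turns two point-in-time counter readings into a per-second rate, which is roughly what Prometheus does for you with rate(). The function name per_second_rate and the sample numbers are illustrative assumptions, not output from a real server.

```python
def per_second_rate(prev_value, prev_ts, curr_value, curr_ts):
    """Per-second increase of a monotonically increasing counter between two snapshots."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("snapshots must be taken in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # The counter went backwards, so the server restarted: count from zero.
        delta = curr_value
    return delta / elapsed


# Two hypothetical readings of a query counter taken 15 seconds apart
print(per_second_rate(120_000, 0, 121_500, 15))  # 100.0
```

A single snapshot gives you only the 121,500; the rate needs the earlier reading too, which is exactly the history that db.serverStatus() does not keep.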
Replica sets add another layer of complexity. The primary can fail over at any time. Secondary lag grows silently, and nobody notices until reads routed to secondaries (any read preference other than primary) start returning stale data.
The connection pool is the most overlooked aspect. Most official MongoDB drivers default to a maxPoolSize of 100 connections per client, which sounds generous. But when an application leaks connections or hits a traffic spike, the pool drains in seconds. The error on the app side is usually just a generic timeout with no mention of MongoDB, so it’s easy to spend 20-30 minutes debugging the wrong thing.
For slow queries, MongoDB has the Profiler, but level 2 (profile every operation) adds noticeable overhead on a busy server. Level 1, which records only operations slower than a threshold (100ms by default), is a reasonable setting, but you still need a tool to aggregate and visualize trends over time.
Installing the Prometheus MongoDB Exporter
Use the Percona build — the most complete in terms of metrics, with replica set and sharding support:
# Download mongodb_exporter (Percona build)
wget https://github.com/percona/mongodb_exporter/releases/download/v0.40.0/mongodb_exporter-0.40.0.linux-amd64.tar.gz
tar xzf mongodb_exporter-0.40.0.linux-amd64.tar.gz
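# The archive usually unpacks into a versioned directory; adjust the path if yours differs
sudo mv mongodb_exporter-0.40.0.linux-amd64/mongodb_exporter /usr/local/bin/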
// Create a dedicated MongoDB user for the exporter (run in the mongo shell)
use admin
db.createUser({
  user: "prometheus",
  pwd: "strong_password_here",
  roles: [
    { role: "clusterMonitor", db: "admin" },
    { role: "read", db: "local" }
  ]
})
Create a systemd service so the exporter starts automatically on boot:
# Contents of /etc/systemd/system/mongodb_exporter.service
[Unit]
Description=MongoDB Exporter
After=network.target
[Service]
User=prometheus
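# The URI is left unquoted: systemd documents quoting only for whole arguments,
# so the unquoted form avoids any ambiguity about literal quote characters.
ExecStart=/usr/local/bin/mongodb_exporter \
    --mongodb.uri=mongodb://prometheus:strong_password_here@localhost:27017/admin \
    --collect-all \
    --discovering-mode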
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
# Enable and verify
sudo systemctl daemon-reload
sudo systemctl enable --now mongodb_exporter
curl -s localhost:9216/metrics | grep mongodb_up
The --collect-all flag enables all collectors, including replica set and index stats. --discovering-mode makes the exporter discover databases and collections automatically, so collection and index stats keep covering newly created collections without a config change.
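If you want to script the mongodb_up check instead of eyeballing curl output, a minimal Python sketch could parse the text exposition format like this. The sample text, its label, and the helper name metric_values are assumptions for illustration; the parser also assumes samples carry no trailing timestamp, which is typical for exporter output.

```python
def metric_values(exposition_text, metric_name):
    """Return the sample values for metric_name from Prometheus text-format output."""
    values = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field on the line.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name == metric_name:
            values.append(float(value))
    return values


# Hypothetical excerpt of what `curl -s localhost:9216/metrics` might return
SAMPLE = """\
# HELP mongodb_up Whether MongoDB is up.
# TYPE mongodb_up gauge
mongodb_up{cluster_role="mongod"} 1
"""

print(metric_values(SAMPLE, "mongodb_up"))  # [1.0]
```

An empty list means the exporter is not exposing the metric at all, which is a different failure from a value of 0.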
Add the scrape job to Prometheus:
# prometheus.yml
scrape_configs:
  - job_name: 'mongodb'
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets: ['localhost:9216']
        labels:
          instance: 'mongo-primary'
          env: 'production'
Monitoring Replica Set Health
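Replica set health is the first thing I check during an incident; in production it matters more than any other MongoDB metric. Rising replication lag means a secondary is falling behind the primary. Reads from that secondary return stale data, and if the primary goes down while lag is high, writes that had not replicated yet can be rolled back after the failover. The mongodb_mongod_* names below are the exporter’s legacy metric names; as far as I know, recent Percona releases only expose them when the exporter also runs with --compatible-mode, so check that flag if these queries come back empty.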
# Replication lag per secondary (seconds)
mongodb_mongod_replset_member_replication_lag{state="SECONDARY"}
# Replica set member status (1 = healthy, 0 = down)
mongodb_mongod_replset_member_health
# Number of healthy members — alert if majority is lost
count(mongodb_mongod_replset_member_health == 1) by (set)
# Oplog window (hours) — how long a secondary has to catch up
(mongodb_mongod_replset_oplog_head_timestamp - mongodb_mongod_replset_oplog_tail_timestamp) / 3600
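The arithmetic behind the last two queries is simple enough to sketch in Python. The function names and the sample timestamps are assumptions for illustration; the majority rule itself (more than half of the voting members) is how replica set elections work.

```python
def has_majority(healthy_members, total_voting_members):
    """A replica set can elect a primary only if more than half its voters are up."""
    return healthy_members > total_voting_members // 2


def oplog_window_hours(head_ts, tail_ts):
    """Time span covered by the oplog, from its oldest (tail) to newest (head) entry."""
    return (head_ts - tail_ts) / 3600


print(has_majority(2, 3))   # True: 2 of 3 members is still a majority
print(has_majority(2, 4))   # False: exactly half is not enough
print(oplog_window_hours(1_700_086_400, 1_700_000_000))  # 24.0
```

A secondary whose lag approaches the oplog window can no longer catch up from the oplog and needs a full resync, so it is worth comparing the two numbers rather than watching lag alone.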
Alert rules should be set up from the start, before incidents happen:
- alert: MongoDBReplicationLagHigh
  expr: mongodb_mongod_replset_member_replication_lag > 30
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "MongoDB replication lag is high: {{ $value }}s"
    description: "Secondary {{ $labels.instance }} is {{ $value }}s behind the primary"
- alert: MongoDBReplicaSetMemberDown
  expr: count(mongodb_mongod_replset_member_health == 0) > 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "MongoDB Replica Set has a member down"
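The for: clause is what keeps a brief lag spike from paging anyone. Here is a simplified Python model of that behaviour, assuming one sample per rule evaluation; it is a sketch of the idea, not Prometheus’ actual implementation, and alert_state is a made-up name.

```python
def alert_state(samples, threshold, for_seconds):
    """Return 'inactive', 'pending', or 'firing' after the last sample.

    samples: list of (timestamp_seconds, value) pairs in time order.
    """
    breach_start = None
    state = "inactive"
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # first evaluation where the condition holds
            held_for = ts - breach_start
            state = "firing" if held_for >= for_seconds else "pending"
        else:
            # Condition cleared: the timer starts over on the next breach.
            breach_start = None
            state = "inactive"
    return state


# Lag above 30s for a full 2 minutes
print(alert_state([(0, 45), (60, 50), (120, 55)], threshold=30, for_seconds=120))  # firing
# Lag dipped below the threshold in between, so the timer restarted
print(alert_state([(0, 45), (60, 10), (120, 50)], threshold=30, for_seconds=120))  # pending
```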
Monitoring the Connection Pool
Connection pool exhaustion rarely surfaces as a direct error from MongoDB. On the app side you just see timeouts. The exporter reports connections as the server sees them, summed across every client, so a climbing count here is the earliest sign that some application pool is filling up. These metrics should be on your dashboard from day one:
# Current and available connections
mongodb_ss_connections{conn_type="current"}
mongodb_ss_connections{conn_type="available"}
# Connection pool usage ratio — alert when > 80%
(
mongodb_ss_connections{conn_type="current"} /
(mongodb_ss_connections{conn_type="current"} + mongodb_ss_connections{conn_type="available"})
) * 100
# New connection creation rate — a sudden spike is a warning sign
rate(mongodb_ss_connections{conn_type="totalCreated"}[5m])
A pattern I often see: connection count spikes suddenly and then drops right away — a classic sign of a connection leak in the application code, not a MongoDB issue. The Grafana graph will draw a very clear sawtooth pattern. This is distinctly different from genuine load increases, which rise gradually with traffic and drop off gradually afterward.
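One rough way to tell the two shapes apart programmatically is to count abrupt drops in the connection series. The heuristic below is an assumption of mine, not a standard method, and the 50% drop threshold and sample series are arbitrary illustrations.

```python
def count_sharp_drops(series, drop_fraction=0.5):
    """Count consecutive-sample drops of at least drop_fraction of the previous value."""
    drops = 0
    for prev, curr in zip(series, series[1:]):
        if prev > 0 and (prev - curr) / prev >= drop_fraction:
            drops += 1
    return drops


# Hypothetical connection counts sampled at a fixed interval
leaky = [100, 200, 300, 400, 100, 200, 300, 400, 100]   # sawtooth
organic = [100, 150, 200, 250, 200, 150, 100]           # gradual rise and fall

print(count_sharp_drops(leaky))    # 2
print(count_sharp_drops(organic))  # 0
```

Repeated sharp drops suggest connections are being released in bulk (a restart, a timeout sweep) rather than returned as requests finish.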
Detecting Slow Operations
Enable the MongoDB Profiler at level 1 to capture slow queries with little overhead:
// In the mongo shell: enable profiler level 1 (log ops slower than 100ms)
use myDatabase
db.setProfilingLevel(1, { slowms: 100 })
// Check current configuration
db.getProfilingStatus()
// View recent slow queries directly (quick debug)
db.system.profile.find().sort({ ts: -1 }).limit(10).pretty()
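Ten raw profile documents are hard to read at a glance. A small Python sketch like the following could group them by namespace and operation; it works on plain dicts with the ns, op, and millis fields that system.profile documents carry, and the sample documents and function name are made up for illustration.

```python
from collections import defaultdict


def slowest_namespaces(profile_docs, top_n=3):
    """Rank (namespace, operation) pairs by total time spent in slow operations."""
    totals = defaultdict(lambda: {"count": 0, "total_millis": 0})
    for doc in profile_docs:
        entry = totals[(doc["ns"], doc["op"])]
        entry["count"] += 1
        entry["total_millis"] += doc["millis"]
    ranked = sorted(totals.items(), key=lambda item: item[1]["total_millis"], reverse=True)
    return [(ns, op, stats["count"], stats["total_millis"]) for (ns, op), stats in ranked[:top_n]]


# Hypothetical documents, shaped like entries from db.system.profile
docs = [
    {"ns": "shop.orders", "op": "query", "millis": 850},
    {"ns": "shop.users", "op": "update", "millis": 120},
    {"ns": "shop.orders", "op": "query", "millis": 650},
]

for ns, op, count, total in slowest_namespaces(docs):
    print(f"{ns} {op}: {count} slow ops, {total} ms total")
```

Sorting by total time rather than by the single slowest operation tends to surface the query that hurts most overall.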
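mongodb_exporter exposes the counters behind the scan ratio, a key metric for detecting queries that are missing an index:
# Documents scanned / documents returned, over the last 5 minutes
# High ratio (e.g. 1000:1) = full collection scan, missing index
# ignoring(state) is required: the two sides carry different state labels
# and would otherwise never match. If the "returned" series is empty on your
# exporter version, look for it under mongodb_mongod_metrics_document_total.
rate(mongodb_mongod_metrics_query_executor_total{state="scanned"}[5m]) /
ignoring(state) rate(mongodb_mongod_metrics_query_executor_total{state="returned"}[5m])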
# Operations per second by type
rate(mongodb_ss_opcounters{legacy_op_type="query"}[1m])
rate(mongodb_ss_opcounters{legacy_op_type="update"}[1m])
rate(mongodb_ss_opcounters{legacy_op_type="delete"}[1m])
# Page faults — sign that the working set is larger than RAM
rate(mongodb_ss_extra_info{extra_info_type="page_faults"}[5m])
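As a worked example of the scan ratio, here is a minimal Python sketch that computes it from counter increases over the same time window. The numbers and the function name scan_ratio are hypothetical; the zero-handling is a choice I made for the sketch.

```python
def scan_ratio(scanned_delta, returned_delta):
    """Documents examined per document returned over the same time window."""
    if returned_delta == 0:
        # Scanning without returning anything is the worst case; nothing at all is fine.
        return float("inf") if scanned_delta > 0 else 0.0
    return scanned_delta / returned_delta


print(scan_ratio(50_000, 50))  # 1000.0 -> likely a collection scan
print(scan_ratio(120, 100))    # 1.2    -> index is doing its job
```

A ratio near 1 means almost every examined document was useful; a ratio in the hundreds or thousands is the signature of a missing or unused index.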
Importing the Grafana Dashboard
Instead of building from scratch, import a ready-made dashboard from the public dashboard catalog on grafana.com:
- Dashboard ID 14997: MongoDB Overview by Percona — the most complete, includes Replica Set panels and WiredTiger stats
- Dashboard ID 2583: MongoDB Exporter — lighter weight, suitable for standalone instances
Go to Grafana → Dashboards → Import → enter the ID → Load → select your Prometheus datasource.
After importing, add a Connection Pool utilization panel as a Gauge with color thresholds: green below 60%, yellow from 60–80%, red above 80%. A quick glance tells you the status without needing to read the numbers.
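The gauge logic is the same usage ratio as the PromQL earlier, mapped onto the three color bands. A minimal Python sketch, with function names and sample values chosen for illustration:

```python
def pool_utilization_percent(current, available):
    """Share of the server's connection capacity that is currently in use."""
    total = current + available
    return 0.0 if total == 0 else 100 * current / total


def gauge_color(percent):
    """Map utilization to the bands: green < 60, yellow 60-80, red > 80."""
    if percent < 60:
        return "green"
    if percent <= 80:
        return "yellow"
    return "red"


usage = pool_utilization_percent(current=750, available=250)
print(usage, gauge_color(usage))  # 75.0 yellow
```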
Tiering Alerts to Avoid Alarm Fatigue
Set too many alerts and you end up ignoring all of them. I split them into 3 tiers:
- Critical (phone call / PagerDuty): Replica set loses majority votes, replication lag > 60 seconds, available connections < 5%
- Warning (Telegram): Replication lag > 30 seconds, connection usage > 80%, page faults rising continuously for 10 minutes
- Info (Slack): Slow query rate increases > 50% compared to last week’s baseline
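The Critical and Warning tiers above can be read as a small decision function. This Python sketch encodes only the static thresholds from the list (the page-fault trend and the weekly slow-query baseline need history, so they are left out); the function name and argument names are assumptions.

```python
def classify(has_majority, lag_seconds, connection_usage_percent):
    """Return the alert tier for one evaluation of the three headline signals."""
    available_percent = 100 - connection_usage_percent
    if not has_majority or lag_seconds > 60 or available_percent < 5:
        return "critical"   # phone call / PagerDuty
    if lag_seconds > 30 or connection_usage_percent > 80:
        return "warning"    # Telegram
    return "ok"


print(classify(has_majority=True, lag_seconds=45, connection_usage_percent=50))   # warning
print(classify(has_majority=True, lag_seconds=5, connection_usage_percent=97))    # critical
```

Checking the critical conditions first means an incident is never downgraded just because a warning condition also matches.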
One thing many people overlook: set alerts based on rate-of-change rather than static thresholds. 500ms query latency might be perfectly normal for one system but catastrophic for another. PromQL for this:
# Alert when query rate increases 5x compared to the previous hour
rate(mongodb_ss_opcounters{legacy_op_type="query"}[5m]) /
rate(mongodb_ss_opcounters{legacy_op_type="query"}[5m] offset 1h) > 5
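One edge case in a ratio like this is a baseline of zero: in PromQL, dividing a non-zero rate by zero yields +Inf, which is greater than 5 and would fire the alert. The Python sketch below shows the comparison with an explicit guard that skips the alert when there is no baseline traffic. That guard is my own choice for the sketch, and the names are hypothetical.

```python
def rate_change_ratio(current_rate, baseline_rate):
    """Ratio of the current rate to the baseline, or None if there is no baseline."""
    if baseline_rate <= 0:
        return None  # no traffic an hour ago, so the ratio is undefined
    return current_rate / baseline_rate


def should_alert(current_rate, baseline_rate, factor=5.0):
    """True when the current rate exceeds the baseline by more than `factor` times."""
    ratio = rate_change_ratio(current_rate, baseline_rate)
    return ratio is not None and ratio > factor


print(should_alert(600, 100))  # True: 6x the rate from an hour ago
print(should_alert(400, 100))  # False: 4x is under the threshold
print(should_alert(50, 0))     # False: no baseline to compare against
```

Whether a jump from zero traffic should page someone depends on the system, so decide that case deliberately instead of inheriting it from the division.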
mongodb_exporter is lighter than you’d expect — only ~50MB RAM, and scraping every 15 seconds has no noticeable impact on MongoDB. Setup takes a few hours. But in return, the next time something goes wrong at 2 AM, you open Grafana and immediately know what’s happening and when it started — instead of fumbling through SSH into each server while everything is already on fire.

