When Grafana dashboards crawl and alerts fire non-stop at 3 AM
When I first set up Prometheus for our system, I thought installing it was the finish line. Dashboard up, Grafana connected, metrics flowing — good enough. But within a few weeks, the team started complaining: dashboards took 10–15 seconds to load, alerts would fire for a few minutes then self-resolve (flapping), and most critically — at 3 AM I’d get a “CPU high” alert, SSH into the server, and find… CPU completely normal.
The problem wasn’t Prometheus. It was how I was using it — specifically two features I’d completely ignored: Recording Rules and properly written Alerting Rules.
Why dashboards are slow and alerts are unreliable
Prometheus stores raw metrics as time series. Every time Grafana renders a panel, it fires a PromQL query at Prometheus — that query may scan millions of data points across your selected time range.
For example, a “5-minute average CPU Usage” panel using this query:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Looks simple enough — but with 50 servers and 20 panels on a dashboard, every Grafana refresh triggers 20 simultaneous queries, each scanning the full time series for the selected time range. Dashboard lag is inevitable. If you’re designing complex dashboards, mastering Grafana panels, variables, and annotations becomes essential for keeping things performant.
Alert flapping is a different problem: when alert conditions are evaluated directly against raw metrics with no time buffer, a 2-second CPU spike is enough to fire an alert. CPU returns to normal → alert clears → another spike → alert fires again. You get woken up at midnight over a completely harmless short-lived spike.
Recording Rules: Pre-compute once, query fast
The idea behind Recording Rules is straightforward: instead of having Grafana recompute everything from scratch on each refresh, you tell Prometheus to evaluate the expression on a schedule (the global evaluation_interval, which defaults to 1 minute) and store the result as a new time series. Grafana then reads that precomputed series directly, which is far cheaper than re-aggregating the raw samples on every refresh.
Recording Rules file structure
Create the file /etc/prometheus/rules/recording_rules.yml:
groups:
  - name: node_cpu_recording
    interval: 1m # Recalculate every 1 minute
    rules:
      # 5-minute average CPU usage per instance
      - record: job:node_cpu_usage:avg5m
        expr: |
          100 - (
            avg by (instance, job) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          )
      # Memory usage ratio (0 to 1) per instance
      - record: job:node_memory_usage:ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            / node_memory_MemTotal_bytes
          )
      # Aggregated disk read/write throughput
      - record: job:node_disk_throughput:rate5m
        expr: |
          sum by (instance, job) (
            rate(node_disk_read_bytes_total[5m])
            + rate(node_disk_written_bytes_total[5m])
          )
On naming: Prometheus’s official convention is level:metric:operations. For example, job:node_cpu_usage:avg5m reads as “aggregated at the job level, metric cpu usage, operation avg over 5 minutes”. Strictly, the level should name the labels left on the output, so rules that keep the instance label, as the ones above do, would carry an instance: prefix under the convention. The convention is not enforced, but it is worth following, especially as the team grows and you need to look things up quickly. These metrics come from Node Exporter collecting Linux system metrics, so make sure that’s properly configured before writing Recording Rules against them.
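To make the level prefix concrete, here is a minimal sketch of two layered rules. The group name and both rule names are illustrative and are not used anywhere else in this article; the only point is that the prefix changes when the set of output labels changes.

```yaml
groups:
  - name: node_cpu_levels_example   # hypothetical group, for illustration only
    rules:
      # Output keeps the instance and job labels, so the level prefix is "instance"
      - record: instance:node_cpu_usage_example:avg5m
        expr: |
          100 - (
            avg by (instance, job) (
              rate(node_cpu_seconds_total{mode="idle"}[5m])
            ) * 100
          )
      # Aggregates the rule above across instances; only the job label remains,
      # so the level prefix becomes "job"
      - record: job:node_cpu_usage_example:avg5m
        expr: avg by (job) (instance:node_cpu_usage_example:avg5m)
```

Rules inside a group are evaluated in the order they are listed, which is why the job-level rule is placed after the instance-level rule it reads from.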
Register the rules file in prometheus.yml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 1m # How often to evaluate rules
rule_files:
  - "/etc/prometheus/rules/*.yml"
Validate rules before reloading
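# Validate the rules file with promtool before touching the running server
promtool check rules /etc/prometheus/rules/recording_rules.yml
# Reload Prometheus to pick up new rules (zero downtime, no restart).
# The /-/reload endpoint only works if Prometheus was started with --web.enable-lifecycle;
# otherwise send SIGHUP to the Prometheus process instead.
curl -X POST http://localhost:9090/-/reload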
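promtool check rules only validates syntax. To check that a rule computes what you expect, promtool also has a test rules subcommand that replays synthetic series through a rule file. A minimal sketch for the memory rule above, assuming a hypothetical test file path and made-up instance and job label values:

```yaml
# /etc/prometheus/rules/tests/recording_rules_test.yml (hypothetical path)
# Kept in a subdirectory so the rule_files glob in prometheus.yml does not load it as a rule file.
rule_files:
  - /etc/prometheus/rules/recording_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: 2000 bytes available out of 8000 total, held constant
    input_series:
      - series: 'node_memory_MemAvailable_bytes{instance="web-1", job="node"}'
        values: '2000x10'
      - series: 'node_memory_MemTotal_bytes{instance="web-1", job="node"}'
        values: '8000x10'
    promql_expr_test:
      - expr: job:node_memory_usage:ratio
        eval_time: 5m
        exp_samples:
          # 1 - 2000/8000 = 0.75
          - labels: 'job:node_memory_usage:ratio{instance="web-1", job="node"}'
            value: 0.75
```

Run it with promtool test rules /etc/prometheus/rules/tests/recording_rules_test.yml. The other rules in the file simply produce no output here, because the test supplies no CPU or disk series.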
After applying Recording Rules, our dashboard load time dropped from 15 seconds to under 2 seconds. No server upgrades, no Prometheus tuning — just using a built-in feature correctly.
Alerting Rules: Alert at the right time, to the right people
Alerting Rules don’t store results like Recording Rules — they evaluate conditions and push to Alertmanager when conditions are met. The key to eliminating flapping lies in the for parameter. It sounds simple, but this is the most commonly overlooked piece I’ve seen.
Standard Alerting Rules structure
# /etc/prometheus/rules/alerting_rules.yml
groups:
  - name: node_alerts
    rules:
      # Only alert after CPU has been high for 10 consecutive minutes (prevents flapping)
      - alert: HighCPUUsage
        expr: job:node_cpu_usage:avg5m > 85
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: |
            Instance {{ $labels.instance }} has been above 85% CPU usage
            for 10 minutes. Check which processes are consuming heavy resources.
          runbook_url: "https://wiki.internal/runbooks/high-cpu"
      # RAM nearly full: critical because it directly impacts services
      - alert: HighMemoryUsage
        expr: job:node_memory_usage:ratio > 0.90
        for: 5m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "RAM nearly exhausted on {{ $labels.instance }}"
          # $value is a ratio (0 to 1), so humanizePercentage renders 0.93 as 93%
          description: |
            Memory usage is at {{ $value | humanizePercentage }} on {{ $labels.instance }}.
            The system may start swapping or OOM-killing processes.
      # Disk filling up: use predict_linear for early warning
      - alert: DiskWillFillSoon
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk / will fill within 24h on {{ $labels.instance }}"
          description: |
            Based on the growth rate over the past 6 hours, disk {{ $labels.mountpoint }}
            on {{ $labels.instance }} will run out of space within 24 hours.
Key parameter breakdown
- for: 10m means the condition must hold continuously for 10 minutes before an alert fires. A 2-second CPU spike doesn’t qualify → no alert. This is the simplest and most effective way to eliminate flapping.
- labels.severity is the label Alertmanager uses for routing: warning sends a Slack message during business hours, critical calls PagerDuty at 3 AM. Getting the classification right from the start prevents on-call burnout. For a full walkthrough of routing alerts to the right channels, see the guide to configuring SMS and Telegram alerts with Alertmanager.
- annotations.description uses templates with {{ $labels.instance }} and {{ $value }} to embed the actual numbers directly in the message. The on-call engineer reads the alert and immediately understands the issue, no SSH required.
- predict_linear() projects based on the growth rate over the past 6 hours instead of alerting when the disk is already full (too late to react). You get 24 hours of warning, enough time to clean up logs or expand the volume.
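The severity-based routing described above lives in Alertmanager's own configuration, not in the rule files. A minimal sketch, assuming hypothetical receiver names, a placeholder Slack webhook and channel, and a placeholder PagerDuty key; the business-hours restriction is left out to keep it short:

```yaml
# alertmanager.yml (sketch; receiver names and credentials are placeholders)
route:
  receiver: slack-infra            # default receiver: anything not matched below
  group_by: ['alertname', 'instance']
  routes:
    # Alerts labelled severity=critical page the on-call engineer
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall

receivers:
  - name: slack-infra
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
        channel: '#infra-alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'REPLACE_WITH_INTEGRATION_KEY'
```

The matchers syntax needs Alertmanager 0.22 or newer. The file can be validated with amtool check-config alertmanager.yml before it is loaded, in the same spirit as promtool for rule files.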
Combining Recording Rules with Alerting Rules
A pattern I use consistently: Recording Rules compute complex metrics, Alerting Rules consume those metrics for alerting. It’s both fast and consistent — dashboards and alerts use the exact same number, so you’ll never see a dashboard showing 40% CPU while an alert screams 90%.
groups:
  - name: http_slo_recording
    rules:
      # HTTP 5xx error rate over 5 minutes
      - record: job:http_error_rate:ratio5m
        expr: |
          sum by (job, instance) (
            rate(http_requests_total{status=~"5.."}[5m])
          )
          /
          sum by (job, instance) (
            rate(http_requests_total[5m])
          )
  - name: http_slo_alerts
    rules:
      # Alert when error rate exceeds the 1% SLO
      - alert: HTTPErrorRateTooHigh
        expr: job:http_error_rate:ratio5m > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate: {{ $labels.job }}"
          # $value is a ratio (0 to 1), so humanizePercentage renders 0.02 as 2%
          description: |
            Service {{ $labels.job }} is returning {{ $value | humanizePercentage }} 5xx errors.
            SLO allows a maximum of 1%. Check application logs immediately.
Lessons learned from multiple failed deployments
I didn’t get this right on the first try. Here’s the workflow that’s been running stably in production for several months:
- Separate files per group: recording_node.yml, alerting_node.yml, alerting_http.yml… When one file breaks, you only roll back that file without affecting the rest.
- Run promtool check rules in CI/CD before merging pull requests: catch syntax errors early, not at deploy time.
- Set for to at least 5 minutes for warnings, 2–3 minutes for critical: enough to filter out short spikes without delaying response to real incidents.
- Use Recording Rules for any query that appears more than twice, whether in dashboards or alert rules. The DRY principle applies fully to PromQL.
- Always include a runbook_url in annotations: link directly to the incident response documentation. A new on-call engineer who just joined the team will know exactly what to do without pinging someone at 2 AM.
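As one way to wire the promtool check into CI, here is a sketch of a GitHub Actions workflow. The CI system, the rules/ directory in the repository, and the workflow name are all assumptions; the file names are the ones from the list above.

```yaml
# .github/workflows/prometheus-rules.yml (hypothetical)
name: prometheus-rules
on:
  pull_request:
    paths:
      - 'rules/**'

jobs:
  check-rules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The official prom/prometheus image ships promtool, so nothing is installed on the runner
      - name: Validate rule files
        run: |
          docker run --rm \
            -v "$PWD/rules:/rules:ro" \
            --entrypoint promtool \
            prom/prometheus:latest \
            check rules /rules/recording_node.yml /rules/alerting_node.yml /rules/alerting_http.yml
```

In practice it is probably worth pinning the image tag to the Prometheus version running in production, so CI validates against the same rule parser.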
Before this, every incident meant SSH-ing into servers one by one to investigate. Now alerts arrive in Telegram with all the information needed — often without opening a laptop. But more valuable than faster response time is trustworthy alerts. The team no longer wonders “is this just flapping?” — every alert that fires means there’s a real issue to address.
