Three in the morning, your phone buzzes. You open it to find 47 Telegram messages from Alertmanager. Skim through — mostly InstanceDown from the same node doing a rolling restart. Not urgent. But your brain still has to process it. The next morning, exhausted, you check back and find 2 genuinely critical alerts buried in all that noise.
This is alert fatigue. I lived with it for two years before deciding I needed to do something serious about it.
Comparing 3 Approaches to Smarter Alert Handling
Approach 1: Rule-based Filtering in Alertmanager
Almost everyone starts here — adding routes, silences, and inhibition rules in alertmanager.yml.
# alertmanager.yml
route:
  receiver: 'slack-warnings'   # default receiver; the root route requires one
  routes:
    - match:
        severity: warning
      receiver: 'slack-warnings'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'
inhibit_rules:
  - source_match:
      alertname: NodeDown
    target_match:
      job: node
    equal: ['instance']
Pros: Simple, no additional infrastructure needed, native Alertmanager support.
Cons: Rules must be written by hand and don’t scale as the system grows. Every new alert pattern means opening that YAML file again. More importantly — it filters but doesn’t explain. SysAdmins still have to read each alert to understand the context.
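To make the inhibition rule above concrete, here is a minimal Python sketch of how such a rule can be evaluated. It is a simplified model for illustration, not Alertmanager's actual implementation: alerts are plain label dicts, matching is exact string equality only, and the function name is_inhibited is my own.

```python
# Simplified model of one Alertmanager inhibit rule (illustrative only).
RULE = {
    "source_match": {"alertname": "NodeDown"},
    "target_match": {"job": "node"},
    "equal": ["instance"],
}

def matches(labels, wanted):
    """True if every wanted label is present with exactly that value."""
    return all(labels.get(k) == v for k, v in wanted.items())

def is_inhibited(target, firing_alerts, rule=RULE):
    """Return True if `target` is muted by some firing source alert."""
    if not matches(target, rule["target_match"]):
        return False
    target_is_source = matches(target, rule["source_match"])
    for source in firing_alerts:
        if not matches(source, rule["source_match"]):
            continue
        # An alert matching both sides must not be muted by alerts that
        # also match both sides (this includes itself).
        if target_is_source and matches(source, rule["target_match"]):
            continue
        # A missing label and an empty label count as equal.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False

node_down = {"alertname": "NodeDown", "job": "node", "instance": "web-1:9100"}
disk_low = {"alertname": "DiskSpaceRunningLow", "job": "node", "instance": "web-1:9100"}
high_load = {"alertname": "HighLoad", "job": "node", "instance": "web-2:9100"}
firing = [node_down, disk_low, high_load]
```

With these sample alerts, disk_low is muted because NodeDown is firing on the same instance, while high_load on a different instance still gets through. The rule decides whether to send, but says nothing about why the alert matters.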
Approach 2: ML-based Anomaly Detection
Tools like Grafana Machine Learning, Metis, or custom-built models using Prophet/LSTM detect anomalies instead of relying on hard thresholds.
Pros: Smarter, learns patterns over time, reduces false positives.
Cons: Complex setup, requires training data, hard to debug when the model produces wrong results. High operational cost. For a team of 2-3 people, this is overkill.
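To illustrate the difference between a hard threshold and learning from recent history, here is a deliberately tiny stand-in: a rolling z-score detector. This is not how Grafana Machine Learning, Prophet, or an LSTM work internally. It only shows the idea of flagging a value because it is unusual relative to its own past. The function name, window size, and threshold are assumptions for this sketch.

```python
from statistics import mean, stdev

def zscore_anomalies(values, window=12, threshold=3.0):
    """Return indexes whose value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Perfectly flat history: the z-score is undefined, so skip the point.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# CPU usage samples hovering around 50%, with one spike to 95%.
cpu = [50, 52, 49, 51, 50, 48, 52, 51, 49, 50, 51, 50, 95, 51]
spikes = zscore_anomalies(cpu)
```

A fixed "CPU > 90%" rule would catch this spike too, but it would miss a jump from 10% to 60% on a normally idle machine, which the history-based check flags. The price is the one listed above: the detector needs enough representative history, and it misbehaves for a while after that history changes.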
Approach 3: LLM-based Summarization (what I currently use)
Instead of filtering or learning patterns, use an LLM to read and understand alerts and answer the question that actually matters: “Does this require me to wake up right now? If so, why?”
Pros: No training data needed. Unlike rule-based approaches, an LLM understands context: it knows that DiskSpaceRunningLow on a database node is far more dangerous than on a log collector node. Prompts are easy to customize with each team’s domain knowledge.
Cons: Adds latency (1-3 seconds per API call) and an API cost, though for a typical mid-sized system’s alert volume the total is just a few dollars per month.
Why I Chose LLM
I tried all three. The rule-based approach ran for 6 months until the alertmanager.yml was nearly 300 lines long and nobody dared touch it. For ML, I tried Grafana ML — results weren’t bad, but after one major deploy that changed alert patterns, the model took nearly 2 weeks to adapt.
The LLM wins at something the other two can’t do: it answers human questions, not machine questions. Instead of just telling you “CPU usage > 90%”, it can say: “Database server CPU has been at 95% for 20 minutes, coinciding with the daily backup job window. Likely normal, but worth checking if the backup isn’t done in the next 2 hours.”
A rule-based setup can’t do this. The data isn’t missing; the rules simply don’t understand semantics.
System Architecture
Simple flow:
Prometheus → Alertmanager → Webhook Receiver (Python) → LLM API → Telegram/Slack
Alertmanager sends alerts to a small Python server via webhook. This server calls the LLM to classify and summarize them, then forwards the “translated” results to the team’s Telegram.
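For orientation, this is roughly what the webhook receiver gets from Alertmanager. The field names follow the version 4 webhook payload. The values are invented sample data, and count_by_status is a hypothetical helper for this sketch, not part of the server shown later.

```python
# Trimmed example of an Alertmanager webhook payload (sample values are made up).
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "receiver": "llm-summarizer",
    "groupLabels": {"alertname": "InstanceDown"},
    "commonLabels": {"alertname": "InstanceDown", "severity": "critical"},
    "commonAnnotations": {},
    "externalURL": "http://alertmanager.example:9093",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "InstanceDown", "instance": "web-1:9100",
                       "severity": "critical"},
            "annotations": {"description": "web-1 has been down for 5 minutes"},
            "startsAt": "2024-01-01T03:00:00Z",
        },
        {
            "status": "resolved",
            "labels": {"alertname": "InstanceDown", "instance": "web-2:9100",
                       "severity": "critical"},
            "annotations": {"description": "web-2 has been down for 5 minutes"},
            "startsAt": "2024-01-01T02:50:00Z",
            "endsAt": "2024-01-01T03:01:00Z",
        },
    ],
}

def count_by_status(payload):
    """Count alerts per status ('firing' / 'resolved') in one webhook call."""
    counts = {}
    for alert in payload.get("alerts", []):
        status = alert.get("status", "firing")
        counts[status] = counts.get(status, 0) + 1
    return counts

summary = count_by_status(SAMPLE_PAYLOAD)
```

The point to notice is that one webhook call carries a whole group of alerts, each with its own status, labels, and annotations. That list is what gets flattened into text for the LLM.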
Practical Deployment Guide
Step 1: Configure the Alertmanager Webhook Receiver
Add a webhook receiver to alertmanager.yml:
receivers:
  - name: 'llm-summarizer'
    webhook_configs:
      - url: 'http://localhost:8080/alert'
        send_resolved: true
        http_config:
          bearer_token: 'your-secret-token'
# Top-level route; in a larger config this can instead be a child entry under route.routes
route:
  receiver: 'llm-summarizer'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 2m
  repeat_interval: 1h
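Important tip: the two grouping timers do different jobs. group_wait: 10s is how long Alertmanager holds a new group before sending its first notification, so alerts that fire within those 10 seconds arrive together. group_interval: 2m then batches alerts that join the same group later, sending at most one update every 2 minutes. Either way, the LLM receives a batch of related alerts instead of individual ones.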
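The interplay of the grouping timers is easier to see with numbers. The sketch below is a simplified model of a single alert group, written for illustration. It ignores repeat_interval and resolved notifications, and notification_times is a name I made up.

```python
def notification_times(arrivals, group_wait, group_interval):
    """Model when one alert group is flushed to the receiver.

    arrivals: seconds at which new alerts join the group.
    Returns a list of (flush_time, number_of_new_alerts_in_that_flush).
    """
    pending = sorted(arrivals)
    if not pending:
        return []
    # The first notification waits group_wait after the group is created.
    flush = pending[0] + group_wait
    sent = []
    while pending:
        batch = [t for t in pending if t <= flush]
        if batch:
            sent.append((flush, len(batch)))
            pending = pending[len(batch):]
        # Later notifications are evaluated once per group_interval.
        flush += group_interval
    return sent

# Three alerts within 8 seconds, then two stragglers.
timeline = notification_times([0, 3, 8, 40, 100], group_wait=10, group_interval=120)
```

In this model the first three alerts go out together at t=10s and the two stragglers go out together at t=130s, so five alerts produce two LLM calls. A longer group_wait catches more of the initial burst at the cost of delaying the first notification.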
Step 2: Python Webhook Server with LLM
Install dependencies:
pip install fastapi uvicorn anthropic httpx
File alert_summarizer.py:
import os
import json
import httpx
import anthropic
from fastapi import FastAPI, Request, HTTPException, Header
from typing import Optional

app = FastAPI()
# Async client, so the LLM call does not block the event loop inside the async handler
client = anthropic.AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
TELEGRAM_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
TELEGRAM_CHAT_ID = os.environ["TELEGRAM_CHAT_ID"]
WEBHOOK_TOKEN = os.environ["WEBHOOK_SECRET_TOKEN"]

SYSTEM_PROMPT = """You are a senior SRE analyzing Prometheus alerts for a production system.
Classify severity: CRITICAL (requires immediate action), WARNING (needs monitoring), INFO (informational).
Provide a concise summary, explain the real-world impact, and suggest the first action to take.
Return JSON format: {"severity": "CRITICAL|WARNING|INFO", "summary": "...", "impact": "...", "action": "..."}"""


def format_alerts_for_llm(alerts: list) -> str:
    """Flatten the Alertmanager alert list into one line of text per alert."""
    lines = []
    for a in alerts:
        labels = a.get("labels", {})
        annotations = a.get("annotations", {})
        status = a.get("status", "firing")
        lines.append(
            f"- [{status.upper()}] {labels.get('alertname', 'Unknown')}"
            f" | instance={labels.get('instance', 'N/A')}"
            f" | severity={labels.get('severity', 'N/A')}"
            f" | {annotations.get('description', annotations.get('summary', ''))}"
        )
    return "\n".join(lines)


async def send_telegram(text: str):
    url = f"https://api.telegram.org/bot{TELEGRAM_TOKEN}/sendMessage"
    async with httpx.AsyncClient(timeout=10) as client_http:
        resp = await client_http.post(url, json={
            "chat_id": TELEGRAM_CHAT_ID,
            "text": text,
            "parse_mode": "Markdown"
        })
        if resp.status_code == 400:
            # Telegram rejects unbalanced Markdown (e.g. underscores in metric
            # names), so retry as plain text rather than lose the alert
            await client_http.post(url, json={"chat_id": TELEGRAM_CHAT_ID, "text": text})


@app.post("/alert")
async def receive_alert(
    request: Request,
    authorization: Optional[str] = Header(None)
):
    # Alertmanager sends the configured bearer token in the Authorization header
    if authorization != f"Bearer {WEBHOOK_TOKEN}":
        raise HTTPException(status_code=401)

    payload = await request.json()
    alerts = payload.get("alerts", [])
    if not alerts:
        return {"status": "no alerts"}

    alert_text = format_alerts_for_llm(alerts)
    response = await client.messages.create(
        model="claude-haiku-4-5-20251001",  # Use Haiku to reduce costs
        max_tokens=512,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": f"Analyze the following alerts:\n{alert_text}"}]
    )

    raw = response.content[0].text if response.content else ""
    try:
        result = json.loads(raw)
        if not isinstance(result, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # The model replied with something other than bare JSON; forward the raw text
        result = {
            "severity": "WARNING",
            "summary": raw or "LLM returned no text",
            "impact": "unknown (response could not be parsed)",
            "action": "Check the raw alerts in Alertmanager",
        }

    severity = result.get("severity", "INFO")
    emoji = {"CRITICAL": "🔴", "WARNING": "🟡", "INFO": "🔵"}.get(severity, "⚪")
    message = (
        f"{emoji} *{severity}* — {len(alerts)} alert(s)\n\n"
        f"*Summary:* {result.get('summary')}\n"
        f"*Impact:* {result.get('impact')}\n"
        f"*Action:* {result.get('action')}"
    )
    await send_telegram(message)
    return {"status": "ok", "severity": severity}
Run the server:
export ANTHROPIC_API_KEY="sk-ant-..."
export TELEGRAM_BOT_TOKEN="..."
export TELEGRAM_CHAT_ID="-100..."
export WEBHOOK_SECRET_TOKEN="your-secret"
uvicorn alert_summarizer:app --host 0.0.0.0 --port 8080
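One practical wrinkle: even when asked for JSON, a model may occasionally wrap the object in a sentence or a code fence. A small best-effort parser makes the server more tolerant of that. This is a sketch of one possible approach, and extract_json is a hypothetical helper, not part of the Anthropic SDK.

```python
import json
import re

def extract_json(text):
    """Best-effort: return the JSON object found in an LLM reply, else None."""
    candidates = [text]
    # Fall back to the outermost {...} span if the reply has extra text around it.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None

clean = extract_json('{"severity": "INFO", "summary": "rolling restart"}')
wrapped = extract_json('Here is my analysis:\n{"severity": "CRITICAL", "summary": "db down"}\nHope this helps.')
garbage = extract_json("I could not analyze these alerts.")
```

The helper could replace the direct json.loads call on the model's reply. When it returns None, forwarding the raw alert text is safer than dropping the notification.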
Step 3: Run as a systemd Service
[Unit]
Description=LLM Alert Summarizer
After=network.target
[Service]
Type=simple
User=prometheus
EnvironmentFile=/etc/alert-summarizer/env
ExecStart=/usr/local/bin/uvicorn alert_summarizer:app --host 0.0.0.0 --port 8080
Restart=always
RestartSec=5
WorkingDirectory=/opt/alert-summarizer
[Install]
WantedBy=multi-user.target
Optimization Tips from Real-World Use
- Use Haiku instead of Sonnet for alert summarization — faster response time (~0.8s vs ~2s), 5x lower cost. With 500 alert events/day, total cost is under $3/month.
- Cache processed alerts: Use Redis or an in-memory dict to store the hash of each alert batch for 30 minutes, avoiding duplicate LLM calls when Alertmanager resends.
- Inject system context into the system prompt: List your most critical services, regular maintenance windows, and normal alert patterns for your team. The LLM will classify much more accurately.
- Set a separate CRITICAL threshold: Only page on-call when severity is CRITICAL. Bundle WARNINGs and INFOs into a morning digest — don’t wake someone up at 3 AM for something that can wait until 9.
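Two of the tips above, caching batches and paging only on CRITICAL, can be sketched in a few lines. This is a minimal in-memory version under my own assumptions: the class and function names are invented, the hash covers only status and labels (so changing timestamps do not defeat the cache), and a multi-process deployment would need Redis or similar instead of a dict.

```python
import hashlib
import json
import time

class BatchCache:
    """Remember alert batches for ttl_seconds to skip duplicate LLM calls."""

    def __init__(self, ttl_seconds=1800, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._seen = {}             # batch hash -> time first seen

    @staticmethod
    def batch_key(alerts):
        # Hash status + labels only, sorted so alert order does not matter.
        canonical = sorted(
            json.dumps({"status": a.get("status"), "labels": a.get("labels", {})},
                       sort_keys=True)
            for a in alerts
        )
        return hashlib.sha256("\n".join(canonical).encode()).hexdigest()

    def seen_recently(self, alerts):
        """Return True for a repeat batch; record and return False for a new one."""
        now = self.clock()
        # Drop entries older than the TTL.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        key = self.batch_key(alerts)
        if key in self._seen:
            return True
        self._seen[key] = now
        return False

def route_for(severity):
    """Page on-call only for CRITICAL; everything else waits for the digest."""
    return "page" if severity == "CRITICAL" else "digest"
```

In the webhook handler, a call to seen_recently(alerts) before the LLM request lets the server return early on repeats. Keep the TTL shorter than repeat_interval if you still want Alertmanager's periodic reminders to reach the team.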
Results After 3 Months of Real-World Use
From 40-60 Telegram messages every night down to 3-5 meaningful ones. Over 90% noise reduction. The team started reading alerts again — instead of reflexively muting the bot.
API cost with Claude Haiku: around $2-3/month for a mid-sized system (~500 alert events/day). Much cheaper than unnecessary sleepless nights.
The biggest change wasn’t the numbers. When there’s a genuinely serious alert, it stands out clearly — no longer buried in noise. And I can sleep.

