Self-hosting Grafana OnCall: Professional On-call Scheduling and Escalation for DevOps Teams

Monitoring tutorial - IT technology blog

The Nightmare of “Alert Fatigue” and the Rise of Grafana OnCall

If you’ve ever worked in Ops, you’ve likely been woken at 2 AM by your phone vibrating as more than 50 alerts from Alertmanager flood into Telegram. At that moment, the first question isn’t “What’s the error?” but “Who is on call? Who is handling this incident?” Without a clear assignment process, the result is usually the whole team waking up, or worse, no one doing anything because everyone assumes someone else has it covered.

My monitoring system tracks 15 servers and 40 microservices. This setup helps detect incidents quickly, but as we scaled, I realized just receiving alerts wasn’t enough. I needed a tool for managing on-call schedules and automated escalation if the primary responder didn’t react. The solution I chose was Grafana OnCall.

Previously, OnCall was only available as part of Grafana Cloud. Now, Grafana has open-sourced a self-hosted version. It’s a solid alternative to PagerDuty or Opsgenie, potentially saving you around $20 per user per month while keeping all your data on-premises.

System Requirements and Preparation

To ensure OnCall runs smoothly without lag when processing thousands of events, you should prepare:

  • Linux Server (Ubuntu 22.04 LTS has been the most stable in my experience).
  • Latest version of Docker and Docker Compose.
  • Grafana (version 9.0 or higher).
  • Domain and SSL certificate (mandatory if you want to receive external Webhooks).

Regarding hardware, plan for at least 2 vCPUs, 4GB RAM, and 20GB SSD. OnCall doesn’t run alone; it brings an “army” of supporting services including Redis, PostgreSQL, Celery, and RabbitMQ.

Steps to Install Grafana OnCall Using Docker Compose

The fastest way is to use the official Grafana repo. Simply clone it and adjust a few environment variables.

# Clone official repo
git clone https://github.com/grafana/oncall.git
cd oncall

# Create environment file from template
cp .env.example .env

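Next, open the .env file. This is the most critical step, because these values determine how the system communicates with the outside world. Variable names can differ between releases, so check them against the docker-compose.yml in your checkout: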

# OnCall access domain
EXTERNAL_URL=https://oncall.company.com

# Security key (generate your own random string longer than 32 characters)
SECRET_KEY=replace-with-your-own-random-string-longer-than-32-characters

# Connect to existing Grafana
GRAFANA_API_URL=http://grafana-internal:3000
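
One way to produce a SECRET_KEY that meets the length requirement above is Python's standard secrets module. This is only a minimal sketch; the function name generate_secret_key is made up for illustration, and any cryptographically secure random string longer than 32 characters should work just as well.

```python
import secrets


def generate_secret_key(num_bytes: int = 48) -> str:
    """Return a URL-safe random string.

    48 random bytes encode to exactly 64 base64 characters,
    comfortably above the 32-character minimum.
    """
    return secrets.token_urlsafe(num_bytes)


# Print a line that can be pasted straight into the .env file.
print(f"SECRET_KEY={generate_secret_key()}")
```

Generate the key once and keep it stable; treat it like any other credential and keep it out of version control.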

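After saving the configuration, start the stack:

docker compose up -d

Wait about 2 minutes for the database to initialize. You can check the status with docker compose ps (on older standalone installs the commands are docker-compose up -d and docker-compose ps). If every container shows Up, and Up (healthy) where a health check is defined, you’re halfway there.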
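
If you would rather script the status check than read the table by eye, here is a rough sketch. It assumes Docker Compose v2, whose ps command accepts --format json, and it assumes each entry carries Service, State, and Health fields; verify that against the output of your own Compose release. The helper names parse_compose_ps, not_ready, and check_stack are hypothetical.

```python
import json
import subprocess


def parse_compose_ps(output: str) -> list:
    """Parse the JSON printed by `docker compose ps --format json`.

    Depending on the Compose release, this is either a single JSON array
    or one JSON object per line; both shapes are handled.
    """
    text = output.strip()
    if not text:
        return []
    if text.startswith("["):
        return json.loads(text)
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def not_ready(containers: list) -> list:
    """Return services that are not running or whose health check is not passing."""
    pending = []
    for c in containers:
        running = c.get("State") == "running"
        # Health is empty for containers that define no health check.
        healthy = (c.get("Health") or "") in ("", "healthy")
        if not (running and healthy):
            pending.append(c.get("Service") or c.get("Name") or "unknown")
    return pending


def check_stack() -> list:
    """Run `docker compose ps` in the current directory and list pending services."""
    result = subprocess.run(
        ["docker", "compose", "ps", "--format", "json"],
        capture_output=True, text=True, check=True,
    )
    return not_ready(parse_compose_ps(result.stdout))
```

Call check_stack() from inside the oncall directory; an empty list means every container is running and none reports a failing or still-starting health check.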

Configuring OnCall: From Schedules to Escalation Chains

Upon accessing the OnCall interface integrated into Grafana, the first thing you need to do is set up Users and Teams.

1. Setting up Schedules

OnCall allows for flexible rotating schedules. I usually divide the team into two groups: Primary and Shadow. The drag-and-drop interface is very intuitive. A major plus is that OnCall supports exporting schedules to iCal. Members can sync directly to Google Calendar on their phones to know exactly when they are “on the hook”.

2. Escalation Chains

This is the heart of the system. Here is a practical workflow I apply to production systems:

  • Minute 0: Send a Telegram notification to the primary on-call person.
  • Minute 5: If “Acknowledge” hasn’t been clicked, the system automatically calls via Twilio.
  • Minute 15: Still no response? Automatically ping the entire team and Leader on Slack/Telegram.

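This mechanism greatly reduces the risk of missed alerts caused by oversleeping or a lost internet connection, because each unanswered step hands the alert to another channel or another person.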
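
To reason about a chain before building it in the UI, the workflow above can be modelled in a few lines. This is an illustrative sketch only, not OnCall's internal logic; Step, CHAIN, and fired_steps are hypothetical names, and the rule that a step fires only if nobody acknowledged before its minute is my simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    after_minutes: int  # minutes since the alert fired
    action: str


# The three steps from the workflow described above.
CHAIN = [
    Step(0, "Telegram message to the primary on-call person"),
    Step(5, "Phone call to the primary on-call person via Twilio"),
    Step(15, "Ping the entire team and Leader on Slack/Telegram"),
]


def fired_steps(chain: List[Step], acknowledged_at: Optional[int] = None,
                now: int = 60) -> List[str]:
    """Return the actions that have fired by minute `now`.

    A step fires when its minute has been reached and the alert was not
    acknowledged before that minute.
    """
    fired = []
    for step in sorted(chain, key=lambda s: s.after_minutes):
        reached = step.after_minutes <= now
        still_open = acknowledged_at is None or step.after_minutes < acknowledged_at
        if reached and still_open:
            fired.append(step.action)
    return fired
```

For example, fired_steps(CHAIN, acknowledged_at=3) contains only the Telegram message, while fired_steps(CHAIN) with no acknowledgement walks through all three steps.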

Connecting Telegram for 24/7 Alerts

Telegram is an excellent channel for receiving notifications thanks to its speed and free API. First create a Bot via @BotFather to get a Token, then enter that Token in the Telegram section of the Chat Ops menu in OnCall.

Each member needs to send /start to the Bot to link their account. Afterward, the Bot will send messages with Ack (Acknowledge) and Resolve buttons right below the alert content, so an incident can be handled straight from the chat.

# Example Webhook configuration
Telegram Bot Token: 7123456789:AAH_... 
Telegram Webhook URL: https://oncall.company.com/telegram/webhook
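
Before pasting the token into OnCall, it can help to confirm that Telegram accepts it. The sketch below calls the Bot API's getMe method, which returns basic information about the bot. The function names get_me_url and bot_username are hypothetical, and the token you pass in is your own; nothing here is specific to OnCall.

```python
import json
from urllib.request import urlopen


def get_me_url(token: str) -> str:
    """Build the Telegram Bot API URL for the getMe method."""
    return f"https://api.telegram.org/bot{token}/getMe"


def bot_username(token: str, timeout: float = 10.0) -> str:
    """Return the bot's username if Telegram accepts the token.

    An invalid token makes Telegram answer with an HTTP error, which
    urlopen raises as urllib.error.HTTPError.
    """
    with urlopen(get_me_url(token), timeout=timeout) as resp:
        body = json.load(resp)
    if not body.get("ok"):
        raise RuntimeError(f"Telegram rejected the request: {body}")
    return body["result"]["username"]
```

If bot_username returns the name you chose in @BotFather, the token is valid and any remaining problem is on the webhook or OnCall side.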

Real-world Operational Experience

Don’t blindly trust the system right after installation. I always perform a “simulated incident” test by lowering the CPU alert threshold to 5%. If the message reaches the right person at the right time and follows the escalation steps correctly, only then can you rest easy.
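
An alternative to touching real thresholds is to post a synthetic alert to an OnCall integration. The sketch below assumes you created an Alertmanager-type integration and copied its URL from the OnCall UI; the payload mirrors the main fields of Alertmanager's webhook format, while the alert name, the receiver name, and the function names build_test_alert and send_test_alert are made up for illustration.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def build_test_alert(alertname: str = "SimulatedHighCPU",
                     severity: str = "critical") -> dict:
    """Build a firing alert shaped like an Alertmanager webhook payload."""
    labels = {"alertname": alertname, "severity": severity}
    annotations = {"summary": "Simulated incident to exercise the escalation chain"}
    return {
        "version": "4",
        "status": "firing",
        "receiver": "oncall-test",  # hypothetical receiver name
        "groupLabels": {"alertname": alertname},
        "commonLabels": labels,
        "commonAnnotations": annotations,
        "externalURL": "",
        "alerts": [{
            "status": "firing",
            "labels": labels,
            "annotations": annotations,
            "startsAt": datetime.now(timezone.utc).isoformat(),
            "endsAt": "0001-01-01T00:00:00Z",
            "generatorURL": "",
        }],
    }


def send_test_alert(integration_url: str, timeout: float = 10.0) -> int:
    """POST the test alert to the integration URL and return the HTTP status."""
    data = json.dumps(build_test_alert()).encode("utf-8")
    req = Request(integration_url, data=data,
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Send it once, leave it unacknowledged, and time each escalation step against the chain you configured.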

A few hard-learned lessons for you:

  • Timezone Synchronization: Ensure the server, Grafana, and OnCall are in the same timezone (e.g., Asia/Ho_Chi_Minh). Timezone mismatches will cause schedules to shift unexpectedly.
  • Monitor OnCall itself: Since this is your last line of defense, it needs monitoring too. I typically use a simple external script to check OnCall’s health-check URL every minute.
  • Database Cleanup: OnCall stores a lot of event logs. Set up a policy to automatically delete old data after 30 days to avoid filling up the server disk.
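
For the "monitor OnCall itself" lesson, the external check can be as small as the sketch below. The health-check URL is whatever your deployment exposes, so treat any path you pass in as an assumption to verify; is_up is a hypothetical helper name, and the opener parameter exists only so the function can be tested without a network.

```python
from urllib.request import urlopen


def is_up(url: str, timeout: float = 5.0, opener=urlopen) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout.

    Connection errors, timeouts, and HTTP error statuses all count as down.
    urllib's URLError and HTTPError are subclasses of OSError.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False
```

Run it every minute from cron on a machine outside the OnCall host, and send the failure notice through a channel that does not depend on OnCall, otherwise the watchdog goes down together with the thing it watches.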

Implementing Grafana OnCall has significantly relieved the psychological pressure on my team. Whoever is on call handles it, while the rest can sleep soundly without fear of being called out for no reason in the middle of the night.
