RabbitMQ Monitoring Mastery: Don’t Wait for a Cluster Crash to Troubleshoot – ITFROMZERO

The 2 AM Story: When the System Crashes Due to a Clogged Queue

The relentless ringing of PagerDuty in the middle of the night is every operations engineer’s nightmare. One time, our payment system suddenly froze. Orders weren’t hitting the database, and logs were filled with Gateway Timeout errors. After 30 minutes of sweating while SSHing into each node to run rabbitmqctl list_queues, I was shocked: a single queue had a backlog of over 1.5 million messages. This volume consumed all 8GB of RAM, causing the entire Cluster to hang.

Before we had proper monitoring, I often had to “firefight” like that. Whenever an error occurred, I’d run commands by hand on every node. Now a glance at the Grafana dashboard tells me which worker is slow or which queue is bloating. If you want to sleep soundly, let’s deploy the Prometheus and Grafana duo to control RabbitMQ proactively.

Quick Deployment in 5 Minutes

Since version 3.8, RabbitMQ has a built-in endpoint for Prometheus. You no longer need to install third-party exporters as before, reducing a potential failure point in your architecture.

Step 1: Enable the Internal Plugin

Enable the monitoring plugin using the following command directly on the RabbitMQ server:

rabbitmq-plugins enable rabbitmq_prometheus

Once activated, RabbitMQ serves metrics on port 15692 by default. Use the curl command to verify that data is being exported:

curl http://localhost:15692/metrics

If the screen displays lines like rabbitmq_queue_messages_ready followed by specific numbers, you’ve successfully completed the first step.
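The same check can be scripted if you prefer it to eyeballing curl output. The following is a minimal sketch, not an official client: parse_metrics and fetch_metrics are hypothetical helper names, the URL is the one used above, and the parser assumes plain "name value" lines without trailing timestamps.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format lines into {series: value}.

    The series key keeps its label set, e.g. 'name{label="x"}'.
    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field
        # (no trailing timestamp on the line).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url="http://localhost:15692/metrics"):
    """Download and parse the metrics page (needs a reachable node)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Offline sample so the parser can be tried without a broker.
SAMPLE = """\
# TYPE rabbitmq_queue_messages_ready gauge
rabbitmq_queue_messages_ready 1500000
rabbitmq_queue_messages_unacked 42
"""

ready = parse_metrics(SAMPLE)["rabbitmq_queue_messages_ready"]
```

Against a live node, fetch_metrics()["rabbitmq_queue_messages_ready"] should return the total number of ready messages, provided the plugin is running with its default aggregated output.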

Step 2: Register with Prometheus

Add the following configuration to your prometheus.yml file to start periodic data collection (scraping):

scrape_configs:
  - job_name: 'rabbitmq-production'
    scrape_interval: 15s # Too frequent consumes resources, too sparse misses incidents
    static_configs:
      - targets: ['10.0.1.50:15692'] # Replace with your server IP

Step 3: Set up the Grafana Dashboard

Instead of manually creating every chart, use the standard template provided by the RabbitMQ team:

Access Grafana and select Dashboards > Import.
Enter ID 10991 (the RabbitMQ-Overview dashboard published by the RabbitMQ team).
Select the Prometheus Data Source you just configured and click Import.

4 “Vital” Metrics to Watch Closely

The dashboard contains many parameters, but when the system is under high load, focus on these critical metrics:

1. Message Status: Ready vs. Unacknowledged

Ready: Messages waiting in the queue. If this number exceeds 100,000 (depending on your RAM configuration), it’s a sign that consumers are processing too slowly.
Unacknowledged: Messages delivered to a consumer but not yet acknowledged. If this metric stays high, the worker code is likely hanging or hitting a logic error that prevents it from finishing the task and sending the ack.

2. Message Throughput (Publish vs. Deliver Rate)

This chart represents the system’s balance. If the Publish rate stays higher than the Deliver rate for 10-15 minutes, the backlog keeps growing and the queue will eventually overflow. In a healthy system, the two lines closely track each other.
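To see why a small gap between the two rates matters, here is a rough back-of-the-envelope sketch. The function name and the sample rates are illustrative assumptions, and it treats both rates as constant, which real traffic is not.

```python
def seconds_until_backlog(current_ready, publish_rate, deliver_rate, limit):
    """Estimate seconds until the ready count reaches `limit`.

    Rates are in messages per second. Returns 0.0 if the backlog is
    already at or over the limit, and None if the queue is not growing.
    """
    if current_ready >= limit:
        return 0.0
    growth = publish_rate - deliver_rate  # net messages added per second
    if growth <= 0:
        return None
    return (limit - current_ready) / growth


# Illustrative numbers: 1,200 msg/s published, 1,000 msg/s delivered,
# 10,000 messages already waiting, and a 100,000-message threshold.
eta = seconds_until_backlog(10_000, 1_200, 1_000, 100_000)
print(eta)  # 450.0 seconds, i.e. 7.5 minutes
```

With these numbers a surplus of only 200 messages per second reaches the threshold in under eight minutes, which is why a persistent gap deserves attention long before the queue is actually full.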

3. File Descriptors (FD)

Every connection from an application to RabbitMQ consumes a file descriptor. On Linux the default per-process open-file limit is often 1024, which is too low for a messaging system. Make sure your dashboard shows FD usage below 80% so that RabbitMQ does not suddenly start rejecting new connections.
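The 80% rule is simple enough to express in a few lines. This is only a sketch: the function names are made up for illustration, and the two inputs are assumed to come from the rabbitmq_process_open_fds and rabbitmq_process_max_fds series, which you should confirm against your own /metrics output.

```python
def fd_usage_ratio(open_fds, max_fds):
    """Return the fraction of the file descriptor limit currently in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds


def fd_alert(open_fds, max_fds, threshold=0.80):
    """True when usage has reached the threshold (80% by default)."""
    return fd_usage_ratio(open_fds, max_fds) >= threshold


# With the common 1024 limit, 900 open descriptors is already about 88%.
print(fd_alert(900, 1024))    # True
print(fd_alert(900, 65536))   # False after raising the limit
```

The second call shows the usual remedy: raising the open-file limit for the RabbitMQ process rather than trying to shave off connections.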

4. Erlang Memory Alarm

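When the node’s memory use reaches the high watermark (40% of RAM by default on RabbitMQ 3.x, controlled by vm_memory_high_watermark), RabbitMQ raises a memory alarm and blocks publishing connections until memory drops again. Publishers then stall and time out, which applications typically surface as a wave of 500 errors, so don’t let your chart reach this level.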
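To make the 40% figure concrete, the sketch below works out where the alarm would fire on an 8 GB machine like the one in the opening story. The function names are illustrative, the 0.4 default is the value quoted above (check the setting for your own version), and the resident memory figure is an invented example.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def memory_alarm_threshold(total_ram_bytes, watermark=0.4):
    """Bytes of memory use at which the alarm would trigger."""
    return int(total_ram_bytes * watermark)


def memory_headroom(resident_bytes, total_ram_bytes, watermark=0.4):
    """Bytes left before the alarm; negative means it has already fired."""
    return memory_alarm_threshold(total_ram_bytes, watermark) - resident_bytes


threshold = memory_alarm_threshold(8 * GIB)
print(round(threshold / GIB, 2))        # 3.2 (GiB)

# Invented example: the node currently uses 3 GiB of resident memory.
left = memory_headroom(3 * GIB, 8 * GIB)
print(round(left / GIB, 2))             # 0.2 (GiB)
```

In other words, on an 8 GB server publishers can be blocked at roughly 3.2 GB of use, well before the machine itself is out of memory.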

Configuring Automated Alerts via Telegram

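A dashboard only helps while someone is watching it. To be truly hands-off, define alerting rules in Prometheus and let Alertmanager route the notifications to a channel such as Telegram. Here is a sample rule, to be placed under the rules list of a rule group, that fires when a queue backlog exceeds 50,000 messages. Note that the queue and vhost labels used in the annotations are only present when per-queue metrics are scraped; the default aggregated /metrics output does not carry them.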

- alert: RabbitMQQueueBacklog
  expr: rabbitmq_queue_messages_ready > 50000
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Queue {{ $labels.queue }} is congested!"
    description: "There are currently {{ $value }} unprocessed messages on vhost {{ $labels.vhost }}."
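One possible way to get such an alert into Telegram is a small webhook receiver that Alertmanager posts to. The sketch below covers only the message formatting and request construction, and nothing is sent. The function names are hypothetical, the payload shape follows Alertmanager’s webhook format (an alerts list whose items carry status, labels, and annotations), and the bot token and chat ID are placeholders. Recent Alertmanager releases also include a built-in Telegram receiver, which may be simpler if your version has it.

```python
import urllib.parse
import urllib.request


def format_alert(alert):
    """Turn one alert from an Alertmanager webhook payload into text."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "unknown").upper()
    name = labels.get("alertname", "unnamed alert")
    lines = [f"[{status}] {name}"]
    for key in ("summary", "description"):
        if annotations.get(key):
            lines.append(annotations[key])
    return "\n".join(lines)


def build_telegram_request(bot_token, chat_id, text):
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=body, method="POST")


# Example alert shaped like the rule above (values are made up).
alert = {
    "status": "firing",
    "labels": {"alertname": "RabbitMQQueueBacklog", "queue": "orders"},
    "annotations": {"summary": "Queue orders is congested!"},
}
message = format_alert(alert)
request = build_telegram_request("TOKEN", "123", message)  # placeholders
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

Keeping the formatting separate from the network call makes it easy to test the message text without a real bot.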

Real-world Lesson: Don’t Let Disk Space Betray You

RabbitMQ has a very strict mechanism: the Disk Alarm. If free disk space drops below the configured limit (50MB by default), it blocks all publishers. I once encountered a case where the code was fine and RAM was available, but the system couldn’t send messages simply because log files had filled up the disk.

Advice: Always monitor the free disk space on your RabbitMQ servers. Additionally, for systems with thousands of dynamic queues, limit which per-queue metrics you scrape: every queue adds its own set of time series, and that high cardinality can overload Prometheus. Focus only on the most business-critical queues.

Setting up monitoring not only helps you resolve incidents faster but also helps you identify traffic growth trends. From there, you can proactively scale workers before the system “sneezes.” Wishing you peaceful nights with a stable RabbitMQ system!