Monitoring HAProxy with Prometheus & Grafana: Don’t Wait for a Crash to Act

Monitoring tutorial - IT technology blog

Why Running a Load Balancer Without Monitoring is a Death Wish

It’s 2 AM, and my phone starts vibrating uncontrollably. My boss reports: “The site is down, check it now!” I frantically SSH into each backend server, but the services are all still running. Only after digging through the HAProxy logs do I realize that one server is hanging because of RAM issues, so every request sent there returns a 504 Gateway Timeout. It took over an hour to find the root cause, all because I lacked a proper dashboard.

Operating HAProxy without monitoring is like driving a truck at night with the headlights off. You know the truck is moving, but you don’t know if the engine is overheating at 100 degrees or if a tire is about to blow. We need data that speaks: What is the Requests Per Second (RPS)? Which server is “on life support”? Is the response time stable at 50ms or has it spiked to 5s?

In this article, I will guide you through setting up the HAProxy – Prometheus – Grafana trio so you never have to “search for a needle in a haystack” when an incident occurs.

Operating Model: Exporter – Prometheus – Grafana

HAProxy provides raw data via its stats page, but it doesn’t store history. To get time-series charts, we need a standard workflow:

  • HAProxy Exporter: Acts as the “translator.” It fetches data from HAProxy and converts it into a format Prometheus can understand. Since version 2.0 the exporter can be built into HAProxy itself, which is the setup used in this article.
  • Prometheus: The central brain. Every 15 seconds, it “checks in” with the Exporter to pull the latest data and store it in its time-series database.
  • Grafana: The visualization layer. It queries data from Prometheus to draw intuitive charts, helping you spot anomalies at a glance.
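
To make the “translator” role a bit more concrete, here is a minimal Python sketch that reads the kind of text an exporter emits. The sample lines, the proxy names, and the helper name parse_metrics are my own illustrative assumptions, and the parser only handles simple `name{labels} value` lines rather than the full exposition format.

```python
import re

# Hypothetical sample of what an exporter endpoint might return.
SAMPLE = """\
# HELP haproxy_frontend_bytes_in_total Total number of request bytes.
# TYPE haproxy_frontend_bytes_in_total counter
haproxy_frontend_bytes_in_total{proxy="web"} 1024
haproxy_frontend_bytes_in_total{proxy="stats-monitoring"} 256
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple sketch cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Prometheus does essentially this on every scrape, then stores each value with a timestamp so Grafana can chart it over time.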

Step-by-Step Configuration

Step 1: Enable Metrics on HAProxy

From version 2.0 onwards, HAProxy includes a built-in Prometheus exporter, provided your binary was compiled with it (the output of haproxy -vv shows whether it is). Open the /etc/haproxy/haproxy.cfg file and add the following configuration:

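frontend stats-monitoring
    mode http                  # needed unless your defaults section already sets it
    bind *:8404
    # Serve Prometheus metrics at /metrics
    http-request use-service prometheus-exporter if { path /metrics }
    # Keep the human-readable stats page at /stats
    stats enable
    stats uri /stats
    stats refresh 10s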

Check the syntax and apply the new configuration:

haproxy -c -f /etc/haproxy/haproxy.cfg
systemctl reload haproxy

Now, access http://<Server-IP>:8404/metrics. If you see a long list of lines like haproxy_frontend_bytes_in_total, congratulations: you’ve finished the hardest part.
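
If you would rather script this check than eyeball the page, something like the following Python sketch could work. The URL, the helper names, and the short list of expected metric names are assumptions for illustration; adjust them to whatever your HAProxy version actually exposes.

```python
import urllib.request

# Assumed address; replace with your HAProxy host.
METRICS_URL = "http://127.0.0.1:8404/metrics"

# A few metric names I would expect to see; treat this list as an example.
EXPECTED = [
    "haproxy_frontend_bytes_in_total",
    "haproxy_backend_http_responses_total",
    "haproxy_server_status",
]


def fetch_metrics(url=METRICS_URL, timeout=5):
    """Download the raw metrics text from the exporter endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def missing_metrics(text, expected=EXPECTED):
    """Return the expected metric names that never appear as a sample line."""
    seen = set()
    for line in text.splitlines():
        if line and not line.startswith("#"):
            # The metric name ends at the first '{' or space.
            seen.add(line.split("{", 1)[0].split(" ", 1)[0])
    return [name for name in expected if name not in seen]
```

Against a live endpoint, `missing_metrics(fetch_metrics())` should come back as an empty list when all the expected metrics are being served.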

Step 2: Connect Prometheus to HAProxy

Open the prometheus.yml configuration file. We need to tell Prometheus where to fetch the data from. Add the following configuration to the scrape_configs section:

scrape_configs:
  - job_name: 'haproxy_production'
    scrape_interval: 15s
    static_configs:
      - targets: ['<IP-HAProxy>:8404']

After restarting Prometheus, go to Status -> Targets. If the haproxy_production job shows a green UP status, data has begun flowing into your storage.
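
The same check can be done without the web UI by asking Prometheus's HTTP API for the `up` series of this job. The sketch below only parses a response body; the sample JSON, the instance address 10.0.0.5:8404, and the helper name target_states are hypothetical, so treat it as a starting point rather than a finished health check.

```python
import json

# Hypothetical body returned by:
#   GET /api/v1/query?query=up{job="haproxy_production"}
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "up", "instance": "10.0.0.5:8404", "job": "haproxy_production"},
        "value": [1700000000.0, "1"]
      }
    ]
  }
}
"""


def target_states(body):
    """Map each scraped instance to True (up) or False (down)."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    states = {}
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "unknown")
        # Instant-vector values arrive as [timestamp, "string value"].
        states[instance] = series["value"][1] == "1"
    return states


if __name__ == "__main__":
    print(target_states(SAMPLE_RESPONSE))
```

A value of "1" means the last scrape succeeded; "0" means Prometheus could not reach the target.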

Step 3: Set Up a Professional Grafana Dashboard

Don’t waste time building charts from scratch. The community has already created excellent dashboards. I recommend using Dashboard ID 12633 (for the built-in exporter) or 367 (for the standalone exporter).

  1. In the Grafana interface, select the plus (+) icon -> Import (in newer Grafana versions: Dashboards -> New -> Import).
  2. Enter ID 12633 into the “Import via grafana.com” field.
  3. For the Data Source, select the Prometheus instance you configured.
  4. Click Import. A full set of charts, from CPU and RAM to latency, will appear immediately.

4 “Vital” Metrics You Must Monitor

When looking at the dashboard, ignore the clutter and focus on these four parameters:

1. Backend Status: This indicator should always be green. If a backend server turns red, HAProxy will shift its entire load to the remaining servers. If they cannot absorb the extra traffic and nobody steps in, they fail one after another and the whole system can crash in a domino effect.
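
A quick back-of-the-envelope calculation shows why a single red server matters. This Python sketch assumes traffic is balanced evenly across healthy servers, and the 3,000 RPS figure and pool size are made-up numbers.

```python
def load_per_server(total_rps, healthy_servers):
    """Requests per second each healthy server receives under even balancing."""
    if healthy_servers <= 0:
        raise ValueError("no healthy servers left to take traffic")
    return total_rps / healthy_servers


# Hypothetical numbers: 3,000 RPS spread over a pool of four servers.
TOTAL_RPS = 3000

for healthy in (4, 3, 2):
    print(f"{healthy} healthy -> {load_per_server(TOTAL_RPS, healthy):.0f} RPS each")
```

Each lost server pushes the survivors closer to their own limit, which is exactly how the domino effect starts.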

2. HTTP Error Rate (4xx & 5xx): Normally, the 5xx error rate should be near 0%. If it spikes above 1%, that usually points to application bugs or an overloaded database. A sudden surge in 4xx errors is often a sign of a brute-force attack or a vulnerability scan.
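
The error rate is not stored as a percentage anywhere; it has to be derived from counters that only ever go up, such as haproxy_backend_http_responses_total with its code label. Here is a minimal Python sketch of that arithmetic, using two made-up counter snapshots and a helper name (error_rate_percent) of my own.

```python
def error_rate_percent(before, after):
    """Percentage of 5xx responses between two counter snapshots.

    Each snapshot maps a response class ("2xx", "5xx", ...) to the
    cumulative counter value read at that moment.
    """
    deltas = {code: after[code] - before.get(code, 0) for code in after}
    total = sum(deltas.values())
    if total <= 0:
        # No traffic, or counters went backwards (for example after a restart).
        return 0.0
    return 100.0 * deltas.get("5xx", 0) / total


# Hypothetical counter readings taken one scrape apart.
earlier = {"2xx": 10_000, "4xx": 200, "5xx": 10}
later = {"2xx": 10_950, "4xx": 220, "5xx": 40}

rate = error_rate_percent(earlier, later)
print(f"5xx rate: {rate:.1f}%")
```

With these sample numbers, 30 of the 1,000 new responses are 5xx, so the rate is 3%, well above the 1% line.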

3. Maxconn & Session Usage: Every server has a capacity limit. Treat Active Sessions reaching 80% of the maxconn limit as a warning: once sessions hit maxconn, new requests wait in a queue and users start experiencing slow page loads.
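
The 80% check is simple division of current sessions by the configured limit (the built-in exporter reports these per server, for example as haproxy_server_current_sessions and haproxy_server_limit_sessions). The server names and readings in this Python sketch are hypothetical.

```python
WARN_RATIO = 0.80  # the 80% threshold mentioned above


def session_usage(current, limit):
    """Return current sessions as a fraction of the configured limit."""
    if limit <= 0:
        return 0.0  # no limit configured, so there is no ratio to report
    return current / limit


# Hypothetical per-server readings: (current sessions, maxconn limit).
servers = {"app1": (120, 200), "app2": (170, 200)}

for name, (current, limit) in servers.items():
    usage = session_usage(current, limit)
    flag = "WARN" if usage >= WARN_RATIO else "ok"
    print(f"{name}: {usage:.0%} {flag}")
```

In a real setup the same comparison would more likely live in a Grafana threshold or an alert rule, but the arithmetic is the same.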

4. Response Time (P99 Latency): Don’t just look at the average; look at the P99. If the P99 is 2s, 1% of requests take 2 seconds or longer, and an average hides them completely. This is the most realistic metric for what your slowest-served customers actually experience.
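
A small worked example shows how far the average and the P99 can drift apart. The latencies below are invented, and the percentile helper uses the simple nearest-rank method, which is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in seconds: 98 fast requests and 2 slow ones.
latencies = [0.05] * 98 + [2.0] * 2

average = sum(latencies) / len(latencies)
p99 = percentile(latencies, 99)
print(f"average={average:.3f}s p99={p99:.1f}s")
```

The average comes out under 100 ms and looks healthy, while the P99 sits at a full 2 seconds.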

Conclusion

Monitoring isn’t just for aesthetics or reporting to your boss. It’s a tool that helps you sleep better. Instead of waiting for users to complain, you will be the first to know about an incident and resolve it. If you are running a Load Balancer for a real-world project, take 15 minutes to set up this toolkit today. Accurate data is the key to optimizing your system effectively.
