Why Running a Load Balancer Without Monitoring is a Death Wish
It’s 2 AM, and my phone starts vibrating uncontrollably. My boss reports: “The site is down, check it now!” I frantically SSH into each backend cluster, but the services are all still running. Only after digging through the HAProxy logs do I realize that one server is hanging due to RAM issues, so every request sent there returns a 504 Gateway Timeout. It took over an hour to find the root cause, all because I lacked a proper dashboard.
Operating HAProxy without monitoring is like driving a truck at night with the headlights off. You know the truck is moving, but you don’t know if the engine is overheating at 100 degrees or if a tire is about to blow. We need data that speaks: What is the Requests Per Second (RPS)? Which server is “on life support”? Is the response time stable at 50ms or has it spiked to 5s?
In this article, I will guide you through setting up the HAProxy – Prometheus – Grafana trio so you never have to “search for a needle in a haystack” when an incident occurs.
Operating Model: Exporter – Prometheus – Grafana
HAProxy provides raw data via its stats page, but it doesn’t store history. To get time-series charts, we need a standard workflow:
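- HAProxy Exporter: Acts as the “translator.” It fetches data from HAProxy and converts it into a format Prometheus can understand. Since HAProxy 2.0 the exporter can be built into HAProxy itself, which is the variant this guide uses.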
- Prometheus: The central brain. Every 15 seconds, it “checks in” with the Exporter to pull the latest data and store it in the database.
- Grafana: The visualization layer. It queries data from Prometheus to draw intuitive charts, helping you spot anomalies in 3 seconds.
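To make the "history" part concrete: most of what HAProxy exposes are counters that only ever grow, and a number like RPS comes from comparing two scrapes. The sketch below is a simplified illustration of that idea, not how Prometheus implements its own rate calculation (which also extrapolates at window boundaries). The function name and the sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the given scrapes."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter that goes down means the process restarted and began
        # counting from zero again, so the new value is the whole increase.
        increase += curr if curr < prev else curr - prev
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


# Hypothetical scrapes of a request counter, taken 15 seconds apart.
scrapes = [(0.0, 1000.0), (15.0, 1750.0), (30.0, 2500.0)]
print(per_second_rate(scrapes))  # 50.0
```

Without stored samples there is nothing to subtract, which is why the stats page alone cannot tell you what RPS looked like ten minutes ago.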
Step-by-Step Configuration
Step 1: Enable Metrics on HAProxy
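From version 2.0 onwards, HAProxy can export Prometheus metrics natively, provided your binary was compiled with the exporter (run haproxy -vv and look for prometheus-exporter in the list of available services). Open the /etc/haproxy/haproxy.cfg file and add the following configuration: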
frontend stats-monitoring
    bind *:8404
    http-request use-service prometheus-exporter if { path /metrics }
    stats enable
    stats uri /stats
    stats refresh 10s
Check the syntax and apply the new configuration:
haproxy -c -f /etc/haproxy/haproxy.cfg
systemctl reload haproxy
Now, access http://<Server-IP>:8404/metrics. If you see a long list of lines like haproxy_frontend_bytes_in_total, congratulations—you’ve finished the hardest part.
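If you would rather script this check than eyeball it in a browser, here is a minimal sketch. It assumes the endpoint returns the usual Prometheus text format, where comment lines start with # and each sample line starts with the metric name. The helper names metric_families and fetch_metrics are my own, and the sample text is a hypothetical, heavily trimmed response.

```python
import urllib.request
from typing import Set


def metric_families(exposition_text: str, prefix: str = "haproxy_") -> Set[str]:
    """Collect the distinct metric names that start with the given prefix."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        # The metric name ends at the first "{" (labels) or at whitespace.
        name = line.split("{", 1)[0].split()[0]
        if name.startswith(prefix):
            names.add(name)
    return names


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics text, e.g. from http://<Server-IP>:8404/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical, trimmed sample of what the endpoint might return.
sample = """\
# TYPE haproxy_frontend_bytes_in_total counter
haproxy_frontend_bytes_in_total{proxy="stats-monitoring"} 1234
haproxy_process_nbthread 4
"""
print(sorted(metric_families(sample)))
```

Against a live instance you would call metric_families(fetch_metrics("http://<Server-IP>:8404/metrics")) and treat an empty result as a sign that the exporter is not wired up.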
Step 2: Connect Prometheus to HAProxy
Open the prometheus.yml configuration file. We need to tell Prometheus where to fetch the data from. Add the following configuration to the scrape_configs section:
scrape_configs:
  - job_name: 'haproxy_production'
    scrape_interval: 15s
    static_configs:
      - targets: ['<IP-HAProxy>:8404']
After restarting Prometheus, go to Status -> Targets. If the haproxy_production job shows a green UP status, data has begun flowing into your storage.
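The same check can be automated. As far as I know, Prometheus serves the data behind the Targets page as JSON under /api/v1/targets, with one entry per target carrying its labels, a health field, and a lastError field. Treat those field names as my reading of the API and compare them with your own instance. The sketch below only parses such a payload; the function name and the sample response (including its IP addresses) are hypothetical.

```python
import json
from typing import Dict, List


def unhealthy_targets(targets_json: str, job: str) -> List[Dict[str, str]]:
    """Return the targets of one scrape job whose health is not "up"."""
    payload = json.loads(targets_json)
    problems = []
    for target in payload.get("data", {}).get("activeTargets", []):
        labels = target.get("labels", {})
        if labels.get("job") != job:
            continue
        if target.get("health") != "up":
            problems.append({
                "instance": labels.get("instance", "unknown"),
                "health": target.get("health", "unknown"),
                "last_error": target.get("lastError", ""),
            })
    return problems


# Hypothetical, trimmed response body for illustration only.
sample_response = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "haproxy_production", "instance": "10.0.0.5:8404"},
         "health": "up", "lastError": ""},
        {"labels": {"job": "haproxy_production", "instance": "10.0.0.6:8404"},
         "health": "down", "lastError": "connection refused"},
    ]},
})
print(unhealthy_targets(sample_response, "haproxy_production"))
```

An empty list means every target of the job is being scraped successfully.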
Step 3: Set Up a Professional Grafana Dashboard
Don’t waste time building charts from scratch. The community has already created excellent dashboards. I recommend using Dashboard ID 12633 (for the built-in version) or 367.
- In the Grafana interface, select the plus (+) icon -> Import.
- Enter ID 12633 into the “Import via grafana.com” field.
- For the Data Source, select the Prometheus instance you configured.
- Click Import. A full system of charts—from CPU and RAM to Latency—will appear immediately.
4 “Vital” Metrics You Must Monitor
When looking at the dashboard, ignore the clutter and focus on these four parameters:
1. Backend Status: This indicator should always be green. If a backend server turns red, HAProxy will shift the entire load to the remaining servers. If not handled promptly, the entire system could crash due to a domino effect.
2. HTTP Error Rate (4xx & 5xx): Normally, the 5xx error rate should be near 0%. If it climbs above 1%, that usually points to application errors or an overloaded database. A sudden surge in 4xx errors often indicates a brute-force attack or a vulnerability scan, although a broken link or a misbehaving client can cause one too.
3. Maxconn & Session Usage: Every server has a capacity limit. If Active Sessions reach 80% of the maxconn limit, you are close to saturation: once the limit is hit, new requests wait in a queue and users start experiencing slow page loads.
4. Response Time (P99 Latency): Don’t just look at the Average. Look at the P99. If the P99 is 2s, it means 1% of your requests take 2 seconds or longer. This is the most realistic metric reflecting the actual customer experience.
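The arithmetic behind three of these vitals is simple enough to sketch. The helpers below are illustrative, not part of any dashboard: the inputs are hypothetical numbers you would derive from metrics such as haproxy_backend_http_responses_total, haproxy_server_current_sessions, and haproxy_server_limit_sessions (names as I understand the built-in exporter; check your own /metrics output). The latency samples are hypothetical too, and where you collect them is outside this sketch.

```python
import math
from typing import Sequence


def error_ratio(responses_5xx: float, responses_total: float) -> float:
    """Share of 5xx responses among all responses seen in the same window."""
    if responses_total <= 0:
        return 0.0  # no traffic in the window, so nothing failed
    return responses_5xx / responses_total


def session_utilization(current_sessions: float, session_limit: float) -> float:
    """Fraction of the session limit (maxconn) currently in use."""
    if session_limit <= 0:
        raise ValueError("session_limit must be positive")
    return current_sessions / session_limit


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical window: 2000 responses, 30 of them 5xx.
print(error_ratio(30, 2000) > 0.01)            # True, above the 1% threshold
# Hypothetical server: 160 active sessions with maxconn 200.
print(session_utilization(160, 200) >= 0.8)    # True, at the 80% warning line
# Hypothetical latencies in seconds: 98 fast requests and 2 slow ones.
latencies = [0.05] * 98 + [2.0] * 2
mean = sum(latencies) / len(latencies)
print(round(mean, 3), percentile(latencies, 99))
```

In the last example the average stays under 0.1 s while the P99 is 2 s, which is exactly why the average alone hides the slow requests.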
Conclusion
Monitoring isn’t just for aesthetics or reporting to your boss. It’s a tool that helps you sleep better. Instead of waiting for users to complain, you will be the first to know about an incident and resolve it. If you are running a Load Balancer for a real-world project, take 15 minutes to set up this toolkit today. Accurate data is the key to optimizing your system effectively.
