Monitoring Apache Kafka with Prometheus and Grafana: Consumer Lag, Throughput, and Topic Health in Production – ITFROMZERO

Three Ways to Monitor Kafka — and Why None of Them Is a Perfect Fit

Before having proper monitoring in place, I had to SSH into each broker individually to run kafka-consumer-groups.sh and check lag — an incredibly time-consuming process that made early detection nearly impossible. Now I just open the dashboard and can see the entire cluster at a glance, from throughput to consumer lag per group.

But before getting there, I went through three main approaches — each with its own drawbacks.

Approach 1: AKHQ (Kafka HQ) — Pure GUI

AKHQ is a web UI that lets you view topics, consumer groups, and broker status. It’s easy to install, intuitive, and great for exploring your cluster. But the major downside: no time-series data, no alerting, and no integration with the rest of your observability stack. When something goes wrong at 3 AM, nobody is opening AKHQ to check things manually.

Approach 2: Burrow (by LinkedIn)

Burrow is a dedicated consumer lag monitoring tool with a fairly smart “consumer health” evaluation algorithm — it doesn’t just look at instantaneous lag but also analyzes trends. However, setup is complex, you need an additional adapter to expose metrics to Prometheus, and the community has been less active since around 2022.

Approach 3: JMX Exporter + kafka-exporter with Prometheus

This is the stack I’m currently running in production. The two exporters complement each other: JMX Exporter pulls detailed metrics from Kafka’s JMX interface (broker-level: throughput, request rate, replication lag), while kafka-exporter retrieves consumer group and topic metrics (lag, offset, partition count). Both expose Prometheus-compatible endpoints.

Weighing the Trade-offs to Make the Right Choice

After trying all three, here’s a practical comparison:

AKHQ: Fast to set up, clean UI — but no alerting, no time-series, can’t integrate into a shared observability stack.
Burrow: Smarter consumer health evaluation — but complex to deploy, requires a separate adapter, and is less actively maintained.
JMX Exporter + kafka-exporter: Full metrics coverage, integrates with Prometheus/Grafana/Alertmanager — but requires more configuration steps upfront.

Why I chose the third stack: we already had Prometheus running across our entire infrastructure. Grafana dashboards let us correlate Kafka metrics with server metrics (CPU, memory, disk I/O), which is invaluable when you are debugging a performance issue and don’t know whether the bottleneck is in Kafka or on the consumer service side.

Choosing the Right Stack for Your Situation

Small team, few brokers, need quick visibility: AKHQ is enough to get started.
Want in-depth consumer lag monitoring without complex alerting: Burrow + Burrow-exporter.
Production environment, need alerting, integrating with an existing observability stack: JMX Exporter + kafka-exporter + Prometheus + Grafana.

Deployment Guide: JMX Exporter + kafka-exporter

Step 1: Configure JMX Exporter on the Kafka Broker

JMX Exporter runs as a Java agent attached to the Kafka process. Download the agent jar:

wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar \
  -O /opt/kafka/libs/jmx_prometheus_javaagent.jar

Create the configuration file /opt/kafka/config/kafka-jmx-exporter.yaml with the most important metrics:

startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>OneMinuteRate'
    name: kafka_broker_messages_in_per_sec
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>OneMinuteRate'
    name: kafka_broker_bytes_in_per_sec
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=BytesOutPerSec><>OneMinuteRate'
    name: kafka_broker_bytes_out_per_sec
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_broker_under_replicated_partitions
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_broker_active_controller_count
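
To see why these rules capture only the broker-wide metrics, here is a minimal Python sketch of the matching step. It assumes JMX Exporter flattens each bean attribute into a "domain<properties><keys>attribute: value" string and applies the pattern as an unanchored regex. The two sample strings and the rule_matches helper are hypothetical, and Python's re module stands in for the Java regex engine the agent actually uses.

```python
import re

# Rule pattern copied from the configuration above.
PATTERN = r"kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>OneMinuteRate"


def rule_matches(pattern: str, bean_attribute: str) -> bool:
    """Return True if the unanchored regex matches the flattened bean string."""
    return re.search(pattern, bean_attribute) is not None


# Hypothetical flattened bean strings; the values are invented.
broker_wide = (
    "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>"
    "OneMinuteRate: 1523.4"
)
per_topic = (
    "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>"
    "OneMinuteRate: 310.2"
)

print(rule_matches(PATTERN, broker_wide))  # True
print(rule_matches(PATTERN, per_topic))    # False
```

Under that assumption, the per-topic bean is skipped because "name=MessagesInPerSec" is followed by ", topic=orders" rather than "><>". A per-topic breakdown would need a separate rule.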

Add the JVM flag to the Kafka startup file — typically via the KAFKA_OPTS variable:

export KAFKA_OPTS="-javaagent:/opt/kafka/libs/jmx_prometheus_javaagent.jar=7071:/opt/kafka/config/kafka-jmx-exporter.yaml"

After restarting the broker, verify the endpoint:

curl http://localhost:7071/metrics | grep kafka_broker

Step 2: Deploy kafka-exporter

kafka-exporter is a standalone Go binary that connects to your Kafka cluster to retrieve consumer group and topic metrics:

docker run -d --name kafka-exporter \
  -p 9308:9308 \
  danielqsj/kafka-exporter \
  --kafka.server=kafka-broker-1:9092 \
  --kafka.server=kafka-broker-2:9092 \
  --kafka.server=kafka-broker-3:9092

Verify that consumer lag metrics are being exposed:

curl http://localhost:9308/metrics | grep kafka_consumergroup_lag
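
For a scripted version of the same check, the sketch below parses exposition text and totals lag per consumer group. The sample lines and their values are invented for illustration, and the parser is deliberately simplified: it does not handle escaped quotes or braces inside label values.

```python
import re
from collections import defaultdict

# Hypothetical sample of the endpoint output; the values are invented.
SAMPLE = '''\
# TYPE kafka_consumergroup_lag gauge
kafka_consumergroup_lag{consumergroup="payment-consumer",partition="0",topic="payments"} 120
kafka_consumergroup_lag{consumergroup="payment-consumer",partition="1",topic="payments"} 80
kafka_consumergroup_lag{consumergroup="analytics",partition="0",topic="events"} 45000
'''

# metric_name{label="value",...} sample_value
LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def lag_by_group(text: str) -> dict:
    """Sum kafka_consumergroup_lag over all partitions of each consumer group."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        match = LINE.match(line)
        if not match or match.group("name") != "kafka_consumergroup_lag":
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        totals[labels["consumergroup"]] += float(match.group("value"))
    return dict(totals)


print(lag_by_group(SAMPLE))  # {'payment-consumer': 200.0, 'analytics': 45000.0}
```

An empty result from real endpoint output would suggest that no consumer groups are being reported.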

Step 3: Configure Prometheus Scrape

Add two jobs to prometheus.yml:

scrape_configs:
  - job_name: 'kafka-jmx'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
    scrape_interval: 30s

  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
    scrape_interval: 15s

The different scrape intervals are intentional. Most of the JMX metrics above are one-minute rates that are already smoothed, so a 30s interval loses little. Consumer lag can jump quickly, so kafka-exporter is scraped every 15s to surface spikes sooner.
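
As a rough illustration of the trade-off, the sketch below counts how many scrapes fall inside a 5-minute alert window at each interval. The function name is hypothetical, and the calculation ignores scrape jitter and rule evaluation timing.

```python
def samples_in_window(window_seconds: int, scrape_interval_seconds: int) -> int:
    """Number of whole scrape intervals that fit in the window."""
    return window_seconds // scrape_interval_seconds


print(samples_in_window(300, 15))  # 20
print(samples_in_window(300, 30))  # 10
```

Halving the interval doubles the number of data points behind each lag alert, at the cost of more scrape traffic against kafka-exporter.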

Step 4: Import Grafana Dashboards

Two dashboards from Grafana.com work best:

Dashboard ID 7589 — Kafka Overview (uses kafka-exporter metrics, focuses on consumer lag)
Dashboard ID 721 — Kafka Metrics (uses JMX metrics, focuses on broker internals)

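To import: Grafana → Dashboards → Import → enter the ID → select your Prometheus datasource. Community dashboards query specific metric names, so panels that rely on JMX metrics may need their queries adjusted to the names defined in Step 1.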

The Most Important Metrics and How to Write PromQL

Consumer Lag — The #1 Metric You Need to Alert On

Consumer lag is the number of messages a producer has written to a partition that the consumer hasn’t yet processed. A continuously rising lag means the consumer can’t keep up with the producer — this is the first sign of a pipeline problem.
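
Put concretely, lag on one partition is the log end offset minus the group's committed offset. The sketch below computes it and totals it per topic. The offsets and topic names are invented, and the helper functions are hypothetical.

```python
from collections import defaultdict


def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages written to the partition that the group has not yet committed."""
    return max(log_end_offset - committed_offset, 0)


# Hypothetical data: (topic, partition) -> (log end offset, committed offset)
OFFSETS = {
    ("payments", 0): (10_500, 10_380),
    ("payments", 1): (9_900, 9_820),
    ("refunds", 0): (1_200, 1_200),
}


def lag_by_topic(offsets: dict) -> dict:
    """Total lag per topic, summed over that topic's partitions."""
    totals = defaultdict(int)
    for (topic, _partition), (end, committed) in offsets.items():
        totals[topic] += partition_lag(end, committed)
    return dict(totals)


print(lag_by_topic(OFFSETS))  # {'payments': 200, 'refunds': 0}
```

The per-topic totals are what a sum ... by (topic) aggregation over the per-partition lag series produces.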

# Total lag for a consumer group broken down by topic
sum(kafka_consumergroup_lag{consumergroup="my-app-group"}) by (topic)

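# Alert when a group's total lag on a topic exceeds the threshold (adjust per group SLA).
# kafka_consumergroup_lag is reported per partition, so aggregate before comparing.
alert: KafkaConsumerLagHigh
expr: sum by (consumergroup, topic) (kafka_consumergroup_lag{consumergroup="payment-consumer"}) > 5000
for: 5m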
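
The for: 5m clause means the condition has to hold on every evaluation for five minutes before the alert fires, so a brief spike does not page anyone. The sketch below is a simplified simulation of that behaviour, not Prometheus's actual implementation. The function name and the sample data are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(
    samples: Iterable[Tuple[int, float]],
    threshold: float,
    for_seconds: int,
) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None.

    samples are (timestamp_seconds, lag) pairs, one per rule evaluation.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition became true: pending
            if timestamp - pending_since >= for_seconds:
                return timestamp  # true for the whole "for" duration
        else:
            pending_since = None  # any false evaluation resets the timer
    return None


# Lag dips below the threshold at t=120, so the timer restarts at t=180.
sustained = [(0, 6000), (60, 7000), (120, 3000), (180, 6500), (240, 6500),
             (300, 6500), (360, 6500), (420, 6500), (480, 6500)]
brief = [(0, 9000), (60, 9000), (120, 100)]

print(first_firing_time(sustained, 5000, 300))  # 480
print(first_firing_time(brief, 5000, 300))      # None
```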

Broker Throughput

# Messages per second, per broker (drop "by (instance)" for the cluster-wide total)
sum(kafka_broker_messages_in_per_sec) by (instance)

# Bytes in/out for bandwidth monitoring
sum(kafka_broker_bytes_in_per_sec) by (instance)
sum(kafka_broker_bytes_out_per_sec) by (instance)

Topic Health — Under-replicated Partitions

This is the metric I assign the highest alert priority to. When UnderReplicatedPartitions > 0, some replicas are failing to sync with the leader — if a broker dies at that moment, there’s a risk of message loss.

# Alert immediately on any under-replicated partition — no for: delay needed
alert: KafkaUnderReplicatedPartitions
expr: kafka_broker_under_replicated_partitions > 0

# Active controller count must always equal exactly 1 across the cluster
alert: KafkaControllerAbnormal
expr: sum(kafka_broker_active_controller_count) != 1
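
The controller rule sums the per-broker gauge because each broker reports 1 if it is the active controller and 0 otherwise. The sketch below mirrors that check outside Prometheus. The broker names and the controller_state helper are hypothetical.

```python
def controller_state(active_controller_counts: dict) -> str:
    """Classify cluster health from per-broker ActiveControllerCount values."""
    total = sum(active_controller_counts.values())
    if total == 1:
        return "ok"
    if total == 0:
        return "no active controller"
    return "multiple active controllers"


healthy = {"kafka-broker-1": 0, "kafka-broker-2": 1, "kafka-broker-3": 0}
leaderless = {"kafka-broker-1": 0, "kafka-broker-2": 0, "kafka-broker-3": 0}
split = {"kafka-broker-1": 1, "kafka-broker-2": 1, "kafka-broker-3": 0}

print(controller_state(healthy))     # ok
print(controller_state(leaderless))  # no active controller
print(controller_state(split))       # multiple active controllers
```

Both abnormal states are caught by the single != 1 comparison, so one rule covers a missing controller and a split-brain situation.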

Practical Tips from a Production Environment

Separate consumer lag alerts by group: don’t alert on total cluster lag — each consumer group has different throughput and SLA requirements. A payment-processing group might need a threshold of 1,000, while an analytics group can tolerate 100,000.
Add Grafana variables: create variables for consumergroup and topic in your dashboard for quick filtering during troubleshooting, instead of having to edit queries every time.
kafka-exporter with SASL: if your cluster has authentication enabled, add the flags --sasl.enabled --sasl.username --sasl.password --sasl.mechanism when running kafka-exporter.
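Network partition alert: also monitor the kafka_controller_offline_partitions_count metric. An offline partition has no active leader and cannot serve reads or writes, which often signals network issues between brokers. The Step 1 configuration does not export this metric, so first add a rule with that name for kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value.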
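
One way to keep per-group thresholds manageable is to generate the rules from a single table. The sketch below renders a Prometheus rule file as text, using only the standard library. The group names, thresholds, rule group name, and render_rules helper are all hypothetical.

```python
# Hypothetical per-group thresholds; tune these to each group's SLA.
THRESHOLDS = {
    "payment-consumer": 1_000,
    "analytics-consumer": 100_000,
}


def render_rules(thresholds: dict, for_duration: str = "5m") -> str:
    """Render one lag alert per consumer group as Prometheus rule-file YAML."""
    lines = ["groups:", "  - name: kafka-consumer-lag", "    rules:"]
    for group, threshold in sorted(thresholds.items()):
        # Lag is exported per partition, so aggregate per group and topic.
        expr = (
            "sum by (consumergroup, topic) "
            f'(kafka_consumergroup_lag{{consumergroup="{group}"}}) > {threshold}'
        )
        lines += [
            "      - alert: KafkaConsumerLagHigh",
            f"        expr: {expr}",
            f"        for: {for_duration}",
        ]
    return "\n".join(lines) + "\n"


print(render_rules(THRESHOLDS))
```

The generated text can be written to a file and checked with promtool before Prometheus loads it.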

Since setting up this stack, whenever we get a “consumer lag spike” alert, our team immediately correlates on the same Grafana dashboard: CPU/memory of the consumer service, broker network throughput, disk I/O. Instead of opening 5 SSH terminals into 5 different servers, all the information is on one screen — debug time dropped from 20 minutes to 3–4 minutes.