Monitoring Elasticsearch and OpenSearch Clusters with Prometheus Exporter and Grafana: Tracking Index Health, Query Latency, and System Resources

Monitoring tutorial - IT technology blog

After 6 months running an Elasticsearch cluster in production with around 50 million documents, I learned something the hard way: monitoring CPU and RAM alone isn’t enough. A cluster can show “all green” in Grafana while query latency quietly climbs in the background. Or worse — JVM heap creeps up to 90% and nobody notices until a node gets OOM killed.

My Prometheus + Grafana setup was already tracking 15 servers, and it had caught plenty of incidents before users ever complained. But when I added Elasticsearch to the stack, I spent nearly 2 weeks fumbling around. This article is what I wish I’d had from day one.

Why Elasticsearch Needs Its Own Monitoring Approach

System metrics like CPU, RAM, and disk I/O only scratch the surface. The real complexity lives inside the cluster:

  • Shard allocation: the cluster keeps running but unassigned shards mean your data is in a dangerous state
  • Query latency: search p99 climbing from 50ms to 2s is a clear sign the index needs optimization — immediately
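  • JVM GC pressure: a stop-the-world GC pause longer than 1 second stalls every request on that node for the whole pause
  • Bulk queue rejection: when the write thread pool queue fills up, bulk requests are rejected with HTTP 429, and unless the client logs those responses the failures are easy to miss
  • Disk watermark: Elasticsearch blocks writes to affected indices once a node crosses the flood-stage watermark (95% disk used by default, so under 5% free); catch it late and the rejected writes can mean lost data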
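
For a quick manual look at several of these signals before any dashboard exists, the cluster's own REST API can be queried directly. The sketch below is only illustrative: the function name es_spot_check and the default URL are assumptions, and it assumes an unauthenticated cluster whose indexing thread pool is called write (older versions call it bulk).

```shell
# Hypothetical helper: print a few raw health signals straight from the cluster.
# Usage: es_spot_check http://localhost:9200
es_spot_check() {
  es_url="${1:-http://localhost:9200}"   # assumed default; adjust to your cluster

  # Cluster status plus the unassigned shard count
  curl -fsS "$es_url/_cluster/health?filter_path=status,unassigned_shards"
  echo

  # Queue depth and rejections for the write thread pool, per node
  curl -fsS "$es_url/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected"

  # Disk usage per node, to compare against the watermarks
  curl -fsS "$es_url/_cat/allocation?v"
}
```

This is a spot check, not monitoring: it shows the state at one instant, which is exactly why the same signals are worth scraping continuously.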

Without application-level monitoring, debugging incidents is like searching for a needle in a haystack.

Comparing the 3 Most Common Approaches

1. X-Pack Stack Monitoring (native)

Elasticsearch has built-in monitoring through Kibana Stack Monitoring. Metrics are stored either in the cluster itself or in a dedicated monitoring cluster.

  • Simple setup: flip a few lines in elasticsearch.yml and you’re done
  • Kibana has a polished UI with a built-in cluster health overview
  • However: it consumes resources from the cluster it’s monitoring, and with OpenSearch (the fork after Elastic changed its license) this feature isn’t available in an equivalent form

2. Metricbeat

Elastic’s agent runs on each node, scrapes metrics, and pushes them to Elasticsearch or Logstash.

  • Plenty of built-in modules, straightforward configuration
  • Tightly coupled to the Elastic stack — doesn’t integrate if you’re already running Prometheus
  • Requires deploying and maintaining an agent on every node — a real operational burden once your cluster grows beyond 10 nodes

3. Prometheus Exporter (elasticsearch_exporter)

The exporter runs as a standalone service, exposes metrics in Prometheus format, and Prometheus scrapes it on a schedule.

  • Plugs directly into an existing Prometheus + Grafana stack
  • A single exporter monitors the entire cluster — no per-node installation needed
  • Works with both Elasticsearch and OpenSearch
  • Community dashboards are readily available on Grafana Labs

Breaking Down the Trade-offs — And Why I Went with Prometheus Exporter

After 2 weeks testing all three approaches on staging, the results were pretty clear:

X-Pack Monitoring makes sense if your team is fully committed to the Elastic stack with a license. But our stack runs OpenSearch for its more permissive license, so this option got cut early.

Metricbeat is reasonable if you’re already using Beats for log shipping. But adding an agent deployment pipeline for every new node is overhead that a small team doesn’t want to carry.

Prometheus Exporter won for a simple reason: Prometheus + Grafana was already running for those 15 other servers. All I needed was one exporter — no changes to the existing architecture. For OpenSearch, the exporter works just fine too — swap the URI and you’re good to go.

Step-by-Step Deployment

Step 1: Install elasticsearch_exporter

The prometheus-community/elasticsearch_exporter project ships both a Docker image and standalone binaries. Docker is the quickest way to get started:

docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  --restart unless-stopped \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=http://elasticsearch:9200 \
  --es.all \
  --es.indices \
  --es.shards

If your cluster has authentication enabled (the OpenSearch Security plugin or Elasticsearch's security features, formerly X-Pack Security):

docker run -d \
  --name elasticsearch-exporter \
  -p 9114:9114 \
  --restart unless-stopped \
  -e ES_USERNAME=monitoring_user \
  -e ES_PASSWORD=your_password \
  quay.io/prometheuscommunity/elasticsearch-exporter:latest \
  --es.uri=https://elasticsearch:9200 \
  --es.all \
  --es.indices \
  --es.clusterinfo.interval=5m

Verify the exporter is running and exposing metrics:

curl http://localhost:9114/metrics | grep elasticsearch_cluster_health

If the output includes an elasticsearch_cluster_health_status line with color="green" and a value of 1 (other labels, such as the cluster name, may also appear), the exporter has successfully connected to your cluster.
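
For scripts or smoke tests, the same check can be automated by reading the status gauges instead of eyeballing grep output. A minimal sketch, assuming the exporter emits one elasticsearch_cluster_health_status series per color and sets the active one to 1; the function name health_color is made up for illustration.

```shell
# Hypothetical helper: read Prometheus exposition text on stdin and print the
# color whose elasticsearch_cluster_health_status gauge is currently 1.
health_color() {
  awk '
    /^elasticsearch_cluster_health_status[{]/ && $NF == 1 {
      if (match($0, /color="[a-z]+"/)) {
        # strip the leading color=" (7 chars) and the trailing quote
        print substr($0, RSTART + 7, RLENGTH - 8)
      }
    }'
}

# Usage: curl -fsS http://localhost:9114/metrics | health_color
```

An empty result means no status series was found, which usually points at the exporter failing to reach the cluster rather than at the cluster itself.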

Step 2: Configure the Prometheus Scrape Job

Add the job to prometheus.yml:

scrape_configs:
  - job_name: 'elasticsearch'
    scrape_interval: 30s
    scrape_timeout: 25s
    static_configs:
      - targets:
          - 'elasticsearch-exporter:9114'
        labels:
          cluster: 'prod-es-cluster'
          env: 'production'

Reload Prometheus without a restart (this endpoint only responds if Prometheus was started with --web.enable-lifecycle):

curl -X POST http://localhost:9090/-/reload

Open the Prometheus UI and query elasticsearch_cluster_health_status to confirm data is flowing in correctly.
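
Query latency is one of the things this setup is meant to track, and as far as I know the exporter exposes cumulative counters rather than percentiles, so an average has to be derived in PromQL. The sketch below runs such a query through the Prometheus HTTP API from the shell. It assumes the counters elasticsearch_indices_search_query_time_seconds and elasticsearch_indices_search_query_total exist in your exporter version; the function name prom_query and the variable AVG_SEARCH_LATENCY are illustrative only.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# Usage: prom_query 'PROMQL' [prometheus_url]
prom_query() {
  curl -fsS -G "${2:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Average seconds per search query over the last 5 minutes, per series.
# Metric names are assumed; confirm them in the exporter's /metrics output first.
AVG_SEARCH_LATENCY='rate(elasticsearch_indices_search_query_time_seconds[5m])
  / rate(elasticsearch_indices_search_query_total[5m])'

# Usage: prom_query "$AVG_SEARCH_LATENCY"
```

Keep in mind this yields an average, not a p99; a slow tail can hide behind a healthy-looking mean, so treat the number as a trend indicator.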

Step 3: Import the Grafana Dashboard

Dashboard ID 14191 on Grafana Labs is the most widely used community dashboard for elasticsearch_exporter:

  1. Grafana → Dashboards → Import
  2. Enter ID 14191 → Load
  3. Select your Prometheus data source → Import

Out of the box you get: cluster health status, node count, JVM heap usage, indexing rate, search rate, GC collection time, and disk usage per node.

Step 4: Set Up Alerts for Critical Metrics

These are the alert rules I run directly in production — add them to your Prometheus rules file:

groups:
  - name: elasticsearch_alerts
    rules:
      # Cluster status is red (critical) or yellow (warning)
      - alert: ElasticsearchClusterRed
        expr: elasticsearch_cluster_health_status{color="red"} == 1
        for: 1m
        labels:
          severity: critical

      - alert: ElasticsearchClusterYellow
        expr: elasticsearch_cluster_health_status{color="yellow"} == 1
        for: 5m
        labels:
          severity: warning

      # Unassigned shards
      - alert: ElasticsearchUnassignedShards
        expr: elasticsearch_cluster_health_unassigned_shards > 0
        for: 5m
        labels:
          severity: warning

      # JVM Heap > 85%
      - alert: ElasticsearchHighJVMHeap
        expr: |
          elasticsearch_jvm_memory_used_bytes{area="heap"} /
          elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
        for: 3m
        labels:
          severity: critical

      # Disk running low (< 20% free)
      - alert: ElasticsearchLowDiskSpace
        expr: |
          elasticsearch_filesystem_data_available_bytes /
          elasticsearch_filesystem_data_size_bytes < 0.20
        for: 5m
        labels:
          severity: warning
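
These rules cover health, heap, and disk, but not thread pool rejections or latency drift. One possible extension is sketched below as a shell function that prints two extra rules, so the output can be redirected to a file and validated before loading. The alert names, the thresholds (any rejection within 5 minutes, 0.5 s average search time), and the file name in the usage comment are assumptions to tune, and the metric names should be confirmed against your exporter's /metrics output.

```shell
# Hypothetical helper: print two additional alert rules on stdout.
write_extra_es_rules() {
  cat <<'EOF'
groups:
  - name: elasticsearch_extra_alerts
    rules:
      # Any thread pool (write, search, ...) rejecting work
      - alert: ElasticsearchThreadPoolRejections
        expr: increase(elasticsearch_thread_pool_rejected_count[5m]) > 0
        for: 5m
        labels:
          severity: warning

      # Average search time per query above 0.5s (assumed threshold)
      - alert: ElasticsearchSlowSearchAverage
        expr: |
          rate(elasticsearch_indices_search_query_time_seconds[5m]) /
          rate(elasticsearch_indices_search_query_total[5m]) > 0.5
        for: 10m
        labels:
          severity: warning
EOF
}

# Usage:
#   write_extra_es_rules > es_extra.rules.yml
#   promtool check rules es_extra.rules.yml
```

Running promtool before the reload catches YAML and PromQL syntax errors that would otherwise make Prometheus reject the reloaded configuration.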

Step 5: Create a Monitoring User with Minimal Permissions

Never use an admin account for the exporter — create a dedicated user instead:

# Elasticsearch - create role and user via API
curl -X PUT "http://localhost:9200/_security/role/monitoring_role" \
  -H 'Content-Type: application/json' \
  -u elastic:password \
  -d '{
    "cluster": ["monitor"],
    "indices": [{"names": ["*"], "privileges": ["monitor", "view_index_metadata"]}]
  }'

curl -X PUT "http://localhost:9200/_security/user/monitoring_user" \
  -H 'Content-Type: application/json' \
  -u elastic:password \
  -d '{"password": "strong_password", "roles": ["monitoring_role"]}'

For OpenSearch, do the same through the OpenSearch Security REST API or the Dashboard UI.
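
A rough OpenSearch equivalent is sketched here, assuming the Security plugin's REST API is reachable under _plugins/_security/api and that the built-in action groups cluster_monitor and indices_monitor give the exporter enough access. The function name os_create_monitoring_user is hypothetical, and the role and user names simply mirror the Elasticsearch example.

```shell
# Hypothetical helper: create a read-only monitoring role and an internal user
# attached to it through the OpenSearch Security REST API.
# Usage: os_create_monitoring_user https://localhost:9200 admin:admin_password
os_create_monitoring_user() {
  os_url="$1"
  admin_auth="$2"   # user:password of an existing security admin

  # Role limited to the built-in monitoring action groups
  curl -fsS -X PUT "$os_url/_plugins/_security/api/roles/monitoring_role" \
    -H 'Content-Type: application/json' -u "$admin_auth" \
    -d '{
      "cluster_permissions": ["cluster_monitor"],
      "index_permissions": [
        {"index_patterns": ["*"], "allowed_actions": ["indices_monitor"]}
      ]
    }'

  # Internal user attached to that role
  curl -fsS -X PUT "$os_url/_plugins/_security/api/internalusers/monitoring_user" \
    -H 'Content-Type: application/json' -u "$admin_auth" \
    -d '{"password": "strong_password", "opendistro_security_roles": ["monitoring_role"]}'
}
```

With a self-signed certificate, curl will also need --cacert pointing at your CA file. If the exporter logs permission errors afterwards, the role likely needs a broader action group than assumed here.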

Lessons Learned After 6 Months in Production

A 30s scrape interval is sufficient for most use cases. Drop to 15s if you need finer granularity for debugging, but the exporter will make more ES API calls and consume more cluster resources.

The --es.shards flag can cause timeouts on large clusters. The first time I enabled it on a cluster with 800 shards, the exporter timed out constantly. The fix: add --es.timeout=30s, or just disable the flag if you don’t need shard-level detail.
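
One way to judge up front whether --es.shards is likely to hurt is to count shards before enabling it. A rough sketch, assuming an unauthenticated cluster; the function name shard_count and the 500-shard cutoff in the usage comment are arbitrary illustrations, not a documented limit.

```shell
# Hypothetical helper: count shards (primaries and replicas) in the cluster.
# _cat/shards prints one line per shard when no header is requested.
shard_count() {
  curl -fsS "${1:-http://localhost:9200}/_cat/shards?h=index" | wc -l | tr -d ' '
}

# Usage (the 500 cutoff is an arbitrary example):
#   if [ "$(shard_count http://elasticsearch:9200)" -gt 500 ]; then
#     echo "consider --es.timeout=30s or dropping --es.shards"
#   fi
```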

Dashboard 14191 doesn’t cover OpenSearch-specific metrics like security audit logs or anomaly detection job status. If you’re using OpenSearch’s extended features, you’ll need to build additional panels from the raw metrics yourself.

What really sold me on this setup was an Alertmanager Telegram message at 3 AM: JVM heap was sitting at 88% on node 2. The team had exactly 30 minutes to restart the node and bump the heap config before Japanese users logged in for the morning shift. Without that alert, we’d have been looking at guaranteed downtime — and a very uncomfortable 6 AM phone call.

My favorite thing about this approach: there’s nothing new to learn. If your team already knows Prometheus and Grafana, you can have complete Elasticsearch monitoring — shard allocation, JVM heap, disk watermark, and alerts — in 1 to 2 hours.
