Elastic APM: Deep-Diving into Code and Diagnosing Systems Without the Guesswork

Monitoring tutorial - IT technology blog

Why Prometheus and Grafana Aren’t Enough

My system used to run a Prometheus + Grafana cluster to monitor 15 servers. Everything looked great until one day users complained: “The app is as slow as a turtle.” Looking at the dashboard, CPU was under 20%, plenty of RAM was free, and every metric glowed green. That’s when I realized Prometheus only tells you the system is “alive”; it can’t answer the question: Why did a specific request take 10 seconds to process?

Where is the bottleneck? Slow code logic, missing SQL indexes, or a third-party API timeout? To “endoscope” the inner workings of an application, we need APM (Application Performance Monitoring).

Distinguishing Monitoring Layers

Many developers often confuse logs, metrics, and tracing. Let’s distinguish them clearly to avoid misuse:

1. Logging

Use the ELK Stack or Graylog to centralize logs. This method is useful when you already know an error occurred and need to find the detailed cause (stack trace). However, using logs to measure the response time of 1 million requests is practically impossible.

2. Infrastructure Monitoring (Metrics)

This is where Prometheus shines. It focuses on hardware health like CPU, RAM, and Network. It tells you when to buy more servers but is completely blind to the logic inside your code.

3. Performance Monitoring (APM – Tracing)

Tools like New Relic, Datadog, or Elastic APM perform instrumentation: they embed probes directly into your code. These probes record a request’s journey from start to finish, through every function and database call, until a result is returned. This is the last line of defense for optimizing user experience.
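To make “instrumentation” concrete, here is a toy, stdlib-only sketch of what an agent does under the hood: wrap a function, time it, and record the result as a span. Real agents do this automatically by patching libraries at import time and shipping the spans to the APM Server; the function and span names below are made up for illustration.

```python
import functools
import time

SPANS = []  # a real agent buffers spans and ships them to the APM Server

def traced(name):
    """Toy decorator: record how long the wrapped function takes, like an APM span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                SPANS.append({"name": name, "duration_ms": elapsed_ms})
        return wrapper
    return decorator

@traced("db.query.orders")
def fetch_orders():
    time.sleep(0.01)  # stand-in for a slow SQL query
    return ["order-1", "order-2"]

fetch_orders()
print(SPANS[0]["name"])  # db.query.orders
```

The waterfall charts you see in Kibana are essentially thousands of these span records stitched together by trace ID.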

Why I Chose Elastic APM

After trying New Relic and watching the bill balloon as traffic grew, I pivoted to Elastic APM. Here are some real-world takeaways:

  • Pros: Extremely smooth integration if you’re already using Elasticsearch as a log server. It supports Distributed Tracing, allowing you to track a request as it traverses 5-7 different microservices. You can clearly see which service is the “culprit” causing delays.
  • Cons: Tracing data is very heavy. If not configured carefully, it can consume hundreds of GBs of disk space daily. Additionally, the Agent causes about 1-3% CPU overhead, but this is entirely acceptable compared to the value it provides.

Basic Deployment Architecture

This model consists of 4 components working in harmony:

  1. APM Agent: Libraries embedded in the code (Python, Node.js, Go…).
  2. APM Server: The relay station that receives data from the Agent and pushes it into Elasticsearch.
  3. Elasticsearch: The massive data store for tracing data.
  4. Kibana: The visual interface for developers to investigate errors and view charts.

Real-world Deployment Guide

We will quickly set up the system using Docker and integrate it into a Python application for demonstration.

Step 1: Initialize the Elastic APM Cluster with Docker

Create a docker-compose.yml file with a minimal configuration:

version: '3.8'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:7.17.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  apm-server:
    image: docker.elastic.co/apm/apm-server:7.17.0
    ports:
      - "8200:8200"
    command: >
       apm-server -e
         -E apm-server.rum.enabled=true
         -E setup.kibana.host="kibana:5601"
         -E output.elasticsearch.hosts=["elasticsearch:9200"]
    depends_on:
      - elasticsearch
      - kibana

Run the command docker-compose up -d and wait about a minute for the services to fully start.
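Before wiring up the application, it helps to confirm the three services actually answer. A small stdlib-only check; the URLs come from the compose file above (adjust the host if Docker runs remotely):

```python
import urllib.request

def is_up(url, timeout=3):
    """Return True if the service answers HTTP on the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except OSError:
        return False

# Endpoints exposed by the docker-compose file above.
for name, url in [("elasticsearch", "http://localhost:9200"),
                  ("kibana", "http://localhost:5601/api/status"),
                  ("apm-server", "http://localhost:8200")]:
    print(f"{name}: {'UP' if is_up(url) else 'not ready yet'}")
```

Elasticsearch is usually the slowest to start; the other two will report ready shortly after it does.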

Step 2: Attach the Agent to a Python Application

Install the official agent library from Elastic:

pip install elastic-apm

For a Flask application, you only need to add a few simple configuration lines:

from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
app.config['ELASTIC_APM'] = {
  'SERVICE_NAME': 'order-service',
  'SERVER_URL': 'http://localhost:8200',
  'ENVIRONMENT': 'production',
}

apm = ElasticAPM(app)

Leveraging Data on Kibana

Once traffic starts flowing, open the Observability > APM section. You will see numbers that tell a story:

1. Response Time (Latency)

Pay attention to the p95 and p99 metrics. If p99 is 3 seconds, it means 1% of requests take 3 seconds or longer, so 1 in 100 customers is experiencing unacceptable delays. This gives the dev team a concrete basis for deciding which endpoints to optimize first.
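If the difference between an average and a tail percentile feels abstract, this stdlib sketch makes it concrete: with a skewed (simulated, synthetic) latency distribution, the mean can look healthy while p99 is several times worse.

```python
import random
from statistics import mean, quantiles

# Simulated request latencies in ms (synthetic data, skewed like real traffic).
random.seed(42)
latencies = [random.lognormvariate(5, 0.6) for _ in range(10_000)]

# quantiles(n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
cuts = quantiles(latencies, n=100)
p95, p99 = cuts[94], cuts[98]

print(f"mean = {mean(latencies):.0f} ms")
print(f"p95  = {p95:.0f} ms")
print(f"p99  = {p99:.0f} ms")
```

This is why APM dashboards lead with percentiles: averaging hides exactly the users who are about to churn.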

2. Error Rate

APM is smarter than logs because it automatically groups similar errors. Instead of sifting through 1,000 log lines of DB connection errors, you see a single entry with a frequency chart.

3. Tracing Database Queries

This is the most valuable feature. When clicking on a slow request, Kibana displays a waterfall chart. You can see exactly how many milliseconds the query SELECT * FROM orders... took. In fact, my team once discovered a missing index that slowed down the entire system just by using this feature.

Hard-learned Lessons in Operation

Deploying APM for a production system requires attention to 3 key points:

  • Sampling Rate: For systems with 10,000 requests/second, never log 100%. Set transaction_sample_rate to around 0.05 (5% sampling). This reduces the load on Elasticsearch while maintaining statistical significance.
  • Security: Always configure a secret_token for the APM Server. Otherwise, anyone could send junk data and flood your storage.
  • Alerting: Don’t wait to check the dashboard. Set up alerts via Telegram when the Error Rate exceeds a 2% threshold for 5 consecutive minutes.
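The first two lessons map directly onto agent settings. A hedged sketch extending the Flask config from Step 2 (`TRANSACTION_SAMPLE_RATE` and `SECRET_TOKEN` are real Python agent options; the token value here is a placeholder you must replace and mirror on the APM Server):

```python
# Extended Flask config from Step 2, with sampling and auth applied.
ELASTIC_APM = {
    'SERVICE_NAME': 'order-service',
    'SERVER_URL': 'http://localhost:8200',
    'ENVIRONMENT': 'production',
    # Keep full traces for only 5% of transactions to protect Elasticsearch disk.
    'TRANSACTION_SAMPLE_RATE': 0.05,
    # Must match the token configured on the APM Server side,
    # e.g. -E apm-server.secret_token=<value> in the compose command.
    'SECRET_TOKEN': 'change-me',
}
```

Assign this dict to `app.config['ELASTIC_APM']` exactly as in Step 2; no other code changes are needed.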

Mastering Elastic APM isn’t just about installing a tool; it’s about building a data-driven optimization culture. After 3 months of implementation, my team reduced average response times by 40% and completely eliminated “unexplained slowness” issues.
