When Prometheus Isn’t Enough to Diagnose Issues
Prometheus and Grafana tell you that something is wrong — latency spikes, error rates climbing. But when your system has 8–10 services talking to each other, the harder question is: which service is the problem, and at which step in the request chain?
I once spent nearly 3 hours debugging a request that took 4 seconds. Each service's logs looked fine, and the metrics showed nothing unusual. Eventually I discovered one service was hitting the database with N+1 queries, but I had to manually grep through 5 services to find it. That's when I started looking into distributed tracing.
OpenTelemetry + Jaeger solves exactly this problem. Instead of manually grepping logs, you see the entire journey of a request across services as a timeline — each step measured in milliseconds, clearly showing which step is slow and by how much.
Setting Up Jaeger and OpenTelemetry Collector
Use Docker Compose to spin up the stack. The architecture here is: applications send traces to the OTel Collector, which then forwards them to Jaeger. Adding a Collector layer might seem redundant, but it batches spans before sending them, significantly reducing load on Jaeger — especially under high traffic.
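To see why the batch processor matters, here is a back-of-the-envelope sketch (illustrative numbers only): without batching, every span is its own export call; with the 1024-span batches configured below, thousands of spans collapse into a handful of RPCs to Jaeger.

```python
import math

def export_calls(num_spans: int, batch_size: int = 1024) -> int:
    """How many export RPCs are needed to ship num_spans spans."""
    return math.ceil(num_spans / batch_size)

# At 5,000 spans/s: one call per span vs. batched
unbatched = export_calls(5000, batch_size=1)   # 5000 RPCs per second
batched = export_calls(5000)                   # 5 RPCs per second
```

The real Collector also flushes on the 1s timeout, so a partially filled batch never waits longer than that.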
Create the docker-compose.yml file
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.57
    ports:
      - "16686:16686"   # Jaeger UI
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    # The Collector reaches Jaeger's OTLP gRPC receiver on jaeger:4317
    # over the compose network, so that port needs no host mapping.

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.99.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - jaeger
Configure the OpenTelemetry Collector
Create the otel-collector-config.yaml file:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    # Jaeger's OTLP receiver (enabled via COLLECTOR_OTLP_ENABLED above)
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
Start the stack:
docker compose up -d
# Check the Jaeger UI at http://localhost:16686
Instrumenting a Python Application with OpenTelemetry
The example below uses a Flask API acting as the order-service — it receives requests from clients, then calls out to inventory-service and pricing-service. Install the required libraries:
pip install opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests
Initialize the tracer in your application
# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def setup_tracing(service_name: str):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(
        endpoint="http://localhost:4317",  # the OTel Collector, not Jaeger
        insecure=True,
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)
Integrate into the Flask app
# app.py
from flask import Flask, jsonify
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests

from tracing import setup_tracing

app = Flask(__name__)
tracer = setup_tracing("order-service")

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/order/<int:order_id>")
def get_order(order_id):
    with tracer.start_as_current_span("fetch-order-details") as span:
        span.set_attribute("order.id", order_id)

        # Child span: call inventory service
        with tracer.start_as_current_span("check-inventory"):
            inventory = requests.get(
                f"http://inventory-service/stock/{order_id}"
            ).json()

        # Child span: call pricing service
        with tracer.start_as_current_span("calculate-price"):
            price = requests.get(
                f"http://pricing-service/price/{order_id}"
            ).json()

        span.set_attribute("order.total", price.get("total", 0))
    return jsonify({"order": order_id, "inventory": inventory, "price": price})

if __name__ == "__main__":
    app.run(port=5000)
The key point: when order-service calls another service over HTTP, RequestsInstrumentor automatically injects the trace context into the outgoing headers (the W3C traceparent header, which carries the trace ID and parent span ID). The receiving service reads that header and attaches its own spans to the same trace. This means every service in your system must be instrumented: skip one and the trace chain breaks at that point.
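For intuition about what actually travels between services, here is a minimal sketch of parsing that traceparent header. The header value is an example from the W3C Trace Context spec; the parser is illustrative, not the SDK's own propagator.

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: version-traceid-spanid-flags."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        # Bit 0 of the flags byte marks the trace as sampled
        "sampled": bool(int(flags, 16) & 0x01),
    }

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(hdr)
```

The receiving service uses the trace ID to join the same trace and the span ID as the parent of its own spans; in practice the OTel propagator does all of this for you.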
Reading Traces in the Jaeger UI
Send 10–20 test requests to the API, then open http://localhost:16686.
Finding the trace for a request
- Select the Service: order-service
- Click Find Traces
- Pick a trace with a high duration to analyze
Jaeger displays the timeline as a waterfall view. You can immediately see which span consumed the most time — for example, if check-inventory took 800ms out of a total 1 second, that’s where you should look first. No guessing, no grepping.
Adding attributes for easier debugging
Custom attributes let you attach business logic context to spans — incredibly useful when you need to find the trace for a specific user or order:
with tracer.start_as_current_span("query-database") as span:
    span.set_attribute("db.query", sql_query)
    span.set_attribute("user.id", user_id)
    try:
        results = run_query(sql_query)  # run_query is your own DB helper
        span.set_attribute("db.rows_returned", len(results))
    except Exception as exc:
        # If an error occurs, record it so Jaeger flags the trace:
        span.record_exception(exc)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        raise
Filtering traces by condition
Jaeger supports finding traces by tag. For example, find all requests from user 12345, or filter for traces with errors:
# In the Jaeger UI, under Tags:
user.id=12345
# Or find traces with errors:
error=true
Lessons Learned from Production
The first time I set everything up, I enabled 100% trace sampling — meaning every single request was traced. It worked great during testing. In production at ~500 req/s, the Jaeger UI started lagging and the Collector began dropping spans. Limit your sampling rate from the start:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Only trace 10% of requests
sampler = TraceIdRatioBased(0.1)
provider = TracerProvider(resource=resource, sampler=sampler)
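One property of TraceIdRatioBased worth understanding: the decision is derived from the trace ID itself, not from a random roll, so every service that sees the same trace makes the same keep/drop decision. The sketch below mirrors that idea; it is illustrative, not the SDK's exact implementation.

```python
TRACE_ID_SPACE = 2 ** 64  # the sampler looks at the low 64 bits of the ID

def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling: keep the trace if its low bits
    fall below ratio * 2^64. Same ID -> same answer everywhere."""
    bound = round(ratio * TRACE_ID_SPACE)
    return (trace_id & (TRACE_ID_SPACE - 1)) < bound

keep_low = should_sample(0, 0.1)            # low IDs fall under the bound
keep_high = should_sample(2**64 - 1, 0.1)   # high IDs are dropped
```

This is why you set the sampler on the service that starts the trace; downstream services follow the sampled flag propagated in the traceparent header.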
Alerting was a similar story. I once set up alerts for every span slower than 500ms — the result was dozens of Telegram alerts per hour, to the point where I muted notifications entirely. It took several rounds of tuning to land on thresholds that actually meant something. A better approach: alert on the end-to-end latency of the entire trace, not individual spans. A span that takes 600ms isn’t necessarily a problem if the total request still comes in under 1 second.
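The trace-level rule above can be sketched in a few lines. The span dicts and field names here are hypothetical (your real spans come from Jaeger or your alerting pipeline); the point is that end-to-end latency is the envelope of all spans, not any single one.

```python
def trace_latency_ms(spans: list[dict]) -> float:
    """End-to-end latency: earliest span start to latest span end."""
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    return end - start

def should_alert(spans: list[dict], threshold_ms: float = 1000) -> bool:
    return trace_latency_ms(spans) > threshold_ms

spans = [
    {"name": "fetch-order-details", "start_ms": 0, "end_ms": 950},
    {"name": "check-inventory", "start_ms": 10, "end_ms": 610},  # a 600 ms span
]
total = trace_latency_ms(spans)  # 950 ms: slow span, but no alert
```

The 600 ms check-inventory span alone would have paged under the per-span rule, yet the whole request finishes in 950 ms, under the 1-second threshold.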
One small thing that saves a lot of time: name your spans using a verb-noun pattern, like fetch-user-profile, insert-order-db, send-notification-email. When a trace has 50 spans, clear names make the waterfall chart much faster to scan than vague names like process or handler.
Integrating with Prometheus (Optional)
The OTel Collector can also export metrics to Prometheus — a single pipeline for both traces and metrics:
# Add to otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
My incident workflow now looks like this: spot a latency spike in Grafana → note the timestamp → jump to Jaeger and filter traces for that time window → find traces with unusual duration → drill down into individual spans. From alert to root cause typically takes under 5 minutes, compared to the 3 hours of log grepping I used to do.
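The "jump to Jaeger for that time window" step can even be scripted against the query API that backs the Jaeger UI. That API is internal and unversioned, so treat the endpoint and parameter names below as an assumption that may change between releases; the function and spike timestamp are hypothetical examples.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def jaeger_trace_url(service: str, around: datetime, window_min: int = 5,
                     min_duration: str = "500ms") -> str:
    """Build a query URL for traces of `service` within +/- window_min
    minutes of a Grafana spike, filtered to slow traces."""
    start = around - timedelta(minutes=window_min)
    end = around + timedelta(minutes=window_min)
    params = urlencode({
        "service": service,
        "start": int(start.timestamp() * 1_000_000),  # Jaeger expects microseconds
        "end": int(end.timestamp() * 1_000_000),
        "minDuration": min_duration,
        "limit": 20,
    })
    return f"http://localhost:16686/api/traces?{params}"

spike = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
url = jaeger_trace_url("order-service", spike)
```

Fetching that URL (e.g. with requests) returns the matching traces as JSON, which is handy for gluing the Grafana-to-Jaeger hop into a runbook script.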
