When Prometheus Isn’t Enough to Diagnose Issues
Prometheus and Grafana tell you that something is wrong — latency spikes, error rates climbing. But when your system has 8–10 services talking to each other, the harder question is: which service is the problem, and at which step in the request chain?
I once spent nearly 3 hours debugging a request that took 4 seconds. Each service's logs looked fine, and the metrics showed nothing unusual. Eventually I discovered one service was hitting the database with N+1 queries, but I had to manually grep through 5 services to find it. That's when I started looking into distributed tracing.
OpenTelemetry + Jaeger solves exactly this problem. Instead of manually grepping logs, you see the entire journey of a request across services as a timeline — each step measured in milliseconds, clearly showing which step is slow and by how much.
Setting Up Jaeger and OpenTelemetry Collector
Use Docker Compose to spin up the stack. The architecture here is: applications send traces to the OTel Collector, which then forwards them to Jaeger. Adding a Collector layer might seem redundant, but it batches spans before sending them, significantly reducing load on Jaeger — especially under high traffic.
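To see why the batch processor matters, here is a back-of-the-envelope sketch (illustrative numbers only): without batching, every span is its own export call; with the 1024-span batches configured below, thousands of spans collapse into a handful of RPCs to Jaeger.

```python
import math

def export_calls(num_spans: int, batch_size: int = 1024) -> int:
    """How many export RPCs are needed to ship num_spans spans."""
    return math.ceil(num_spans / batch_size)

# At 5,000 spans/s: one call per span vs. batched
unbatched = export_calls(5000, batch_size=1)   # 5000 RPCs per second
batched = export_calls(5000)                   # 5 RPCs per second
```

The real Collector also flushes on the 1s timeout, so a partially filled batch never waits longer than that.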
Create the docker-compose.yml file
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.57
    ports:
      - "16686:16686"   # Jaeger UI
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    # The Collector reaches Jaeger's OTLP gRPC receiver on jaeger:4317
    # over the compose network, so that port needs no host mapping.

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.99.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - jaeger
Configure the OpenTelemetry Collector
Create the otel-collector-config.yaml file:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp/jaeger:
    # Jaeger's OTLP receiver (enabled via COLLECTOR_OTLP_ENABLED above)
    endpoint: jaeger:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
Start the stack:
docker compose up -d
# Check the Jaeger UI at http://localhost:16686
Instrumenting a Python Application with OpenTelemetry
The example below uses a Flask API acting as the order-service — it receives requests from clients, then calls out to inventory-service and pricing-service. Install the required libraries:
pip install opentelemetry-sdk \
    opentelemetry-exporter-otlp-proto-grpc \
    opentelemetry-instrumentation-flask \
    opentelemetry-instrumentation-requests
Initialize the tracer in your application
# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

def setup_tracing(service_name: str):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    exporter = OTLPSpanExporter(
        endpoint="http://localhost:4317",  # the OTel Collector, not Jaeger
        insecure=True,
    )
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)
Integrate into the Flask app
# app.py
from flask import Flask, jsonify
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
import requests

from tracing import setup_tracing

app = Flask(__name__)
tracer = setup_tracing("order-service")

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

@app.route("/order/<int:order_id>")
def get_order(order_id):
    with tracer.start_as_current_span("fetch-order-details") as span:
        span.set_attribute("order.id", order_id)

        # Child span: call inventory service
        with tracer.start_as_current_span("check-inventory"):
            inventory = requests.get(
                f"http://inventory-service/stock/{order_id}"
            ).json()

        # Child span: call pricing service
        with tracer.start_as_current_span("calculate-price"):
            price = requests.get(
                f"http://pricing-service/price/{order_id}"
            ).json()

        span.set_attribute("order.total", price.get("total", 0))
    return jsonify({"order": order_id, "inventory": inventory, "price": price})

if __name__ == "__main__":
    app.run(port=5000)
The key point: when order-service calls another service over HTTP, RequestsInstrumentor automatically injects the trace context into the outgoing headers (the W3C traceparent header, which carries the trace ID and parent span ID). The receiving service reads that header and attaches its own spans to the same trace. This means every service in your system must be instrumented: skip one and the trace chain breaks at that point.
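For intuition about what actually travels between services, here is a minimal sketch of parsing that traceparent header. The header value is an example from the W3C Trace Context spec; the parser is illustrative, not the SDK's own propagator.

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header: version-traceid-spanid-flags."""
    version, trace_id, span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        # Bit 0 of the flags byte marks the trace as sampled
        "sampled": bool(int(flags, 16) & 0x01),
    }

hdr = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(hdr)
```

The receiving service uses the trace ID to join the same trace and the span ID as the parent of its own spans; in practice the OTel propagator does all of this for you.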
Reading Traces in the Jaeger UI
Send 10–20 test requests to the API, then open http://localhost:16686.
Finding the trace for a request
- Select the Service: order-service
- Click Find Traces
- Pick a trace with a high duration to analyze
Jaeger displays the timeline as a waterfall view. You can immediately see which span consumed the most time — for example, if check-inventory took 800ms out of a total 1 second, that’s where you should look first. No guessing, no grepping.
Adding attributes for easier debugging
Custom attributes let you attach business logic context to spans — incredibly useful when you need to find the trace for a specific user or order:
with tracer.start_as_current_span("query-database") as span:
    span.set_attribute("db.query", sql_query)
    span.set_attribute("user.id", user_id)
    try:
        results = run_query(sql_query)  # run_query is your own DB helper
        span.set_attribute("db.rows_returned", len(results))
    except Exception as exc:
        # If an error occurs, record it so Jaeger flags the trace:
        span.record_exception(exc)
        span.set_status(trace.Status(trace.StatusCode.ERROR))
        raise
Filtering traces by condition
Jaeger supports finding traces by tag. For example, find all requests from user 12345, or filter for traces with errors:
# In the Jaeger UI, under Tags:
user.id=12345
# Or find traces with errors:
error=true
Lessons Learned from Production
The first time I set everything up, I enabled 100% trace sampling — meaning every single request was traced. It worked great during testing. In production at ~500 req/s, the Jaeger UI started lagging and the Collector began dropping spans. Limit your sampling rate from the start:
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Only trace 10% of requests
sampler = TraceIdRatioBased(0.1)
provider = TracerProvider(resource=resource, sampler=sampler)
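One property of TraceIdRatioBased worth understanding: the decision is derived from the trace ID itself, not from a random roll, so every service that sees the same trace makes the same keep/drop decision. The sketch below mirrors that idea; it is illustrative, not the SDK's exact implementation.

```python
TRACE_ID_SPACE = 2 ** 64  # the sampler looks at the low 64 bits of the ID

def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio sampling: keep the trace if its low bits
    fall below ratio * 2^64. Same ID -> same answer everywhere."""
    bound = round(ratio * TRACE_ID_SPACE)
    return (trace_id & (TRACE_ID_SPACE - 1)) < bound

keep_low = should_sample(0, 0.1)            # low IDs fall under the bound
keep_high = should_sample(2**64 - 1, 0.1)   # high IDs are dropped
```

This is why you set the sampler on the service that starts the trace; downstream services follow the sampled flag propagated in the traceparent header.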
Alerting was a similar story. I once set up alerts for every span slower than 500ms — the result was dozens of Telegram alerts per hour, to the point where I muted notifications entirely. It took several rounds of tuning to land on thresholds that actually meant something. A better approach: alert on the end-to-end latency of the entire trace, not individual spans. A span that takes 600ms isn’t necessarily a problem if the total request still comes in under 1 second.
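The trace-level rule above can be sketched in a few lines. The span dicts and field names here are hypothetical (your real spans come from Jaeger or your alerting pipeline); the point is that end-to-end latency is the envelope of all spans, not any single one.

```python
def trace_latency_ms(spans: list[dict]) -> float:
    """End-to-end latency: earliest span start to latest span end."""
    start = min(s["start_ms"] for s in spans)
    end = max(s["end_ms"] for s in spans)
    return end - start

def should_alert(spans: list[dict], threshold_ms: float = 1000) -> bool:
    return trace_latency_ms(spans) > threshold_ms

spans = [
    {"name": "fetch-order-details", "start_ms": 0, "end_ms": 950},
    {"name": "check-inventory", "start_ms": 10, "end_ms": 610},  # a 600 ms span
]
total = trace_latency_ms(spans)  # 950 ms: slow span, but no alert
```

The 600 ms check-inventory span alone would have paged under the per-span rule, yet the whole request finishes in 950 ms, under the 1-second threshold.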
One small thing that saves a lot of time: name your spans using a verb-noun pattern, like fetch-user-profile, insert-order-db, send-notification-email. When a trace has 50 spans, clear names make the waterfall chart much faster to scan than vague names like process or handler.
Integrating with Prometheus (Optional)
The OTel Collector can also export metrics to Prometheus — a single pipeline for both traces and metrics:
# Add to otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
My incident workflow now looks like this: spot a latency spike in Grafana → note the timestamp → jump to Jaeger and filter traces for that time window → find traces with unusual duration → drill down into individual spans. From alert to root cause typically takes under 5 minutes, compared to the 3 hours of log grepping I used to do.
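The "jump to Jaeger for that time window" step can even be scripted against the query API that backs the Jaeger UI. That API is internal and unversioned, so treat the endpoint and parameter names below as an assumption that may change between releases; the function and spike timestamp are hypothetical examples.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def jaeger_trace_url(service: str, around: datetime, window_min: int = 5,
                     min_duration: str = "500ms") -> str:
    """Build a query URL for traces of `service` within +/- window_min
    minutes of a Grafana spike, filtered to slow traces."""
    start = around - timedelta(minutes=window_min)
    end = around + timedelta(minutes=window_min)
    params = urlencode({
        "service": service,
        "start": int(start.timestamp() * 1_000_000),  # Jaeger expects microseconds
        "end": int(end.timestamp() * 1_000_000),
        "minDuration": min_duration,
        "limit": 20,
    })
    return f"http://localhost:16686/api/traces?{params}"

spike = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
url = jaeger_trace_url("order-service", spike)
```

Fetching that URL (e.g. with requests) returns the matching traces as JSON, which is handy for gluing the Grafana-to-Jaeger hop into a runbook script.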
