When logs and metrics aren’t enough to debug production
I once spent 3 hours debugging a slow request in a microservices system. I had plenty of logs from Loki and plenty of metrics from Prometheus, but I still couldn’t pinpoint which service was the bottleneck. The logs only said “something is wrong” and the metrics only said “latency is up”. Nothing connected the two: which services did the request pass through? How long did each step take? Where exactly did it fail?
That’s when I started looking into distributed tracing and set up Grafana Tempo. This article gets straight to the point: installing Tempo from scratch, connecting it with Loki and Prometheus, then configuring it so you can jump from a single trace directly to the related logs and service metrics — all within one Grafana instance.
What is Distributed Tracing and Why Choose Tempo
The simplest way to think about it: distributed tracing is like attaching a GPS to every request. As a request passes through service-auth, service-order, service-payment — each step produces a “span”. All spans are assembled into a single “trace”, showing the complete journey: where it went, where it got stuck, where it failed.
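To make the span/trace relationship concrete, here is a minimal sketch in plain Python. It is not Tempo's data model or any OpenTelemetry API; the `Span` class, the `self_time_ms` and `bottleneck` helpers, and the timings are all hypothetical, and the self-time calculation assumes child spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of the same request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another (no overlap).
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# One request ("trace") passing through three services.
trace = [
    Span("t1", "a", None, "service-order", "POST /checkout", 0, 900),
    Span("t1", "b", "a", "service-auth", "verify-token", 10, 60),
    Span("t1", "c", "a", "service-payment", "charge-card", 100, 850),
]

slowest = bottleneck(trace)
print(f"{slowest.service} / {slowest.name}: {self_time_ms(slowest, trace)} ms")
```

The root span lasts 900 ms, but most of that is spent waiting on its children. Looking at self time rather than total duration is what points to the payment call as the place where the request got stuck.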
Grafana Tempo only needs object storage (local disk, S3, GCS) as its backend and looks traces up by ID instead of maintaining a full index, so there’s no Elasticsearch or Cassandra cluster to run as there is with a typical Jaeger deployment. The deciding factor for me: Tempo is built to run alongside Loki and Prometheus in a single Grafana, so jumping from traces to logs to metrics is a built-in data source feature, with no extra plugins or manual link-building required.
Installing Tempo with Docker Compose
Prepare the directory structure
mkdir -p tempo-stack/{tempo,prometheus,grafana/provisioning/datasources}
cd tempo-stack
Tempo configuration file
Create the file tempo/tempo.yaml:
stream_over_http_enabled: true
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
        grpc:
          endpoint: 0.0.0.0:4317
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268
ingester:
  max_block_duration: 5m
compactor:
  compaction:
    block_retention: 24h # increase to 168h for production
storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    wal:
      path: /tmp/tempo/wal
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: docker-compose
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
      generate_native_histograms: both
Docker Compose
Create docker-compose.yml:
version: "3.8"
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "3200:3200"   # Tempo HTTP API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "14268:14268" # Jaeger HTTP
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--web.enable-remote-write-receiver" # required for Tempo to push metrics
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    restart: unless-stopped
volumes:
  tempo-data:
  prometheus-data:
  grafana-data:
One important note: Prometheus must be started with its remote-write receiver enabled to accept metrics from the Tempo metrics generator. On current Prometheus releases the flag is --web.enable-remote-write-receiver; the older --enable-feature=remote-write-receiver spelling was deprecated in favor of it, so use the old form only if you pin an old Prometheus image. I forgot this the first time around and spent quite a while figuring out why the service graph wasn’t showing anything in Grafana.
Also note that the compose file mounts ./prometheus/prometheus.yml, so that file has to exist before you start the stack. A minimal config is enough here, since Tempo pushes its metrics via remote write rather than being scraped. The anonymous Admin access configured for Grafana is convenient for a local lab but should not be carried over to a shared or production instance.
Configuring Datasources for Correlation to Work
This is the step many people skip after finishing the Tempo install — then wonder why clicking on a trace doesn’t jump to the logs. Create the file grafana/provisioning/datasources/datasources.yaml:
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      httpMethod: GET
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        tags:
          - key: service.name
            value: service
        queries:
          - name: Request Rate
            query: rate(traces_spanmetrics_calls_total{$$__tags}[5m])
          - name: Error Rate
            query: rate(traces_spanmetrics_calls_total{$$__tags,status_code="STATUS_CODE_ERROR"}[5m])
          # Tempo's span-metrics processor exposes latency as traces_spanmetrics_latency (in seconds)
          - name: Duration P95
            query: histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{$$__tags}[5m])) by (le))
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      lokiSearch:
        datasourceUid: loki
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          # \s* tolerates JSON loggers that put a space after the colon
          matcherRegex: '"trace_id":\s*"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"
          urlDisplayLabel: "View Trace in Tempo"
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true
This config does three key things:
- tracesToLogsV2: From a span in Tempo, automatically queries Loki for logs within the same time window as that span — no need to remember timestamps or filter manually
- tracesToMetrics: From a trace, jumps to Prometheus to view request rate, error rate, and P95 latency for the service
- Loki derivedFields: From a log entry containing a trace_id, creates a direct link to the corresponding trace in Tempo
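The derived field only produces a link when its matcherRegex actually matches the log line, and a pattern such as "trace_id":"(\w+)" is sensitive to whitespace after the colon. A quick way to check a pattern before provisioning it is to run it against sample lines. The sketch below assumes Python's `re` treats these simple patterns the same way Grafana's regex engine does; the sample log lines and the `extract_trace_id` helper are illustrative, not taken from a real system.

```python
import re

# Strict pattern: matches only compact JSON with no space after the colon.
STRICT = re.compile(r'"trace_id":"(\w+)"')
# Tolerant pattern: also accepts optional whitespace after the colon.
TOLERANT = re.compile(r'"trace_id":\s*"(\w+)"')

TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"  # made-up 32-hex-digit trace ID

compact_line = '{"level":"INFO","msg":"order created","trace_id":"%s"}' % TRACE_ID
spaced_line = '{"level": "INFO", "msg": "order created", "trace_id": "%s"}' % TRACE_ID


def extract_trace_id(pattern, line):
    """Return the captured trace ID, or None when the pattern does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


for label, line in [("compact", compact_line), ("spaced", spaced_line)]:
    print(label, "strict:", extract_trace_id(STRICT, line))
    print(label, "tolerant:", extract_trace_id(TOLERANT, line))
```

If the strict pattern returns None for your real log lines, either loosen the regex or make the logger emit compact JSON. Otherwise the “View Trace in Tempo” link never appears.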
Sending Traces from Your Application to Tempo
Tempo accepts traces via multiple protocols — Jaeger, Zipkin, OTLP. Using OTLP is the best choice as it’s an open standard and most modern frameworks support it out of the box. Example with Python:
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation-fastapi
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
resource = Resource.create({
    "service.name": "my-api",
    "service.version": "1.0.0",
})
provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)
# Usage in code:
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    span.set_attribute("user.id", "user-abc")
    # ... business logic here
For Traces → Logs correlation to work, your application logs must include a trace_id. Add this filter to your logger:
import logging
from opentelemetry import trace

class TraceIDFilter(logging.Filter):
    """Attach the current trace ID (32 hex chars) to every log record."""
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        # Empty string when the log line is emitted outside of any span
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        return True

# JSON format so Loki can parse it and Grafana can extract the trace_id.
# Compact form (no space after the colons) so a derived-field regex like "trace_id":"(\w+)" matches.
handler = logging.StreamHandler()
handler.addFilter(TraceIDFilter())
handler.setFormatter(logging.Formatter(
    '{"time":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s","trace_id":"%(trace_id)s"}'
))
logging.getLogger().addHandler(handler)
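One caveat with building JSON through a format string: a message that itself contains a double quote produces a line that is no longer valid JSON. A possible alternative, sketched below, is a small formatter that serializes with `json.dumps`. The class name `JsonTraceFormatter` and the sample record are hypothetical, and the sketch assumes a filter like the one above has already set `record.trace_id` (it falls back to an empty string otherwise).

```python
import json
import logging


class JsonTraceFormatter(logging.Formatter):
    """Serialize each record as one compact JSON object per line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Set by a logging filter; empty when no filter is attached
            "trace_id": getattr(record, "trace_id", ""),
        }
        # Compact separators keep the output as "trace_id":"..." with no spaces
        return json.dumps(payload, separators=(",", ":"))


# Build a record by hand to show the output without configuring a logger.
record = logging.LogRecord(
    name="demo", level=logging.INFO, pathname="demo.py", lineno=0,
    msg='payment "declined" for order %s', args=("12345",), exc_info=None,
)
record.trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"  # made-up trace ID

line = JsonTraceFormatter().format(record)
print(line)
```

To use it, pass an instance to `handler.setFormatter(...)` in place of the format-string formatter; the filter stays the same.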
Testing and Real-World Usage
Start the stack and verify
docker compose up -d
# Check that Tempo is running
curl http://localhost:3200/status
# View received traces (after the app sends some)
curl "http://localhost:3200/api/search?limit=5"
Querying with TraceQL
TraceQL is Tempo’s own query language — the syntax is similar to PromQL but operates on traces instead of metrics. Go to Grafana → Explore → select the Tempo datasource to try it out:
# Find traces for a specific service
{ .service.name = "my-api" }
# Find traces with errors
{ status = error }
# Find traces slower than 1 second
{ duration > 1s }
# Combined: error requests from my-api service, slower than 500ms
{ .service.name = "my-api" && status = error && duration > 500ms }
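The same queries can be run outside Grafana through Tempo's search endpoint, which accepts a TraceQL expression in the `q` parameter. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the base URL assumes the Docker Compose port mapping from earlier, and the response is assumed to carry its results under a `traces` key.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen


def build_search_url(base_url, traceql, limit=20):
    """Build a Tempo search URL with a URL-encoded TraceQL query."""
    query = urlencode({"q": traceql, "limit": limit})
    return f"{base_url.rstrip('/')}/api/search?{query}"


def search_traces(base_url, traceql, limit=20):
    """Run the query against a live Tempo and return the matching trace summaries."""
    with urlopen(build_search_url(base_url, traceql, limit), timeout=10) as resp:
        return json.load(resp).get("traces", [])


traceql = '{ .service.name = "my-api" && status = error && duration > 500ms }'
url = build_search_url("http://localhost:3200", traceql, limit=5)
print(url)

# With the stack running, something like:
#   for t in search_traces("http://localhost:3200", traceql, limit=5):
#       print(t)
```

Encoding the query with `urlencode` matters because TraceQL expressions contain braces, quotes, and `&&`, all of which would otherwise be misread as part of the URL.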
Real-world debug workflow
Before, when a latency alert fired, I’d open 3–4 tabs side by side: Prometheus for metrics, Loki to grep logs, then manually line up timestamps to figure out which requests were affected. Since setting up this stack, the process now looks like this:
- Receive a latency alert from Alertmanager
- Grafana Explore → Tempo, filter { duration > 2s } to immediately see which requests are slow
- Click into a trace, see which span is consuming the most time
- Click “Logs for this span” → Loki automatically filters logs to that span’s time window (plus the one-minute margin configured on each side)
- Click “Metrics” → Prometheus shows request rate, error rate, and P95 latency for that service around that moment
Investigation now takes 5 minutes instead of 30, and the team has started trusting alerts again instead of ignoring them.
Things to Keep in Mind for Production
- Sampling: Don’t send 100% of traces. Head-based sampling at 10–20% is sufficient for most systems. Use tail-based sampling if you want to retain 100% of error traces.
- Object storage: Replace backend: local with S3 or GCS so trace data isn’t tied to a single host’s disk and survives the container or its volume being recreated
- Retention: 24h for dev, 7–30 days for production depending on compliance requirements
# Production: use S3 instead of local
# Note: ${...} placeholders are only expanded when Tempo is started with -config.expand-env=true
storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-traces
      endpoint: s3.amazonaws.com
      region: ap-northeast-1
      access_key: ${S3_ACCESS_KEY}
      secret_key: ${S3_SECRET_KEY}
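Coming back to the sampling point above: head-based ratio sampling is usually decided from the trace ID itself, so every service that sees the same trace makes the same keep-or-drop decision without coordinating. The sketch below illustrates that idea in plain Python. It is modeled loosely on how trace-ID ratio samplers work and is not the OpenTelemetry SDK's code; the function name and the 10% ratio are assumptions for the example.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits of the 128-bit ID are used


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64.

    The decision is a pure function of the trace ID, so it is the same
    in every service that handles the request.
    """
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Simulate 10,000 random 128-bit trace IDs (seeded for repeatability).
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

kept = sum(should_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is made before the request finishes, this approach cannot know whether a trace will end in an error. Keeping all error traces requires tail-based sampling, which decides after the spans have been collected.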
With this setup, the next time a latency alert fires, you won’t have to spend 3 hours hunting in the dark like I did. From a single trace ID, jumping to the related logs and metrics takes just a few seconds — the full context of an incident right there on one Grafana screen.

