Grafana Tempo: Setting Up Distributed Tracing and Integrating Loki + Prometheus in a Single Grafana

Monitoring tutorial - IT technology blog

When logs and metrics aren’t enough to debug production

I once spent 3 hours debugging a slow request in a microservices system. I had plenty of logs from Loki and plenty of metrics from Prometheus, but I still couldn’t pinpoint which service was the bottleneck. The logs only said “something is wrong”; the metrics only said “latency is up”. Nothing connected the two: which services did the request pass through? How long did each step take? Where exactly did it fail?

That’s when I started looking into distributed tracing and set up Grafana Tempo. This article gets straight to the point: installing Tempo from scratch, connecting it with Loki and Prometheus, then configuring it so you can jump from a single trace directly to the related logs and service metrics — all within one Grafana instance.

What is Distributed Tracing and Why Choose Tempo

The simplest way to think about it: distributed tracing is like attaching a GPS to every request. As a request passes through service-auth, service-order, service-payment — each step produces a “span”. All spans are assembled into a single “trace”, showing the complete journey: where it went, where it got stuck, where it failed.
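To make the span/trace relationship concrete, here is a toy model in plain Python. It is only a sketch: the Span class, its field names, and the timings are hypothetical and are not Tempo's or OpenTelemetry's actual data model. It shows how spans that share a trace ID and point at a parent span let you work out where the time went.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Time spent in the span itself, excluding its direct children.

    Assumes the children run one after another (no overlap).
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# Hypothetical trace: auth calls order, order calls payment.
spans = [
    Span("t1", "a", None, "service-auth", "POST /checkout", 0, 900),
    Span("t1", "b", "a", "service-order", "create-order", 50, 850),
    Span("t1", "c", "b", "service-payment", "charge-card", 100, 800),
]

slowest = bottleneck(spans)
print(f"{slowest.service}/{slowest.name}: {self_time_ms(slowest, spans)} ms of self time")
```

The root span is the longest one, but most of its duration is spent waiting on children. Comparing self time rather than total duration is what points at service-payment here, and it is essentially what you do by eye in a trace waterfall view.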

Grafana Tempo stores traces directly in object storage (S3, GCS, or local disk for testing) without a separate index database, so there’s no Elasticsearch or Cassandra cluster to maintain as with a typical production Jaeger setup. The deciding factor for me: Tempo is built to run alongside Loki and Prometheus in a single Grafana, so jumping from traces to logs to metrics is native, with no extra plugins or manual link-building required.

Installing Tempo with Docker Compose

Prepare the directory structure

mkdir -p tempo-stack/{tempo,prometheus,grafana/provisioning/datasources}
cd tempo-stack

Tempo configuration file

Create the file tempo/tempo.yaml:

stream_over_http_enabled: true

server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        http:
          endpoint: 0.0.0.0:4318
        grpc:
          endpoint: 0.0.0.0:4317
    jaeger:
      protocols:
        thrift_http:
          endpoint: 0.0.0.0:14268

ingester:
  max_block_duration: 5m

compactor:
  compaction:
    block_retention: 24h  # increase to 168h for production

storage:
  trace:
    backend: local
    local:
      path: /tmp/tempo/blocks
    wal:
      path: /tmp/tempo/wal

metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: docker-compose
  storage:
    path: /tmp/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
      generate_native_histograms: both

Docker Compose

Create docker-compose.yml:

version: "3.8"

services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo/tempo.yaml:/etc/tempo.yaml
      - tempo-data:/tmp/tempo
    ports:
      - "3200:3200"   # Tempo HTTP API
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "14268:14268" # Jaeger HTTP
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
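      - "--web.enable-remote-write-receiver"  # required for Tempo to push metrics; the older --enable-feature=remote-write-receiver form has been deprecated since Prometheus 2.33 and may be ignored by current images
      - "--enable-feature=exemplar-storage"   # lets Prometheus keep the exemplars Tempo sends (send_exemplars: true)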
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    ports:
      - "9090:9090"
    restart: unless-stopped

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - grafana-data:/var/lib/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin  # local testing only; do not expose an anonymous Admin publicly
    restart: unless-stopped

volumes:
  tempo-data:
  prometheus-data:
  grafana-data:

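One important note: Prometheus must be started with its remote-write receiver enabled (the flag on the prometheus service above) to receive metrics from the Tempo metrics generator. I forgot this the first time around and spent quite a while figuring out why the service graph wasn’t showing anything in Grafana. The Compose file also mounts ./prometheus/prometheus.yml, so that file must exist before you start the stack; a minimal config containing only a global scrape_interval is enough, because Tempo pushes its metrics rather than being scraped.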

Configuring Datasources for Correlation to Work

This is the step many people skip after finishing the Tempo install — then wonder why clicking on a trace doesn’t jump to the logs. Create the file grafana/provisioning/datasources/datasources.yaml:

apiVersion: 1

datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      httpMethod: GET
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: "-1m"
        spanEndTimeShift: "1m"
        tags:
          - key: service.name
            value: service
        queries:
          - name: Request Rate
            query: rate(traces_spanmetrics_calls_total{$$__tags}[5m])
          - name: Error Rate
            query: rate(traces_spanmetrics_calls_total{$$__tags,status_code="STATUS_CODE_ERROR"}[5m])
          - name: Duration P95
            query: histogram_quantile(0.95, sum(rate(traces_spanmetrics_latency_bucket{$$__tags}[5m])) by (le))
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      lokiSearch:
        datasourceUid: loki

  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":\s*"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"
          urlDisplayLabel: "View Trace in Tempo"

  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true

This config does three key things:

  • tracesToLogsV2: From a span in Tempo, automatically queries Loki for logs containing that trace ID, in the span’s time window widened by one minute on each side, so there is no need to remember timestamps or filter manually
  • tracesToMetrics: From a trace, jumps to Prometheus to view request rate, error rate, and P95 latency for the service
  • Loki derivedFields: From a log entry containing a trace_id, creates a direct link to the corresponding trace in Tempo
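A derived-field regex only produces a link when it matches the log line exactly as written, including any whitespace after the colon. A quick way to check a pattern before putting it in Grafana is to run it against a real log line. The sketch below uses Python's re module and a made-up log line. Grafana evaluates the pattern with its own regex engine, so treat this as a sanity check under the assumption that simple patterns like these behave the same way in both.

```python
import re

# Hypothetical log line in the shape a '{"key": "%(value)s"}' format string
# produces, i.e. with a space after each colon.
log_line = (
    '{"time": "2024-01-01 12:00:00,000", "level": "INFO", '
    '"msg": "order created", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}'
)

strict = re.compile(r'"trace_id":"(\w+)"')       # no whitespace allowed
tolerant = re.compile(r'"trace_id":\s*"(\w+)"')  # optional whitespace after the colon


def extract_trace_id(pattern, line):
    """Return the first capture group, or None when the pattern does not match."""
    match = pattern.search(line)
    return match.group(1) if match else None


print("strict:  ", extract_trace_id(strict, log_line))
print("tolerant:", extract_trace_id(tolerant, log_line))
```

The strict pattern finds nothing in this line because of the single space after "trace_id":, which is exactly the kind of mismatch that makes the “View Trace in Tempo” link silently fail to appear.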

Sending Traces from Your Application to Tempo

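Tempo can ingest traces over several protocols, including OTLP, Jaeger, and Zipkin; the config above enables the OTLP and Jaeger receivers. OTLP is the one I recommend: it is the OpenTelemetry project’s vendor-neutral protocol, and most modern frameworks have instrumentation that exports it. Example with Python: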

pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation-fastapi

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "my-api",
    "service.version": "1.0.0",
})

provider = TracerProvider(resource=resource)
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True,
)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Usage in code:
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", "12345")
    span.set_attribute("user.id", "user-abc")
    # ... business logic here

For Traces → Logs correlation to work, your application logs must include a trace_id. Add this filter to your logger:

import logging
from opentelemetry import trace

class TraceIDFilter(logging.Filter):
    def filter(self, record):
        span = trace.get_current_span()
        ctx = span.get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else ""
        return True

# JSON format so Loki can parse it and Grafana can extract the trace_id
handler = logging.StreamHandler()
handler.addFilter(TraceIDFilter())
handler.setFormatter(logging.Formatter(
    '{"time": "%(asctime)s", "level": "%(levelname)s", "msg": "%(message)s", "trace_id": "%(trace_id)s"}'
))
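logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)  # the root logger defaults to WARNING, which would hide info-level logs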
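One caveat with building JSON from a format string: a message that itself contains a double quote produces a line that is no longer valid JSON. A possible alternative, sketched below with only the standard library, is a formatter that serializes each record with json.dumps. The JsonFormatter name is my own, and the trace_id attribute is assumed to be set on the record by a filter such as the one above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Serialize each record with json.dumps so quotes in messages are escaped."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            # Expected to be attached by a logging filter; empty when absent.
            "trace_id": getattr(record, "trace_id", ""),
        }
        # Compact separators emit "trace_id":"..." with no space after the colon.
        return json.dumps(payload, separators=(",", ":"))


# Build a record by hand to show the output (hypothetical values).
record = logging.LogRecord(
    name="demo", level=logging.INFO, pathname="demo.py", lineno=1,
    msg='user said "hello"', args=(), exc_info=None,
)
record.trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

line = JsonFormatter().format(record)
parsed = json.loads(line)  # parses cleanly even though the message contains quotes
print(line)
```

To use it, pass JsonFormatter() to handler.setFormatter in place of the format string. Note that the compact separators change the spacing of the output, so whichever regex extracts the trace_id in Grafana has to agree with it.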

Testing and Real-World Usage

Start the stack and verify

docker compose up -d

# Check that Tempo is running
curl http://localhost:3200/status

# View received traces (after the app sends some)
curl "http://localhost:3200/api/search?limit=5"
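If you would rather check the search result from a script than read raw JSON, something like the following could work. It is a sketch under assumptions: the field names (traces, traceID, rootServiceName, durationMs) reflect my understanding of the Tempo search response and should be verified against your Tempo version, and the sample payload is made up.

```python
import json
import urllib.request


def fetch_search(base_url="http://localhost:3200", limit=5):
    """Call Tempo's search endpoint and return the decoded JSON body."""
    url = f"{base_url}/api/search?limit={limit}"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def summarize(search_response):
    """Return (trace ID, root service, duration in ms) tuples, slowest first."""
    rows = [
        (t.get("traceID", ""), t.get("rootServiceName", ""), t.get("durationMs", 0))
        for t in search_response.get("traces", [])
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical response, shaped like the fields assumed above.
sample = {
    "traces": [
        {"traceID": "2f3e0cee77ae5dc9c17ade3689eb2e54",
         "rootServiceName": "my-api", "durationMs": 120},
        {"traceID": "5b8efff798038103d269b633813fc60c",
         "rootServiceName": "my-api", "durationMs": 2300},
    ]
}

for trace_id, service, duration_ms in summarize(sample):
    print(f"{duration_ms:>6} ms  {service}  {trace_id}")
```

Against a running stack, summarize(fetch_search()) gives the same view for real traces. An empty list usually means the application has not sent anything yet, or is sending to the wrong port.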

Querying with TraceQL

TraceQL is Tempo’s own query language — the syntax is similar to PromQL but operates on traces instead of metrics. Go to Grafana → Explore → select the Tempo datasource to try it out:

# Find traces for a specific service
{ .service.name = "my-api" }

# Find traces with errors
{ status = error }

# Find traces containing a span longer than 1 second
{ duration > 1s }

# Combined: error spans from the my-api service that took longer than 500ms
{ .service.name = "my-api" && status = error && duration > 500ms }
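The same queries can be sent to Tempo’s HTTP API, which is handy for scripts and alert runbooks. The helper below is a sketch: the function names are my own, and it assumes the search endpoint accepts a TraceQL expression in the q parameter. It builds the query string from a few optional filters and URL-encodes it.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def traceql(service=None, errors_only=False, min_duration=None):
    """Build a single-spanset TraceQL query from optional filters."""
    conditions = []
    if service:
        conditions.append(f'.service.name = "{service}"')
    if errors_only:
        conditions.append("status = error")
    if min_duration:
        conditions.append(f"duration > {min_duration}")
    if not conditions:
        return "{ }"  # an empty spanset filter matches every span
    return "{ " + " && ".join(conditions) + " }"


def search_url(query, base_url="http://localhost:3200", limit=20):
    """Return a search URL with the TraceQL query safely URL-encoded."""
    return f"{base_url}/api/search?" + urlencode({"q": query, "limit": limit})


query = traceql(service="my-api", errors_only=True, min_duration="500ms")
url = search_url(query)
print(query)
print(url)

# Decoding the URL again should give back the original query unchanged.
decoded = parse_qs(urlparse(url).query)["q"][0]
```

Encoding matters here because TraceQL is full of characters such as braces, quotes, and &&, which would otherwise be misread as part of the URL.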

Real-world debug workflow

Before, when a latency alert fired, I’d open 3–4 tabs side by side: Prometheus for metrics, Loki to grep logs, then manually line up timestamps to figure out which requests were affected. Since setting up this stack, the process now looks like this:

  1. Receive a latency alert from Alertmanager
  2. Grafana Explore → Tempo, filter { duration > 2s } to immediately see which requests are slow
  3. Click into a trace, see which span is consuming the most time
  4. Click “Logs for this span” → Loki automatically filters logs to the exact time window of that span
  5. Click “Metrics” → Prometheus shows request rate, error rate, and P95 latency for that service around that moment

Investigation now takes 5 minutes instead of 30, and the team has started trusting alerts again instead of ignoring them.

Things to Keep in Mind for Production

  • Sampling: Don’t send 100% of traces. Head-based sampling at 10–20% is sufficient for most systems. Use tail-based sampling if you want to retain 100% of error traces.
  • Object storage: Replace backend: local with S3 or GCS so trace data no longer depends on a single host’s disk or Docker volume
  • Retention: 24h for dev, 7–30 days for production depending on compliance requirements

# Production: use S3 instead of local
storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-traces
      endpoint: s3.amazonaws.com
      region: ap-northeast-1
      access_key: ${S3_ACCESS_KEY}  # ${VAR} is only expanded when Tempo runs with -config.expand-env=true
      secret_key: ${S3_SECRET_KEY}
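On the sampling point above: head-based ratio sampling is usually a deterministic function of the trace ID, so every service that sees the same trace makes the same keep-or-drop decision. The plain-Python sketch below illustrates the idea and is not the SDK’s code. In a real application you would more likely pass the OpenTelemetry Python SDK’s own sampler, for example ParentBased(TraceIdRatioBased(0.1)) from opentelemetry.sdk.trace.sampling, as the sampler argument of TracerProvider.

```python
import random

# Only the low 64 bits of the 128-bit trace ID take part in the decision.
TRACE_ID_LIMIT = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace when its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Simulate 10,000 random trace IDs at a 10% sampling ratio.
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision is made when a trace starts, before anyone knows whether it will fail, head-based sampling cannot guarantee that error traces are kept. That is why the tail-based option exists: the decision moves to a component that sees the finished trace, typically a collector sitting between the application and Tempo.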

With this setup, the next time a latency alert fires, you won’t have to spend 3 hours hunting in the dark like I did. From a single trace ID, jumping to the related logs and metrics takes just a few seconds — the full context of an incident right there on one Grafana screen.
