How to Install and Configure Thanos to Scale Prometheus: Long-Term Metrics Storage and Global View Across Multiple Clusters – ITFROMZERO

What Problems Does Thanos Solve That Prometheus Can’t?

I used to run Prometheus across 3 separate Kubernetes clusters. Each cluster had its own Grafana, its own alerts. When a client asked “what were the metrics for service X on production last month?” — I had to open 3 tabs, dig through each cluster, and copy-paste numbers into Excel. Twenty minutes wasted on a simple question.

Prometheus only retains 15 days of data by default and offers no way to view all clusters in one place. Thanos was built to solve exactly these two pain points.

Quick Start: Get Thanos Running in 10 Minutes

The fastest way to try it out is Docker Compose, with MinIO as a local S3-compatible object store. No AWS account is needed; Docker with the Compose plugin is the only prerequisite.

Step 1: Create the Bucket Configuration File

mkdir -p ~/thanos-demo && cd ~/thanos-demo

cat > bucket.yaml <<EOF
type: S3
config:
  bucket: thanos
  endpoint: minio:9000
  access_key: minioadmin
  secret_key: minioadmin
  insecure: true
EOF

Step 2: Docker Compose for the Full Stack

# docker-compose.yml
version: '3.8'
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.max-block-duration=2h'
      - '--storage.tsdb.min-block-duration=2h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  thanos-sidecar:
    image: thanosio/thanos:latest
    command:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://prometheus:9090
      - --objstore.config-file=/bucket.yaml
      - --http-address=0.0.0.0:10902
      - --grpc-address=0.0.0.0:10901
    volumes:
      - prometheus_data:/prometheus
      - ./bucket.yaml:/bucket.yaml
    depends_on:
      - prometheus
      - minio

  thanos-store:
    image: thanosio/thanos:latest
    command:
      - store
      - --objstore.config-file=/bucket.yaml
      - --http-address=0.0.0.0:10902
      - --grpc-address=0.0.0.0:10901
    volumes:
      - ./bucket.yaml:/bucket.yaml
    depends_on:
      - minio

  thanos-query:
    image: thanosio/thanos:latest
    command:
      - query
      - --http-address=0.0.0.0:9091
      - --endpoint=thanos-sidecar:10901
      - --endpoint=thanos-store:10901
    ports:
      - "9091:9091"
    depends_on:
      - thanos-sidecar
      - thanos-store

volumes:
  minio_data:
  prometheus_data:

Step 3: Create a Minimal prometheus.yml

# prometheus.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'demo'
    replica: '0'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Step 4: Start the Stack and Create the MinIO Bucket

# Start the full stack
docker compose up -d

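# Create the 'thanos' bucket in MinIO (using the mc client).
# The alias and the mb command must run in the same container:
# an alias set in one --rm container is gone by the next run.
docker run --rm --network host --entrypoint sh minio/mc -c \
  "mc alias set local http://localhost:9000 minioadmin minioadmin && mc mb local/thanos"

# Start again any Thanos container that exited while the bucket was missing
docker compose up -d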

That’s it. Open http://localhost:9091 for the Thanos Query UI. Any PromQL query you run here pulls from the live Prometheus instance and, once the Sidecar has uploaded its first 2-hour block, from the historical data stored in MinIO.
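To confirm the stack is actually wired together, a small shell sketch like the one below can help. It assumes the stack runs locally on the ports from the Compose file, that curl is installed, and that Thanos Query serves /-/ready and /api/v1/stores on its HTTP port. The function names are made up for this example.

```shell
# Base URL of Thanos Query (assumption: the Compose stack runs on this machine)
THANOS_QUERY_URL="${THANOS_QUERY_URL:-http://localhost:9091}"

# Join a base URL and an API path without doubling the slash
api_url() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}

# Returns 0 once the component answers successfully on /-/ready
is_ready() {
  curl -fsS --max-time 5 "$(api_url "$1" /-/ready)" >/dev/null 2>&1
}

# Prints the JSON list of endpoints Query is connected to
# (the Sidecar and the Store Gateway in this demo)
list_stores() {
  curl -fsS --max-time 5 "$(api_url "$1" /api/v1/stores)"
}
```

Run `is_ready "$THANOS_QUERY_URL" && list_stores "$THANOS_QUERY_URL"` after `docker compose up -d`. If only one endpoint shows up, check the logs of the missing container.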

Thanos Architecture: What Does Each Component Do?

Thanos doesn’t replace Prometheus — it extends it. I usually explain it to colleagues like this: Prometheus is a small local warehouse, and Thanos is the central warehouse system that connects everything together.

Sidecar — The Bridge Between Prometheus and Object Storage

Runs in the same pod as Prometheus. It has two responsibilities: exposing a gRPC endpoint so Thanos Query can pull real-time data, and uploading 2-hour data blocks to S3/MinIO.

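Important note: Set --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to the same value (2h) on Prometheus. This turns off Prometheus's local compaction, so the Sidecar uploads every block as a 2-hour block once it is closed and leaves compaction to the Thanos Compactor. It never touches the block that is still being written.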
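A quick way to check a running Prometheus is to compare the two block-duration flags: when they are set to the same value, Prometheus does no local compaction and the Sidecar sees plain 2-hour blocks. The helper below is a minimal sketch; the function name is made up, and the jq lookup in the note after it assumes the flag names reported by Prometheus's /api/v1/status/flags endpoint.

```shell
# Succeeds when both durations are set and identical,
# which is the condition under which Prometheus skips local compaction
local_compaction_disabled() {
  min_dur="$1"
  max_dur="$2"
  [ -n "$min_dur" ] && [ "$min_dur" = "$max_dur" ]
}

if local_compaction_disabled "2h" "2h"; then
  echo "ok: min and max block duration match"
fi
```

Against a live server, the two values could be read with `curl -s http://localhost:9090/api/v1/status/flags | jq -r '.data["storage.tsdb.min-block-duration"], .data["storage.tsdb.max-block-duration"]'` and passed to the function.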

Store Gateway — The Door to Historical Storage

Need data that Prometheus has already dropped from local disk (older than 15 days by default)? Query asks the Store Gateway. This component serves the blocks stored in S3 and fetches series data on demand. The blocks themselves aren’t copied locally; only a small amount of index data is cached for fast lookups.

Query — The Single Query Endpoint

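Both Grafana and users point here. Thanos Query fans each query out to every endpoint whose time range and external labels can match: Sidecars answer for the recent data still held by Prometheus, and the Store Gateway answers for blocks in object storage. If multiple Prometheus replicas scrape the same targets, Query merges their series into one, provided you tell it which external label identifies a replica with --query.replica-label.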

Compactor — Cleanup and Cost Savings

Runs independently, either continuously (with --wait) or as a scheduled job, and as a single instance per bucket. Its job: merging small blocks into larger ones, downsampling older data, and applying retention. Data from a year ago doesn’t need 15-second granularity; downsampling to 1-hour resolution is more than enough for trend analysis.
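To see what the resolutions mean in numbers, the arithmetic below counts how many samples one series produces per day at each step. It is only a back-of-the-envelope sketch: downsampled blocks keep several aggregates (such as min, max, sum, and count) for each point, so the saving in bytes is smaller than the ratio of sample counts suggests.

```shell
# Samples stored per series per day at a given resolution in seconds
samples_per_day() {
  echo $(( 86400 / $1 ))
}

samples_per_day 15     # raw 15-second scrapes: 5760
samples_per_day 300    # 5-minute downsampling: 288
samples_per_day 3600   # 1-hour downsampling: 24
```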

Real-World Kubernetes Deployment Across Multiple Clusters

This is the setup I use most in practice. Each cluster runs its own Prometheus and Sidecar, all pushing to the same S3 bucket. Thanos Query sits in a central cluster and connects to all Sidecars over gRPC. If you haven’t already set up Prometheus monitoring on your Kubernetes clusters, that’s a good prerequisite before layering Thanos on top.

External Labels — Non-Negotiable

# Production Cluster
global:
  external_labels:
    cluster: 'production-sg'
    region: 'ap-southeast-1'

---
# Staging Cluster
global:
  external_labels:
    cluster: 'staging-sg'
    region: 'ap-southeast-1'

Missing or non-unique external_labels is the most common mistake when getting started with Thanos. If two clusters write to the same S3 bucket with identical labels, Query can’t tell their series apart: deduplication breaks, query results get doubled, and alerts fire incorrectly. Don’t skip this step.
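Before rolling out a new cluster, it can be worth checking that no two Prometheus instances share the same label set. The sketch below takes one "cluster/replica" pair per line and prints any pair that occurs more than once. The replica values are hypothetical and only there to illustrate the check.

```shell
# Reads one "cluster/replica" label set per line on stdin
# and prints every set that appears more than once
duplicate_label_sets() {
  sort | uniq -d
}

printf '%s\n' \
  'production-sg/0' \
  'production-sg/1' \
  'staging-sg/0' \
  'production-sg/0' | duplicate_label_sets
# prints: production-sg/0
```

An empty result means every instance is uniquely identified.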

Connecting Multiple Clusters to a Single Thanos Query

# Thanos Query connecting Sidecars from multiple clusters
thanos query \
  --http-address=0.0.0.0:9091 \
  --endpoint=prod-sidecar.monitoring.svc:10901 \
  --endpoint=staging-sidecar.monitoring.svc:10901 \
  --endpoint=thanos-store.monitoring.svc:10901 \
  --query.replica-label=replica

Before this setup, I had to SSH into each server to check anything. Now with a single Grafana dashboard pointing to Thanos Query, filtering by the cluster label shows everything — production and staging side by side in the same graph.
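The same per-cluster filtering works outside Grafana, because Thanos Query speaks the Prometheus HTTP API. The sketch below builds a selector for one cluster and sends it as an instant query. The function names are made up for this example, and the URL in the note assumes Query is reachable on port 9091 as in the command above.

```shell
# Builds a PromQL selector restricted to one value of the cluster label
by_cluster() {
  printf '%s{cluster="%s"}\n' "$1" "$2"
}

# Sends an instant query to a Prometheus-compatible /api/v1/query endpoint
instant_query() {
  curl -fsS --max-time 10 -G "${1%/}/api/v1/query" --data-urlencode "query=$2"
}

by_cluster up production-sg   # prints: up{cluster="production-sg"}
```

For example, `instant_query http://localhost:9091 "$(by_cluster up staging-sg)"` returns the up series for staging only.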

Configuring Long-Term Storage on AWS S3 for Production

# bucket-s3.yaml for production
type: S3
config:
  bucket: company-thanos-metrics
  endpoint: s3.ap-southeast-1.amazonaws.com
  region: ap-southeast-1
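  # On EC2/EKS, attach an IAM role and leave access_key/secret_key out.
  # Thanos does not expand ${VAR} placeholders in this file; when both keys
  # are absent, the S3 client falls back to the AWS_ACCESS_KEY_ID and
  # AWS_SECRET_ACCESS_KEY environment variables, then to the instance role.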

Adding Compactor to Reduce Storage Costs

thanos compact \
  --data-dir=/tmp/thanos-compact \
  --objstore.config-file=/bucket-s3.yaml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=1y \
  --wait

This configuration retains raw data for 30 days, 5-minute downsampled data for 90 days, and 1-hour downsampled data for 1 year. On a real system with ~500k active series, I measured roughly a 65% reduction in S3 costs after enabling Compactor with downsampling — compared to keeping 15-second resolution data for an entire year.

Practical Tips from Operational Experience

1. Monitor Thanos Itself with Prometheus

Each Thanos component exposes /metrics on its HTTP port (10902 by default; Query listens on 9091 in the Compose file above). Add them to your scrape config so you know when the Compactor gets stuck or the Store Gateway stops syncing new blocks, which is much better than waiting for a client to report an issue.

scrape_configs:
  - job_name: 'thanos-components'
    static_configs:
      - targets:
          - 'thanos-query:9091'   # matches --http-address in the Compose file
          - 'thanos-store:10902'
          - 'thanos-sidecar:10902'
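For a quick look without opening Prometheus, the /metrics output can also be read directly. The helper below is a minimal sketch that sums every series of one metric from Prometheus text-format input. It assumes lines carry no trailing timestamp, and the function name is made up for this example.

```shell
# Sums all series of one metric from Prometheus text-format input on stdin
metric_total() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)           # strip the label set, keep the name
      if (metric == name) total += $NF   # the value is the last field
    }
    END { print total + 0 }
  '
}
```

From a host that can reach the Sidecar, `curl -s http://thanos-sidecar:10902/metrics | metric_total thanos_objstore_bucket_operation_failures_total` prints the number of failed bucket operations since the process started.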

2. Shorten Prometheus Retention Once Thanos Is in Place

# Prometheus only needs to keep 2-7 days; Thanos handles the rest
prometheus \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.max-block-duration=2h \
  --storage.tsdb.min-block-duration=2h

3. Verify That Blocks Have Been Uploaded to S3

# Use thanos tools to inspect the bucket
thanos tools bucket inspect \
  --objstore.config-file=bucket.yaml \
  --output=table

4. Fixing “Duplicate Samples” Errors

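Seeing doubled data in Grafana? Pass --query.replica-label to Thanos Query, naming the external label that differs between replicas (replica in this article). The --query.auto-downsampling flag below is not part of deduplication; it lets Query read downsampled data for queries over long time ranges.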

thanos query \
  --query.replica-label=replica \
  --query.auto-downsampling \
  [other endpoints]

5. Alert When the Sidecar Loses Its S3 Connection

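# Alert rule for Thanos. Every component that talks to the bucket exports this
# metric, so add a label matcher (for example on job) to limit it to the Sidecar.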
groups:
  - name: thanos
    rules:
      - alert: ThanosSidecarBucketOperationsFailed
        expr: rate(thanos_objstore_bucket_operation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Sidecar failed to upload to Object Storage"

Thanos vs Cortex vs VictoriaMetrics — Which Should You Choose?

I get this question all the time. The answer depends on your scale and how much architectural change you’re willing to take on.

Thanos is the best fit if you already have Prometheus: add a Sidecar, set external labels and block durations, and your existing scrape configuration stays untouched. VictoriaMetrics handles 5–10x higher throughput than Prometheus on the same hardware and uses significantly less RAM, but requires a full migration. Cortex shines at massive scale (hundreds of millions of samples per second). Older Cortex releases needed Cassandra or DynamoDB for the index store; current releases keep blocks in object storage, but you still run and tune a set of separate microservices, so the operational overhead is substantial.

Running fewer than 10 clusters with under 1 million active series? Thanos is the most pragmatic choice. There’s no need to over-engineer it.