What Problems Does Thanos Solve That Prometheus Can’t?
I used to run Prometheus across 3 separate Kubernetes clusters. Each cluster had its own Grafana, its own alerts. When a client asked “what were the metrics for service X on production last month?” — I had to open 3 tabs, dig through each cluster, and copy-paste numbers into Excel. Twenty minutes wasted on a simple question.
Prometheus retains only 15 days of data by default and has no built-in global view across clusters. Thanos was built to solve exactly these two pain points.
Quick Start: Get Thanos Running in 10 Minutes
The fastest way to try it out is with Docker Compose using MinIO as a local S3-compatible object storage. No AWS account needed, no prerequisites.
Step 1: Create the Bucket Configuration File
mkdir -p ~/thanos-demo && cd ~/thanos-demo
cat > bucket.yaml <<EOF
type: S3
config:
  bucket: thanos
  endpoint: minio:9000
  access_key: minioadmin
  secret_key: minioadmin
  insecure: true
EOF
Step 2: Docker Compose for the Full Stack
# docker-compose.yml
version: '3.8'
services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    ports:
      - "9000:9000"
      - "9001:9001"
    volumes:
      - minio_data:/data
  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.max-block-duration=2h'
      - '--storage.tsdb.min-block-duration=2h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
  thanos-sidecar:
    image: thanosio/thanos:latest
    command:
      - sidecar
      - --tsdb.path=/prometheus
      - --prometheus.url=http://prometheus:9090
      - --objstore.config-file=/bucket.yaml
      - --http-address=0.0.0.0:10902
      - --grpc-address=0.0.0.0:10901
    volumes:
      - prometheus_data:/prometheus
      - ./bucket.yaml:/bucket.yaml
    depends_on:
      - prometheus
      - minio
  thanos-store:
    image: thanosio/thanos:latest
    command:
      - store
      - --objstore.config-file=/bucket.yaml
      - --http-address=0.0.0.0:10902
      - --grpc-address=0.0.0.0:10901
    volumes:
      - ./bucket.yaml:/bucket.yaml
    depends_on:
      - minio
  thanos-query:
    image: thanosio/thanos:latest
    command:
      - query
      - --http-address=0.0.0.0:9091
      - --endpoint=thanos-sidecar:10901
      - --endpoint=thanos-store:10901
    ports:
      - "9091:9091"
    depends_on:
      - thanos-sidecar
      - thanos-store
volumes:
  minio_data:
  prometheus_data:
Step 3: Create a Minimal prometheus.yml
# prometheus.yml
global:
  scrape_interval: 15s
  external_labels:
    cluster: 'demo'
    replica: '0'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
Step 4: Start the Stack and Create the MinIO Bucket
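# Start MinIO first so the bucket exists before the Thanos components look for it
docker compose up -d minio
# Create the 'thanos' bucket with the mc client. The alias is passed via MC_HOST_local
# because each 'docker run --rm' starts a fresh container that has no saved aliases.
docker run --rm --network host \
  -e MC_HOST_local=http://minioadmin:minioadmin@localhost:9000 \
  minio/mc mb local/thanos
# Start the rest of the stack
docker compose up -d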
That’s it. Open http://localhost:9091 to reach the Thanos Query UI. Any PromQL query you run here transparently pulls data from both the live Prometheus instance and the historical blocks stored in MinIO. Expect the first block to reach MinIO only after Prometheus closes it, roughly two hours after startup.
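To check the stack from a terminal rather than the UI, a couple of small helpers around curl can be enough. This is a minimal sketch that assumes the Quick Start port mapping (Query on localhost:9091); the function names and the THANOS_QUERY_URL variable are made up for illustration.

```shell
# Base URL of Thanos Query; the default matches the Quick Start port mapping.
THANOS_QUERY_URL="${THANOS_QUERY_URL:-http://localhost:9091}"

# List the StoreAPI endpoints Query is connected to (Sidecar, Store Gateway).
list_stores() {
  curl -fsS "$THANOS_QUERY_URL/api/v1/stores"
}

# Run an instant PromQL query. The optional second argument ("true" or
# "false") toggles deduplication so the two results can be compared.
instant_query() {
  curl -fsS -G "$THANOS_QUERY_URL/api/v1/query" \
    --data-urlencode "query=$1" \
    --data-urlencode "dedup=${2:-true}"
}
```

Calling list_stores should show both the Sidecar and the Store Gateway, and instant_query up should return the Prometheus self-scrape target with the cluster and replica labels attached.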
Thanos Architecture: What Does Each Component Do?
Thanos doesn’t replace Prometheus — it extends it. I usually explain it to colleagues like this: Prometheus is a small local warehouse, and Thanos is the central warehouse system that connects everything together.
Sidecar — The Bridge Between Prometheus and Object Storage
Runs in the same pod as Prometheus. It has two responsibilities: exposing a gRPC endpoint so Thanos Query can pull real-time data, and uploading 2-hour data blocks to S3/MinIO.
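Important note: Set both --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to 2h on Prometheus. This disables Prometheus’s local compaction, so the Sidecar uploads each 2-hour block unmodified once it is closed, and compaction is left to the Thanos Compactor. The Sidecar won’t touch a block that is still being written.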
Store Gateway — The Door to Historical Storage
Need data older than what Prometheus keeps locally (15 days by default)? Query asks the Store Gateway. This component fetches the requested data from the blocks in S3 on demand and returns the results. The blocks themselves aren’t stored locally; only a small amount of index data is cached for fast lookups.
Query — The Single Query Endpoint
Both Grafana and users point here. Thanos Query fans each request out to the endpoints that can hold matching data: the Sidecars for recent data, the Store Gateway for older data. If multiple Prometheus replicas are scraping the same target, it merges the duplicate series into one, provided you tell it which external label identifies a replica via --query.replica-label.
Compactor — Cleanup and Cost Savings
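Runs independently of the query path, either continuously or as a scheduled job. Its job: merging small blocks into larger ones, downsampling older data to 5-minute and 1-hour resolution, and deleting data past its retention period. Data from a year ago doesn’t need 15-second granularity; 1-hour resolution is more than enough for trend analysis. Run only one Compactor per bucket.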
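To get a feel for what downsampling changes, here is a rough back-of-the-envelope calculation in shell. It assumes a 15-second scrape interval and counts samples per series only; downsampled blocks store several aggregates per point and sit alongside the raw blocks until retention removes them, so the ratio overstates the savings in bytes. The function names are illustrative.

```shell
# Samples one series produces per day at a given interval (in seconds).
samples_per_day() {
  echo $(( 86400 / $1 ))
}

# Samples one series produces over a number of days at a given interval.
samples_over_days() {
  echo $(( $(samples_per_day "$1") * $2 ))
}
```

For example, samples_per_day 15 gives 5760, samples_per_day 300 gives 288, and samples_per_day 3600 gives 24. Over a year, samples_over_days 15 365 is 2102400 points per series at raw resolution versus 8760 at 1-hour resolution, which is why long-range dashboards get much faster on downsampled data.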
Real-World Kubernetes Deployment Across Multiple Clusters
This is the setup I use most in practice. Each cluster runs its own Prometheus and Sidecar, all pushing to the same S3 bucket. Thanos Query sits in a central cluster and connects to all Sidecars over gRPC. If you haven’t already set up Prometheus monitoring on your Kubernetes clusters, that’s a good prerequisite before layering Thanos on top.
External Labels — Non-Negotiable
# Production Cluster
global:
  external_labels:
    cluster: 'production-sg'
    region: 'ap-southeast-1'
---
# Staging Cluster
global:
  external_labels:
    cluster: 'staging-sg'
    region: 'ap-southeast-1'
Missing external_labels is the most common mistake when getting started with Thanos. If data from two clusters flows into the same S3 bucket with no distinguishing labels, Thanos can’t tell the sources apart: deduplication breaks, query results get doubled, and alerts fire incorrectly. Don’t skip this step.
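A quick pre-deploy check can catch a missing label before it reaches the bucket. The following is a crude, line-based sketch, not a YAML parser: it assumes a space-indented prometheus.yml with one "key: value" pair per line, and has_external_label is a hypothetical helper name.

```shell
# Succeeds if FILE declares LABEL inside its external_labels: block.
# Usage: has_external_label FILE LABEL
has_external_label() {
  awk -v label="$2" '
    { pos = match($0, /[^ ]/) }            # column of first non-space char
    pos == 0 { next }                      # skip blank lines
    $1 == "external_labels:" { in_block = 1; base = pos; next }
    in_block && pos <= base { in_block = 0 }   # dedent ends the block
    in_block && $1 == (label ":") { found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}
```

In a CI step this could be used as: has_external_label prometheus.yml cluster || echo "prometheus.yml is missing the cluster external label".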
Connecting Multiple Clusters to a Single Thanos Query
# Thanos Query connecting Sidecars from multiple clusters
thanos query \
  --http-address=0.0.0.0:9091 \
  --endpoint=prod-sidecar.monitoring.svc:10901 \
  --endpoint=staging-sidecar.monitoring.svc:10901 \
  --endpoint=thanos-store.monitoring.svc:10901 \
  --query.replica-label=replica
Before this setup, I had to SSH into each server to check anything. Now with a single Grafana dashboard pointing to Thanos Query, filtering by the cluster label shows everything — production and staging side by side in the same graph.
Configuring Long-Term Storage on AWS S3 for Production
# bucket-s3.yaml for production
type: S3
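config:
  bucket: company-thanos-metrics
  endpoint: s3.ap-southeast-1.amazonaws.com
  region: ap-southeast-1
  # On EC2/EKS, leave access_key and secret_key unset so Thanos falls back to
  # the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables or the IAM role.
  # Thanos does not expand ${VAR} placeholders inside this file.
  # access_key: <key>
  # secret_key: <secret>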
Adding Compactor to Reduce Storage Costs
thanos compact \
  --data-dir=/tmp/thanos-compact \
  --objstore.config-file=/bucket-s3.yaml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=90d \
  --retention.resolution-1h=1y \
  --wait
This configuration retains raw data for 30 days, 5-minute downsampled data for 90 days, and 1-hour downsampled data for 1 year. On a real system with ~500k active series, I measured roughly a 65% reduction in S3 costs after enabling Compactor with downsampling — compared to keeping 15-second resolution data for an entire year.
Practical Tips from Operational Experience
1. Monitor Thanos Itself with Prometheus
Thanos components expose /metrics on their HTTP port (10902 by default; the Query in this article listens on 9091 because of its --http-address flag). Add them to your scrape config so you know when the Compactor gets stuck or the Store Gateway stops syncing new blocks, which is much better than waiting for a client to report an issue.
scrape_configs:
  - job_name: 'thanos-components'
    static_configs:
      - targets:
          - 'thanos-query:9091'
          - 'thanos-store:10902'
          - 'thanos-sidecar:10902'
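Before the scrape targets show up in Prometheus, the same HTTP ports can be probed by hand. Thanos components serve a readiness endpoint at /-/ready; the helper below is a small sketch around it, with check_ready being an illustrative name and the host:port pairs being whatever is reachable from where you run it.

```shell
# Probe the readiness endpoint of each host:port argument and print one
# status line per target.
check_ready() {
  for target in "$@"; do
    if curl -fsS -o /dev/null --max-time 5 "http://$target/-/ready"; then
      echo "$target ready"
    else
      echo "$target NOT ready"
    fi
  done
}
```

From the Docker host in the Quick Start only the Query port is published, so check_ready localhost:9091 is the call that works there; inside the Compose network or a cluster you can pass every component’s HTTP address.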
2. Shorten Prometheus Retention Once Thanos Is in Place
# Prometheus only needs to keep 2-7 days; Thanos handles the rest
prometheus \
  --storage.tsdb.retention.time=7d \
  --storage.tsdb.max-block-duration=2h \
  --storage.tsdb.min-block-duration=2h
3. Verify That Blocks Have Been Uploaded to S3
# Use thanos tools to inspect the bucket
thanos tools bucket inspect \
  --objstore.config-file=bucket.yaml \
  --output=table
4. Fixing “Duplicate Samples” Errors
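Seeing doubled data in Grafana? Tell Thanos Query which external label identifies a replica so it can merge the duplicate series. The --query.auto-downsampling flag shown alongside it is optional and unrelated to deduplication; it lets Query use downsampled data for long time ranges.
thanos query \
  --query.replica-label=replica \
  --query.auto-downsampling \
  [other endpoints]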
5. Alert When the Sidecar Loses Its S3 Connection
# Alert rule for Thanos
groups:
  - name: thanos
    rules:
      - alert: ThanosSidecarBucketOperationsFailed
        expr: rate(thanos_objstore_bucket_operation_failures_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Thanos Sidecar failed to upload to Object Storage"
Thanos vs Cortex vs VictoriaMetrics — Which Should You Choose?
I get this question all the time. The answer depends on your scale and how much architectural change you’re willing to take on.
Thanos is the best fit if you already have Prometheus: add a Sidecar and your existing scrape configuration stays untouched. VictoriaMetrics is reported to handle 5–10x higher throughput than Prometheus on the same hardware with significantly less RAM, but it requires a full migration. Cortex shines at massive, multi-tenant scale. Older Cortex versions needed Cassandra or DynamoDB for the index, while current versions store blocks in object storage only, but you still have to operate a set of microservices, so the operational overhead is substantial.
Running fewer than 10 clusters with under 1 million active series? Thanos is the most pragmatic choice. There’s no need to over-engineer it.

