Self-Hosting SigNoz with Docker: The Open-Source APM Alternative to Datadog – ITFROMZERO

The Problem: Three Tools, Three Dashboards, One Big Mess

My monitoring setup consisted of Prometheus + Grafana watching 15 servers. It helped catch issues before users reported them — but only solved part of the problem. Specifically, when an API endpoint started slowing down unexpectedly, the debug process looked like this:

Open Grafana to check if CPU/RAM was the issue
Open Jaeger to find distributed traces and see where the request went
SSH into the server and grep logs for errors

Three steps, three windows, three disconnected tools. I’ve burned 30–40 minutes debugging more times than I’d like to admit — not because the problem was complex, but because assembling the full picture from scattered pieces took forever.

Someone on the team suggested Datadog as the way out. One look at the pricing ($23/host/month × 15 servers = $345/month) and the answer was a hard no.

The Root Cause: Missing Observability, Not Missing Monitoring

Monitoring and observability are fundamentally different. Monitoring tells you what is happening (CPU at 90%). Observability tells you why — which request caused it, from which service, and what the corresponding error log says.

The three pillars of observability:

Metrics — time-series statistics (Prometheus handles this well)
Traces — the journey of a request through your services (Jaeger, Zipkin)
Logs — detailed event records (Graylog, ELK)

When these three live in separate systems, correlating them is manual work — and often takes longer than the actual bug fix. SigNoz hits exactly this pain point: it consolidates all three into a single platform, letting you click from a trace to its corresponding log in seconds.
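
To make "correlating by hand" concrete, here is a minimal sketch of the join a unified backend does for you: find the slow traces, then pull the logs that share their trace ID. The sample records and the function name `logs_for_slow_traces` are hypothetical, invented for illustration; real SigNoz records carry many more fields.

```python
from collections import defaultdict

# Hypothetical sample data, not real SigNoz output.
spans = [
    {"trace_id": "a1", "service": "api", "name": "GET /orders", "duration_ms": 1840},
    {"trace_id": "a1", "service": "db", "name": "SELECT orders", "duration_ms": 1790},
    {"trace_id": "b2", "service": "api", "name": "GET /health", "duration_ms": 3},
]
logs = [
    {"trace_id": "a1", "level": "ERROR", "message": "query timed out after 1500 ms"},
    {"trace_id": "b2", "level": "INFO", "message": "health check ok"},
]


def logs_for_slow_traces(spans, logs, threshold_ms):
    """Return {trace_id: [log messages]} for traces with a span slower than threshold_ms."""
    slow = {s["trace_id"] for s in spans if s["duration_ms"] > threshold_ms}
    by_trace = defaultdict(list)
    for entry in logs:
        if entry["trace_id"] in slow:
            by_trace[entry["trace_id"]].append(entry["message"])
    return dict(by_trace)


print(logs_for_slow_traces(spans, logs, threshold_ms=1000))
```

When traces and logs live in different tools, this join is what you end up doing by eye across browser tabs and terminal windows.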

The Alternatives and Why SigNoz Won

Before settling on SigNoz, I evaluated:

Grafana + Tempo + Loki: Powerful, but complex configuration and manual integration between components
Jaeger standalone: Traces only — no metrics or logs
Elastic APM: Heavy, memory-hungry, and complicated licensing for production use
SigNoz: OpenTelemetry-native, single Docker Compose setup, built-in UI, open source (MIT-licensed core, with enterprise features under a separate license)

The deciding factor: SigNoz uses ClickHouse as its backend storage instead of Elasticsearch. ClickHouse is a column-oriented database built for analytical queries, which suits time-series data well: aggregation queries over millions of trace rows run several times faster, with roughly 3–5x lower RAM consumption compared to an equivalent Elastic stack.

The Best Approach: Self-Host SigNoz with Docker Compose

System Requirements

RAM: 4GB minimum, 8GB recommended
CPU: 2 cores or more
Disk: 20GB+ (ClickHouse stores trace/log data)
Docker + Docker Compose already installed

Step 1: Clone the Repo and Run

# Clone SigNoz
git clone -b main https://github.com/SigNoz/signoz.git
cd signoz/deploy

# Start the full stack
docker compose -f docker/clickhouse-setup/docker-compose.yaml up -d

The first run will pull around 2–3GB of images. Once done, verify the containers:

docker compose -f docker/clickhouse-setup/docker-compose.yaml ps

# Expected output (all should be "Up"):
# signoz-frontend         Up
# signoz-query-service    Up
# signoz-otel-collector   Up
# clickhouse              Up
# zookeeper               Up

Access the UI at http://<server-ip>:3301. On first launch, you’ll be prompted to create an admin account.
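
If the page does not load, a quick check is whether the ports are reachable at all. The sketch below uses only the Python standard library; the helper name `port_open` and the choice of `127.0.0.1` are assumptions for illustration. Port 3301 is the UI port above, and 4317 is the OTLP gRPC port used in the next step.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = "127.0.0.1"  # assumption: the script runs on the SigNoz host itself
    for name, port in [("SigNoz UI", 3301), ("OTLP gRPC", 4317)]:
        state = "open" if port_open(host, port) else "closed"
        print(f"{name} ({port}): {state}")
```

An open port only shows that something is listening. It does not prove the service behind it is healthy, so treat this as a first filter before reading container logs.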

Step 2: Send Data to SigNoz via OpenTelemetry

SigNoz ingests data through the OpenTelemetry Collector, a component of OpenTelemetry, the vendor-neutral observability project hosted by the CNCF. SDKs are available for Python, Go, Java, Node.js, and more.

For a Python application (Flask/FastAPI):

# Install OpenTelemetry SDK
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install

# Run the app with auto-instrumentation (a self-hosted SigNoz needs no access token)
OTEL_EXPORTER_OTLP_ENDPOINT="http://<signoz-server>:4317" \
OTEL_RESOURCE_ATTRIBUTES="service.name=my-api-service" \
opentelemetry-instrument python app.py

For Node.js (Express):

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-grpc

// tracing.js — load this before the app starts
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://<signoz-server>:4317',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-node-service',
});
sdk.start();

node -r ./tracing.js server.js

Step 3: Migrate Prometheus Metrics to SigNoz (No Need to Ditch Prometheus)

This is the part I found most elegant: you don’t need to touch your existing Prometheus setup at all. The OpenTelemetry Collector bundled with SigNoz scrapes the same Node Exporter endpoints directly:

# Merge into otel-collector-config.yaml (keep the receivers and exporters the file already lists)
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-servers'
          static_configs:
            - targets:
              - 'server1:9100'   # node_exporter
              - 'server2:9100'
              - 'server3:9100'

exporters:
  clickhousemetricswrite:
    endpoint: tcp://clickhouse:9000

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [clickhousemetricswrite]

Restart the collector after changing the config:

docker compose -f docker/clickhouse-setup/docker-compose.yaml restart otel-collector
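
With 15 servers, typing the target list by hand invites typos. A small generator can render the scrape job as text to paste under `scrape_configs`. This is a sketch only: the function name `render_scrape_job` and the hostnames `server1` to `server15` are hypothetical, and the output is plain string formatting, not a validated config.

```python
def render_scrape_job(job_name, hosts, port=9100):
    """Render one scrape job as YAML text, one target per host (9100 = node_exporter)."""
    lines = [
        f"- job_name: '{job_name}'",
        "  static_configs:",
        "    - targets:",
    ]
    lines += [f"        - '{host}:{port}'" for host in hosts]
    return "\n".join(lines)


# Hypothetical hostnames; replace with your own inventory.
hosts = [f"server{i}" for i in range(1, 16)]
print(render_scrape_job("my-servers", hosts))
```

Indent the printed block to sit under `scrape_configs:` in the collector config, then restart the collector as shown above.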

Step 4: Send Logs to SigNoz

SigNoz receives logs via OTLP. Already using file logs or Docker container logs? The fastest approach is the OpenTelemetry Collector filelog receiver:

receivers:
  filelog:
    include: [/var/log/myapp/*.log]
    start_at: beginning

exporters:
  otlp:
    endpoint: <signoz-server>:4317
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]

Tips & Tricks from Real-World Use

1. Limit Retention to Save Disk Space

By default, SigNoz retains data for 15 days. For small teams, 7 days is plenty:

# Access the ClickHouse CLI
docker exec -it signoz-clickhouse clickhouse-client

-- Set 7-day TTL for traces
ALTER TABLE signoz_traces.signoz_index_v2
  MODIFY TTL toDateTime(timestamp) + INTERVAL 7 DAY;

-- Set 7-day TTL for logs (this table stores timestamp as nanoseconds, so convert to seconds first)
ALTER TABLE signoz_logs.logs
  MODIFY TTL toDateTime(timestamp / 1000000000) + INTERVAL 7 DAY;
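
To see what a shorter retention buys you, a back-of-the-envelope estimate is enough. The sketch below assumes a constant daily volume of data on disk and ignores how ClickHouse compression varies over time; the 2 GB/day figure is a hypothetical input, not a measurement from this setup.

```python
def retained_gb(daily_on_disk_gb, retention_days):
    """Rough steady-state disk usage: daily on-disk volume times days kept."""
    return daily_on_disk_gb * retention_days


daily = 2.0  # hypothetical: GB written to disk per day
for days in (15, 7):
    print(f"{days} days retention: about {retained_gb(daily, days):.0f} GB")
```

Under that assumption, going from 15 to 7 days cuts steady-state usage from about 30 GB to about 14 GB, which is the difference between outgrowing a 20GB disk and fitting on it.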

2. Set Resource Limits for ClickHouse

On a 4GB RAM server, you should cap it — otherwise ClickHouse will consume 2–2.5GB under heavy queries, crowding out everything else on the system:

# docker-compose.yaml (add under the existing clickhouse service)
services:
  clickhouse:
    deploy:
      resources:
        limits:
          memory: 2G
        reservations:
          memory: 1G

3. Alert on Rising Error Rates

SigNoz includes built-in alerting — nothing extra to install. To create an alert for API error rate > 5%:

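Go to Alerts → New Alert
Query A: rate of signoz_calls_total filtered to error spans (the status_code label value is typically STATUS_CODE_ERROR; confirm it in your instance)
Query B: rate of signoz_calls_total with no status filter
Condition: formula A / B > 0.05, evaluated over 5 minutes
Notification: Telegram webhook or Slack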
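
A 5% error rate is a ratio of failed calls to all calls in a window, not a raw errors-per-second figure. The sketch below shows that arithmetic on two counter readings taken at the start and end of a window. The function names and the sample numbers are hypothetical, and it assumes the counters did not reset during the window.

```python
def error_ratio(errors_start, errors_end, total_start, total_end):
    """Share of calls that failed in a window, from counter readings at its edges."""
    total = total_end - total_start
    if total <= 0:
        return 0.0  # no traffic in the window, so there is nothing to alert on
    return (errors_end - errors_start) / total


def should_alert(ratio, threshold=0.05):
    """True when the error ratio exceeds the threshold (5% by default)."""
    return ratio > threshold


# Hypothetical readings five minutes apart: 30 new errors out of 400 new calls.
ratio = error_ratio(errors_start=120, errors_end=150, total_start=10_000, total_end=10_400)
print(f"error ratio: {ratio:.1%}, alert: {should_alert(ratio)}")
```

Dividing by total calls keeps the alert meaningful at any traffic level: 30 errors is alarming at 400 requests and noise at 400,000.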

4. Sample Rate: 100% for Debugging, 10% for Production

Tracing every request is fine during development. In production, dropping to 10–20% reduces load and keeps disk usage in check:

# Development: trace 100%
OTEL_TRACES_SAMPLER=always_on opentelemetry-instrument python app.py

# Production: trace 10%
OTEL_TRACES_SAMPLER=parentbased_traceidratio \
OTEL_TRACES_SAMPLER_ARG=0.1 \
opentelemetry-instrument python app.py
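
For intuition on what a trace-ID ratio sampler does, here is a simplified sketch of the idea, not the SDK's actual code: the decision is a pure function of the trace ID, so every service that sees the same trace reaches the same verdict. The function name `should_sample` and the fixed random seed are assumptions made for the demonstration.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # compare only the low 64 bits of the 128-bit trace ID


def should_sample(trace_id, ratio):
    """Deterministic decision: keep the trace if its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


rng = random.Random(42)  # fixed seed so repeated runs see the same trace IDs
trace_ids = [rng.getrandbits(128) for _ in range(100_000)]
kept = sum(should_sample(t, 0.1) for t in trace_ids)
print(f"kept {kept / len(trace_ids):.1%} of {len(trace_ids)} traces")
```

The `parentbased_` prefix in the sampler name adds one rule on top: a span with a parent follows the parent's decision, so a trace is either recorded end to end or not at all.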

Results After Setup

The debugging workflow changed completely. Instead of juggling 3 tabs, it now goes like this:

Open Services → identify which service has high p99 latency
Click into the service → Traces → filter by status ERROR
Click a specific trace → see exactly which steps the request went through and where it slowed down
Click the Logs tab right inside the trace view → the logs matching that trace ID appear immediately

From detecting an issue to identifying the root cause: about 5 minutes, down from 30–40 minutes before.

Licensing cost: $0. RAM usage: roughly 3GB for the full SigNoz stack with 5 services sending traces. If you’re already running Prometheus + Grafana and find yourself missing integrated traces and logs, SigNoz is the natural next step. No need to rip anything out; just add a proper observability layer on top.