The Real Problem: ELK Stack Is Draining Your Team’s Budget
Last year, our team ran into a serious pain point with our monitoring stack. We were running ELK Stack (Elasticsearch + Logstash + Kibana) to collect logs from around 20 servers — and every month, storage costs kept climbing with no end in sight. Elasticsearch stores data using Lucene indexes, which means disk usage typically runs 3–5x higher than the actual raw log size.
And don’t even get me started on RAM. Elasticsearch needs at least 4GB of heap to run reliably in production, so a 3-node cluster means 12GB of RAM for Elasticsearch alone, before counting Logstash and Kibana. VPS bills grew every month while the team’s budget stayed flat.
Why ELK Is So Expensive for Observability
Elasticsearch was originally designed as a search engine. Repurposing it for observability drags in a lot of unnecessary overhead:
- Lucene’s inverted index is optimized for full-text search. For time-ordered logs that are mostly filtered by time range and a handful of fields, much of that index is dead weight
- Schema-on-write means every field needs a mapping at index time. Define mappings upfront, or dynamic mapping can easily cause field explosion when log formats are inconsistent
- The default of one replica per shard needs at least 2 nodes for HA and stores every index twice on disk
- JVM overhead — heap warm-up time and GC pauses become a constant headache under heavy queries
ELK Stack is powerful, but it was built to be a search engine, not a log aggregator. Using it for observability is like hiring a 10-ton truck to deliver a single box — it gets the job done, but the operating costs are completely out of proportion.
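To make the indexing overhead concrete, here is a toy sketch in Python. It is not how Lucene or Loki are implemented; the sample lines, host names, and helper functions are all made up for illustration. It only counts how many index entries a full-text index needs compared with an index over a single label.

```python
from collections import defaultdict

# Hypothetical sample lines; host names and messages are made up for illustration.
LOGS = [
    ("web-1", "GET /index.html 200"),
    ("web-1", "GET /missing 404"),
    ("web-2", "POST /login 500 ERROR upstream timeout"),
]

def full_text_index(logs):
    """Map every token in every message to the set of line ids containing it."""
    index = defaultdict(set)
    for line_id, (_host, message) in enumerate(logs):
        for token in message.lower().split():
            index[token].add(line_id)
    return index

def label_index(logs):
    """Map only the host label to line ids; message bodies are not indexed."""
    index = defaultdict(set)
    for line_id, (host, _message) in enumerate(logs):
        index[host].add(line_id)
    return index

def posting_count(index):
    """Total number of (key, line id) pairs the index has to store."""
    return sum(len(ids) for ids in index.values())

full = full_text_index(LOGS)
labels = label_index(LOGS)
print(posting_count(full), posting_count(labels))
```

Even on three lines, the full-text index stores four times as many entries as the label index, and the gap widens as messages get longer and more varied.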
Alternatives We Evaluated
Before settling on a solution, we benchmarked several alternatives:
- Grafana Loki: Much lighter since it only indexes metadata instead of full content. But LogQL has a learning curve, and search performance starts to degrade when data reaches tens of gigabytes without full indexing.
- Graylog: Still uses Elasticsearch under the hood — it just adds a nicer UI on top. The storage problem remains completely unsolved.
- VictoriaMetrics: Excellent for metrics, but at the time, native log support wasn’t mature enough for production use.
- OpenObserve: Handles logs, metrics, and traces in a single binary — with storage costs up to 140x lower than ELK according to their official benchmarks.
OpenObserve — What We’re Actually Running in Production
OpenObserve (formerly ZincObserve) is written in Rust and stores data in Parquet format with high compression. The key differences from ELK come down to a few things:
- A single binary under 10MB — no JVM, no Elasticsearch required
- S3-compatible storage support — store logs on Cloudflare R2 or MinIO at a fraction of the cost of traditional block storage
- Built-in UI for querying logs with SQL, metrics with PromQL, and traces via the OpenTelemetry standard
- Real-world RAM footprint of only around 100–150MB when idle — measured directly on our production server
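Parquet is a columnar format, and columnar formats compress logs well partly because low-cardinality fields such as log level or host repeat for long runs. The sketch below is only an illustration of that idea in plain Python, with hypothetical rows. It is not Parquet's actual encoder and says nothing about OpenObserve's internals.

```python
from itertools import groupby

# Hypothetical log rows: (timestamp, level, host).
ROWS = [
    (1700000000, "INFO", "web-1"),
    (1700000001, "INFO", "web-1"),
    (1700000002, "INFO", "web-1"),
    (1700000003, "ERROR", "web-1"),
    (1700000004, "INFO", "web-1"),
]

def to_columns(rows):
    """Transpose row-oriented records into one list per column."""
    return [list(column) for column in zip(*rows)]

def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

timestamps, levels, hosts = to_columns(ROWS)
print(run_length_encode(levels))
print(run_length_encode(hosts))
```

Once the values are grouped by column, the five host values collapse into a single pair, which is the kind of saving a row-oriented layout cannot get.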
Installing OpenObserve with Docker
The fastest way to test on a new server:
# Create the data directory
mkdir -p /opt/openobserve/data
# Run OpenObserve (admin@example.com is a placeholder; use your own email and password)
docker run -d \
  --name openobserve \
  --restart unless-stopped \
  -p 5080:5080 \
  -e ZO_ROOT_USER_EMAIL=admin@example.com \
  -e ZO_ROOT_USER_PASSWORD=StrongPass123! \
  -e ZO_DATA_DIR=/data \
  -v /opt/openobserve/data:/data \
  public.ecr.aws/zinclabs/openobserve:latest
Once it’s running, open your browser at http://your-server-ip:5080 and log in with the credentials you just set.
Docker Compose for Production Environments
version: '3.8'
services:
  openobserve:
    image: public.ecr.aws/zinclabs/openobserve:latest
    container_name: openobserve
    restart: unless-stopped
    ports:
      - "5080:5080"
    environment:
      # Placeholder email; use your own address and a strong password
      ZO_ROOT_USER_EMAIL: "admin@example.com"
      ZO_ROOT_USER_PASSWORD: "ChangeThisToSomethingStrong!"
      ZO_DATA_DIR: /data
      ZO_TELEMETRY: "false"
    volumes:
      - openobserve_data:/data
volumes:
  openobserve_data:
Start it with:
docker compose up -d
# Check startup logs
docker compose logs -f openobserve
Shipping Logs to OpenObserve with Fluent Bit
Fluent Bit is the lightest log collection agent I’ve ever used — only around 5MB of RAM, compared to Logstash which demands 500MB or more. Install it on Ubuntu/Debian:
curl https://raw.githubusercontent.com/fluent/fluent-bit/master/install.sh | sh
systemctl enable fluent-bit --now
Create a config file to forward logs to OpenObserve:
# /etc/fluent-bit/fluent-bit.conf
[SERVICE]
    Flush        5
    Log_Level    info

[INPUT]
    Name            tail
    Path            /var/log/syslog
    Tag             server.syslog
    Read_from_Head  False

[INPUT]
    Name            tail
    Path            /var/log/nginx/access.log
    Tag             nginx.access
    Read_from_Head  False

[OUTPUT]
    Name         http
    Match        *
    Host         your-openobserve-server
    Port         5080
    URI          /api/default/server_logs/_json
    Format       json
    # Placeholder email; use the root user you configured above
    Http_User    admin@example.com
    Http_Passwd  ChangeThisToSomethingStrong!
    compress     gzip
    tls          Off
Restart Fluent Bit to apply the config:
systemctl restart fluent-bit
# Verify logs are being shipped
journalctl -u fluent-bit -f
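If nothing shows up in the stream, it helps to send one record by hand and rule out Fluent Bit. Below is a minimal Python sketch of the same kind of request the [OUTPUT] section describes: a POST of a JSON array to the /api/default/server_logs/_json URI with HTTP basic auth. The helper name build_ingest_request, the host, and the credentials are placeholders I made up, and the function only builds the request without sending it.

```python
import base64
import json
import urllib.request

def build_ingest_request(host, port, org, stream, user, password, records):
    """Build (but do not send) a POST to the JSON ingestion URI used above."""
    url = f"http://{host}:{port}/api/{org}/{stream}/_json"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps(records).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )

# Placeholder host and credentials; replace with your own values.
request = build_ingest_request(
    host="your-openobserve-server",
    port=5080,
    org="default",
    stream="server_logs",
    user="admin@example.com",
    password="ChangeThisToSomethingStrong!",
    records=[{"log": "manual test record", "level": "INFO"}],
)
print(request.full_url)
```

To actually send it, pass the object to urllib.request.urlopen(request, timeout=10) from a machine that can reach the server. If that record appears in the UI but Fluent Bit's do not, the problem is on the agent side.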
Querying Logs with SQL in the UI
No need to learn LogQL or KQL — OpenObserve uses SQL syntax, so anyone familiar with relational databases can hit the ground running. Go to the Logs menu, select your stream, and run:
-- Find all ERRORs (set the UI time picker to the past hour; the query itself has no time filter)
SELECT * FROM "server_logs"
WHERE log LIKE '%ERROR%'
ORDER BY _timestamp DESC
LIMIT 100
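-- Count errors per hour to spot trends
-- histogram() is OpenObserve's time-bucketing function; _timestamp is stored as integer microseconds
SELECT
    histogram(_timestamp, '1 hour') AS hour,
    count(*) AS error_count
FROM "server_logs"
WHERE log LIKE '%ERROR%'
GROUP BY 1
ORDER BY 1 DESC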
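If you want to sanity-check what the hourly count query returns, the same aggregation is easy to reproduce on a small export. This is a rough Python equivalent over hypothetical in-memory records, assuming _timestamp holds microseconds since the Unix epoch. The function name hourly_error_counts is mine, not part of OpenObserve.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical records; _timestamp is assumed to be microseconds since the Unix epoch.
RECORDS = [
    {"_timestamp": 1_700_000_000_000_000, "log": "ERROR disk full"},
    {"_timestamp": 1_700_000_600_000_000, "log": "INFO heartbeat"},
    {"_timestamp": 1_700_001_000_000_000, "log": "ERROR upstream timeout"},
    {"_timestamp": 1_700_004_000_000_000, "log": "ERROR disk full"},
]

def hourly_error_counts(records, needle="ERROR"):
    """Count records whose log contains `needle`, bucketed by UTC hour."""
    counts = Counter()
    for record in records:
        if needle in record["log"]:
            seconds = record["_timestamp"] // 1_000_000
            moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
            hour = moment.replace(minute=0, second=0, microsecond=0)
            counts[hour] += 1
    # Newest hour first, like ORDER BY 1 DESC.
    return sorted(counts.items(), reverse=True)

for hour, error_count in hourly_error_counts(RECORDS):
    print(hour.isoformat(), error_count)
```

With the sample data this yields one error in the 23:00 UTC bucket and two in the 22:00 UTC bucket on 2023-11-14, and the INFO line is ignored.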
Configuring Alerts — Avoiding the Alert Fatigue Trap
This is where I spent the most time when first setting things up. Early on I set thresholds way too low — we were getting pings every 2–3 minutes until the whole DevOps team muted the chat group. Hard lesson learned: thresholds must be based on actual baselines from at least 1–2 weeks of historical logs, not gut feelings.
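One way to turn "actual baselines" into a number is to take a high percentile of historical error counts per evaluation window. The sketch below uses a nearest-rank percentile. The choice of percentile and the sample history are assumptions for illustration, not the exact method we used.

```python
def percentile_threshold(window_counts, percentile=99):
    """Nearest-rank percentile of historical per-window error counts.

    `percentile` is an integer from 1 to 100.
    """
    if not window_counts:
        raise ValueError("need at least one historical window")
    ordered = sorted(window_counts)
    # Integer ceiling of percentile/100 * n, avoiding float rounding issues.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical history: error counts per 5-minute window, including one incident.
history = [3, 5, 4, 6, 2, 7, 5, 4, 41, 6, 5, 3, 4, 8, 5, 6, 4, 5, 3, 6]
print(percentile_threshold(history, 95))
print(percentile_threshold(history, 99))
```

On this sample the 95th percentile is 8 while the 99th is the incident itself at 41, which is why a couple of weeks of data matter: with too little history, a single bad window dominates the result.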
In OpenObserve, go to Alerts → Create Alert. A sample configuration:
{
  "name": "High Error Rate",
  "stream_name": "server_logs",
  "query": "SELECT count(*) as error_count FROM server_logs WHERE log LIKE '%ERROR%'",
  "condition": {
    "column": "error_count",
    "operator": ">",
    "value": 50
  },
  "duration": 5,
  "frequency": 1,
  "time_between_alerts": 30
}
Two parameters matter most. Set time_between_alerts to at least 30 minutes to prevent notification spam. And duration: 5 means the condition must hold true continuously for 5 minutes before the alert triggers, which filters out transient spikes that would otherwise cause false positives.
Alerts can be delivered via Slack, Webhook, or email. Configure destinations under Alerts → Destinations.
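To see how duration and time_between_alerts interact, here is a small Python simulation of the behavior described above. It models my description of the two parameters, not OpenObserve's actual evaluation engine, and the function name and sample counts are hypothetical.

```python
def fired_minutes(error_counts, threshold=50, duration=5, time_between_alerts=30):
    """Return the minute indexes at which a notification would be sent.

    error_counts holds one error count per minute. A notification fires once
    the count has exceeded the threshold for `duration` consecutive minutes,
    and then no further notification is sent for `time_between_alerts` minutes.
    """
    fired = []
    streak = 0
    last_fired = None
    for minute, count in enumerate(error_counts):
        streak = streak + 1 if count > threshold else 0
        silenced = last_fired is not None and minute - last_fired < time_between_alerts
        if streak >= duration and not silenced:
            fired.append(minute)
            last_fired = minute
    return fired

# Hypothetical minute-by-minute counts: a 2-minute spike, then a 40-minute incident.
counts = [10] * 5 + [80] * 2 + [10] * 5 + [90] * 40
print(fired_minutes(counts))
```

The 2-minute spike never fires, and the 40-minute incident produces two notifications (at minutes 16 and 46) instead of one per minute.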
Real-World Results After 3 Months on OpenObserve
Same log volume from the same 20 servers — here’s what we measured:
- Disk usage: Down from 180GB/month (ELK) to 13GB/month (OpenObserve) — nearly 14x reduction
- RAM consumption: Down from 12GB (3-node ELK cluster) to 256MB (OpenObserve single node)
- VPS cost: Down from $80/month to $12/month for the same workload
- Setup time: 10 minutes with Docker, versus half a day to set up ELK properly
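For anyone who wants to check the ratios behind those numbers, the arithmetic is short. The two helper functions are just for illustration, and the RAM line assumes 1GB = 1024MB.

```python
def reduction_factor(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after

def percent_saved(before, after):
    """Relative saving as a percentage of the original value."""
    return (before - after) / before * 100

disk = reduction_factor(180, 13)          # GB per month
ram = reduction_factor(12 * 1024, 256)    # MB, taking 1GB = 1024MB
cost = percent_saved(80, 12)              # USD per month

print(f"disk: {disk:.1f}x, RAM: {ram:.0f}x, cost: {cost:.0f}% lower")
```

This prints disk: 13.8x, RAM: 48x, cost: 85% lower.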
OpenObserve isn’t a drop-in replacement for Elasticsearch in every scenario. If you need complex full-text search for application data, Elasticsearch is still the right tool. But for observability — collecting, storing, and analyzing infrastructure logs, metrics, and traces — this is something I wish I’d found two years earlier.
