I’ve worked with MySQL, PostgreSQL, and MongoDB across different projects — each has its strengths. But when our team needed to analyze 500 million access log rows per day, they all started to buckle. A GROUP BY query on PostgreSQL took 40 seconds; MySQL was even worse. Then I tried ClickHouse — that same query finished in 0.3 seconds. I never looked back.
Install and Run in 5 Minutes
Install ClickHouse on Ubuntu/Debian:
sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client
Start the service and connect right away:
sudo systemctl start clickhouse-server
sudo systemctl enable clickhouse-server
clickhouse-client
Create a table and insert 1 million rows of dummy data:
CREATE TABLE access_logs (
event_time DateTime,
user_id UInt32,
page String,
status_code UInt16,
response_ms UInt32
) ENGINE = MergeTree()
ORDER BY (event_time, user_id);
-- Insert 1 million test rows
INSERT INTO access_logs
SELECT
now() - rand(1) % (86400 * 30), -- uniform spread over the last 30 days
rand(2) % 10000,
arrayElement(['/home', '/api', '/login', '/product'], rand(3) % 4 + 1),
arrayElement([200, 301, 404, 500], rand(4) % 4 + 1),
rand(5) % 2000
FROM numbers(1000000);
(The dummy arguments in rand(1), rand(2), … matter: ClickHouse eliminates common subexpressions, so five bare rand() calls in one row would all return the same value and every column would be perfectly correlated.)
Now run a test query:
SELECT page, count() AS hits, avg(response_ms) AS avg_ms
FROM access_logs
WHERE status_code = 200
GROUP BY page
ORDER BY hits DESC;
Results come back in under a second. That’s ClickHouse.
Why Is ClickHouse So Fast?
The secret is columnar storage. MySQL and PostgreSQL store data row by row — reading one row pulls in every column. ClickHouse does the opposite: it stores data column by column, so a query like SELECT avg(response_ms) only reads that specific column and doesn’t touch anything else. If your table has 20 columns but your query only uses 2 or 3, I/O drops by 6–10x immediately.
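To make the difference concrete, here’s a small Python sketch — a toy model using only the standard library, not ClickHouse code — that stores the same 1,000-row, 5-column table both row-wise and column-wise and counts the bytes a single-column aggregate has to touch:

```python
import struct

# Toy table: 1,000 rows x 5 columns of 4-byte integers
# (column shapes loosely mirror the access_logs example).
N_ROWS, N_COLS = 1000, 5
rows = [[i, i % 10000, i % 4, i % 5, i % 2000] for i in range(N_ROWS)]

# Row store: all columns of a row sit together, like a MySQL/PostgreSQL page.
row_store = b"".join(struct.pack("5i", *r) for r in rows)

# Column store: each column is one contiguous block, like ClickHouse.
col_store = [struct.pack(f"{N_ROWS}i", *(r[c] for r in rows))
             for c in range(N_COLS)]

# avg() over one column in a row store must read every byte of every row;
# in a column store it reads only that column's block.
bytes_row = len(row_store)    # 1000 rows * 5 cols * 4 bytes = 20000
bytes_col = len(col_store[4]) # 1000 rows * 4 bytes        = 4000
print(bytes_row, bytes_col)   # 20000 4000 — 5x less I/O with 1 of 5 columns
```

With 1 column out of 5 the toy model reads 5x fewer bytes; scale the same arithmetic to a 20-column table and 2–3 queried columns and you get the 6–10x figure above.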
Three other factors compound the effect:
- Vectorized query execution: Instead of processing rows one at a time, ClickHouse processes them in large blocks — tens of thousands of rows at once (the max_block_size setting, 65,536 by default) — taking advantage of CPU SIMD instructions, which is why it’s fast even when data is already in RAM
- Data compression: Each column is compressed independently with an appropriate codec, typically shrinking data to 5–10x smaller than raw — less data to read means faster queries
- MergeTree engine: Automatically sorts and merges data in the background, making range scans by time or user_id extremely fast because data is already ordered
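The compression point is easy to demonstrate with nothing but Python’s standard library. The sketch below is illustrative only — ClickHouse uses LZ4/ZSTD codecs, not zlib — but it compresses two synthetic columns shaped like the ones in access_logs and shows why per-column compression works so well on repetitive data:

```python
import random
import struct
import zlib

random.seed(0)
n = 100_000

# Two columns shaped like the access_logs example: status_code has only a
# handful of distinct values; response_ms is much noisier.
status = [random.choice([200, 200, 200, 301, 404, 500]) for _ in range(n)]
resp_ms = [random.randrange(2000) for _ in range(n)]

def compressed_ratio(values):
    """Raw size / compressed size for a column packed as 4-byte ints."""
    raw = struct.pack(f"{len(values)}I", *values)
    return len(raw) / len(zlib.compress(raw))

print(f"status_code column: {compressed_ratio(status):.1f}x smaller")
print(f"response_ms column: {compressed_ratio(resp_ms):.1f}x smaller")
```

The low-cardinality status column compresses far harder than the noisy one — exactly why compressing each column independently, with a codec suited to its distribution, beats compressing interleaved rows.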
Choosing the Right Engine for Your Use Case
ClickHouse offers many table engines, but in 90% of cases you only need to know these three:
MergeTree — The Default Engine for Everything
CREATE TABLE events (
date Date,
user_id UInt64,
action String,
value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id)
TTL date + INTERVAL 90 DAY; -- Automatically delete data older than 90 days
PARTITION BY splits data by month — queries filtered by month only scan the relevant partition and skip everything else entirely. TTL is a feature I use constantly for log data: no more cron jobs to clean up old records; ClickHouse handles it automatically.
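Partition pruning itself is simple enough to sketch in a few lines of Python. This toy model — my own illustration, not ClickHouse internals — keys row batches by toYYYYMM just like the PARTITION BY clause, then shows a month-filtered query skipping the other partitions wholesale:

```python
from datetime import date

def to_yyyymm(d):
    """Mirror of ClickHouse's toYYYYMM(): 2025-02-14 -> 202502."""
    return d.year * 100 + d.month

# Build three months of toy rows, grouped into partitions by month key.
partitions = {}
for month in (1, 2, 3):
    for day in range(1, 29):
        d = date(2025, month, day)
        partitions.setdefault(to_yyyymm(d), []).append((d, f"user_{day}"))

# WHERE date BETWEEN '2025-02-01' AND '2025-02-28' only needs partition
# 202502; the others are skipped without reading a single row.
needed = {to_yyyymm(date(2025, 2, 1))}
scanned = sum(len(r) for k, r in partitions.items() if k in needed)
skipped = sum(len(r) for k, r in partitions.items() if k not in needed)
print(scanned, skipped)  # 28 rows scanned, 56 skipped
```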
ReplacingMergeTree — When You Need Upserts
CREATE TABLE user_profiles (
user_id UInt64,
name String,
email String,
updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;
This engine keeps the record with the most recent updated_at when multiple rows share the same ORDER BY key. One important caveat: deduplication happens asynchronously during background merges, not immediately. So when you need exact results, add FINAL — just be aware it forces merge work at query time, so it costs performance:
SELECT * FROM user_profiles FINAL WHERE user_id = 12345;
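If the merge semantics feel abstract, here’s a toy Python version of what a background merge does for ReplacingMergeTree(updated_at): keep the row with the largest version column per ORDER BY key. A simplified sketch, not the real merge algorithm:

```python
# Rows as (user_id, name, email, updated_at) — user_id is the ORDER BY key,
# updated_at is the version column.
rows = [
    (12345, "Alice", "a@old.example", "2025-01-01 10:00:00"),
    (12345, "Alice", "a@new.example", "2025-03-01 09:30:00"),
    (67890, "Bob",   "b@example.com", "2025-02-15 12:00:00"),
]

def merge_replacing(rows):
    """Keep only the row with the max version per key, like a merge would."""
    latest = {}
    for row in rows:
        user_id, updated_at = row[0], row[3]
        if user_id not in latest or updated_at > latest[user_id][3]:
            latest[user_id] = row
    return sorted(latest.values())

for row in merge_replacing(rows):
    print(row)
# user 12345 survives only with the newer email; Bob is untouched
```

Until a merge (or FINAL) runs, both versions of user 12345 are physically present — which is exactly why a plain SELECT can return duplicates.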
SummingMergeTree — Automatic Aggregation
CREATE TABLE daily_stats (
date Date,
page String,
views UInt64,
clicks UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, page);
On each merge, ClickHouse automatically sums views and clicks for rows sharing the same (date, page) key into a single row. Instead of keeping hundreds of small rows, you end up with one pre-aggregated row — dashboard queries become noticeably faster.
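The same idea in miniature — a Python sketch (simplified, not the actual merge code) that collapses small rows sharing a (date, page) key the way a SummingMergeTree merge does:

```python
from collections import defaultdict

# Small rows accumulated before a merge: same (date, page) key repeats.
parts = [
    ("2025-06-01", "/home", 10, 1),
    ("2025-06-01", "/home", 25, 3),
    ("2025-06-01", "/api",   7, 0),
    ("2025-06-01", "/home",  5, 2),
]

def merge_summing(rows):
    """Sum the numeric columns of rows sharing the ORDER BY key."""
    totals = defaultdict(lambda: [0, 0])
    for day, page, views, clicks in rows:
        totals[(day, page)][0] += views
        totals[(day, page)][1] += clicks
    return {k: tuple(v) for k, v in sorted(totals.items())}

print(merge_summing(parts))
# ('2025-06-01', '/home') collapses to (40, 6); '/api' stays (7, 0)
```

Note the same caveat as ReplacingMergeTree applies: summation happens at merge time, so a query may still see not-yet-merged rows — which is why dashboard queries over these tables usually still GROUP BY the key and sum().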
Importing Real Data from CSV and MySQL
Import from a CSV file:
clickhouse-client --query="INSERT INTO access_logs FORMAT CSVWithNames" < access_logs.csv
Pull data from MySQL into ClickHouse — I use this approach often when migrating or syncing analytical data:
-- Create a source table from MySQL
CREATE TABLE mysql_orders
ENGINE = MySQL('mysql-host:3306', 'mydb', 'orders', 'user', 'password');
-- Copy to a local ClickHouse table
INSERT INTO ch_orders SELECT * FROM mysql_orders WHERE created_at >= '2025-01-01';
Practical Tips for Production Deployment
Configure Memory Limits
By default, ClickHouse uses RAM freely. On shared servers, you’ll want to set limits:
<!-- /etc/clickhouse-server/users.d/limits.xml -->
<clickhouse>
<profiles>
<default>
<max_memory_usage>8589934592</max_memory_usage> <!-- 8GB -->
<max_execution_time>60</max_execution_time> <!-- 60 seconds -->
</default>
</profiles>
</clickhouse>
Monitoring Slow Queries
-- View currently running queries
SELECT query_id, elapsed, query
FROM system.processes
ORDER BY elapsed DESC;
-- View query log (history)
SELECT query, query_duration_ms, read_rows, memory_usage
FROM system.query_log
WHERE event_date = today() AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 20;
HTTP Interface — Handy for Scripts and Monitoring
# Query via HTTP (port 8123)
curl 'http://localhost:8123/?query=SELECT+count()+FROM+access_logs'
# Or use with Python
pip install clickhouse-connect
import clickhouse_connect
client = clickhouse_connect.get_client(host='localhost', port=8123)
result = client.query('SELECT page, count() FROM access_logs GROUP BY page')
for row in result.result_rows:
    print(row)
When NOT to Use ClickHouse
This is something I always make a point of clarifying because it comes up often: ClickHouse is not an OLTP database. If your workload involves:
- Frequent row-level updates or deletes
- ACID transactions
- Single-row lookups by primary key
Stick with PostgreSQL or MySQL. ClickHouse shines when you need to aggregate millions of rows fast — analytics dashboards, log analysis, time-series metrics, BI reports.
My current production stack: PostgreSQL for transactional data, ClickHouse for analytics — synced via Kafka or batch ETL every 5 minutes. Each tool does what it’s built for, and neither can replace the other.

