Getting Started with ClickHouse: A High-Performance OLAP Database for Large-Scale Analytics


I’ve worked with MySQL, PostgreSQL, and MongoDB across different projects — each has its strengths. But when our team needed to analyze 500 million access log rows per day, they all started to buckle. A GROUP BY query on PostgreSQL took 40 seconds; MySQL was even worse. Then I tried ClickHouse — that same query finished in 0.3 seconds. I never looked back.

Install and Run in 5 Minutes

Install ClickHouse on Ubuntu/Debian:

sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee /etc/apt/sources.list.d/clickhouse.list
sudo apt-get update
sudo apt-get install -y clickhouse-server clickhouse-client

Start the service and connect right away:

sudo systemctl start clickhouse-server
sudo systemctl enable clickhouse-server
clickhouse-client

Create a table and insert 1 million rows of dummy data:

CREATE TABLE access_logs (
    event_time DateTime,
    user_id UInt32,
    page String,
    status_code UInt16,
    response_ms UInt32
) ENGINE = MergeTree()
ORDER BY (event_time, user_id);

-- Insert 1 million test rows; rand(N) with distinct dummy arguments
-- keeps each column's random stream independent (identical rand()
-- calls would be collapsed by common-subexpression elimination)
INSERT INTO access_logs
SELECT
    now() - rand(1) % 86400,  -- random time within the last 24 hours
    rand(2) % 10000,
    arrayElement(['/home', '/api', '/login', '/product'], rand(3) % 4 + 1),
    arrayElement([200, 301, 404, 500], rand(4) % 4 + 1),
    rand(5) % 2000
FROM numbers(1000000);

Now run a test query:

SELECT page, count() AS hits, avg(response_ms) AS avg_ms
FROM access_logs
WHERE status_code = 200
GROUP BY page
ORDER BY hits DESC;

Results come back in under a second. That’s ClickHouse.

Why Is ClickHouse So Fast?

The secret is columnar storage. MySQL and PostgreSQL store data row by row — reading one row pulls in every column. ClickHouse does the opposite: it stores data column by column, so a query like SELECT avg(response_ms) only reads that specific column and doesn’t touch anything else. If your table has 20 columns but your query only uses 2 or 3, I/O drops by 6–10x immediately.

Three other factors compound the effect:

  • Vectorized query execution: Instead of processing rows one at a time, ClickHouse processes them in blocks of tens of thousands of rows (the default max_block_size is 65,536), taking advantage of CPU SIMD instructions — which is why it’s fast even when data is already in RAM
  • Data compression: Each column is compressed independently with an appropriate codec, typically shrinking data to 5–10x smaller than raw — less data to read means faster queries (you can check the actual ratios with the query after this list)
  • MergeTree engine: Automatically sorts and merges data in the background, making range scans by time or user_id extremely fast because data is already ordered
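
To see the first two points in action on the access_logs table from earlier, you can inspect per-column storage. Every column is stored and compressed separately, and the built-in system.columns table reports the sizes (a quick sketch; exact ratios depend on your data):

SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'access_logs'
ORDER BY data_compressed_bytes DESC;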

Choosing the Right Engine for Your Use Case

ClickHouse offers many table engines, but in 90% of cases you only need to know these three:

MergeTree — The Default Engine for Everything

CREATE TABLE events (
    date Date,
    user_id UInt64,
    action String,
    value Float64
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, user_id)
TTL date + INTERVAL 90 DAY;  -- Automatically delete data older than 90 days

PARTITION BY splits data by month — queries filtered by month only scan the relevant partition and skip everything else entirely. TTL is a feature I use constantly for log data: no more cron jobs to clean up old records; ClickHouse handles it automatically.
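
To verify the partitioning is doing what you expect, check how the active parts are laid out per partition via the built-in system.parts table (a quick sanity check, not required for normal operation):

SELECT
    partition,
    sum(rows) AS rows,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = currentDatabase() AND table = 'events' AND active
GROUP BY partition
ORDER BY partition;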

ReplacingMergeTree — When You Need Upserts

CREATE TABLE user_profiles (
    user_id UInt64,
    name String,
    email String,
    updated_at DateTime
) ENGINE = ReplacingMergeTree(updated_at)
ORDER BY user_id;

This engine keeps the record with the most recent updated_at when multiple rows share the same ORDER BY key. One important caveat: deduplication happens asynchronously during background merges, not immediately. So when querying, add FINAL to ensure accurate results:

SELECT * FROM user_profiles FINAL WHERE user_id = 12345;
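
A quick demonstration of the upsert pattern, with made-up values:

INSERT INTO user_profiles VALUES (12345, 'Alice', 'alice@old.example.com', '2025-01-01 10:00:00');
INSERT INTO user_profiles VALUES (12345, 'Alice', 'alice@new.example.com', '2025-06-01 10:00:00');

-- A plain SELECT may return both rows until a background merge runs;
-- FINAL always returns only the row with the latest updated_at
SELECT * FROM user_profiles FINAL WHERE user_id = 12345;

Keep in mind that FINAL does the deduplication work at query time, so it costs extra CPU on large scans.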

SummingMergeTree — Automatic Aggregation

CREATE TABLE daily_stats (
    date Date,
    page String,
    views UInt64,
    clicks UInt64
) ENGINE = SummingMergeTree()
ORDER BY (date, page);

On each merge, ClickHouse automatically sums views and clicks for rows sharing the same (date, page) key into a single row. Instead of keeping hundreds of small rows, you end up with one pre-aggregated row — dashboard queries become noticeably faster.
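
The same asynchronous-merge caveat applies here: until a merge runs, the raw rows coexist. The safe pattern is to aggregate in the query anyway, which returns correct totals either way (made-up numbers below):

INSERT INTO daily_stats VALUES
    ('2025-06-01', '/home', 100, 12),
    ('2025-06-01', '/home',  50,  3);

-- Correct whether or not the background merge has happened yet
SELECT date, page, sum(views) AS views, sum(clicks) AS clicks
FROM daily_stats
GROUP BY date, page;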

Importing Real Data from CSV and MySQL

Import from a CSV file:

clickhouse-client --query="INSERT INTO access_logs FORMAT CSVWithNames" < access_logs.csv
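
The reverse direction works too: clickhouse-client supports INTO OUTFILE, so you can dump query results straight to a CSV file (the file name here is just an example):

SELECT * FROM access_logs
INTO OUTFILE 'access_logs_export.csv'
FORMAT CSVWithNames;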

Pull data from MySQL into ClickHouse — I use this approach often when migrating or syncing analytical data:

-- Create a source table from MySQL
CREATE TABLE mysql_orders
ENGINE = MySQL('mysql-host:3306', 'mydb', 'orders', 'user', 'password');

-- Copy to a local ClickHouse table
INSERT INTO ch_orders SELECT * FROM mysql_orders WHERE created_at >= '2025-01-01';
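
For a one-off copy, the mysql() table function does the same job without leaving a permanent table behind (same connection parameters as above):

INSERT INTO ch_orders
SELECT * FROM mysql('mysql-host:3306', 'mydb', 'orders', 'user', 'password')
WHERE created_at >= '2025-01-01';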

Practical Tips for Production Deployment

Configure Memory Limits

By default, ClickHouse assumes it has the whole machine to itself and will use most of the available RAM. On shared servers, you’ll want to set limits:

<!-- /etc/clickhouse-server/users.d/limits.xml -->
<clickhouse>
    <profiles>
        <default>
            <max_memory_usage>8589934592</max_memory_usage> <!-- 8GB -->
            <max_execution_time>60</max_execution_time>    <!-- 60 seconds -->
        </default>
    </profiles>
</clickhouse>
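
These limits can also be overridden per query with a SETTINGS clause, which I find handy for one-off heavy reports (4 GB here is just an example value):

SELECT user_id, count() AS hits
FROM access_logs
GROUP BY user_id
SETTINGS max_memory_usage = 4000000000;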

Monitoring Slow Queries

-- View currently running queries
SELECT query_id, elapsed, query
FROM system.processes
ORDER BY elapsed DESC;

-- View query log (history)
SELECT query, query_duration_ms, read_rows, memory_usage
FROM system.query_log
WHERE event_date = today() AND query_duration_ms > 1000
ORDER BY query_duration_ms DESC
LIMIT 20;
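
And if a runaway query shows up in system.processes, you can kill it by its query_id (substitute the real id from the query above):

KILL QUERY WHERE query_id = 'query-id-from-system-processes';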

HTTP Interface — Handy for Scripts and Monitoring

# Query via HTTP (port 8123)
curl 'http://localhost:8123/?query=SELECT+count()+FROM+access_logs'

# Or from Python (first run: pip install clickhouse-connect)
import clickhouse_connect

client = clickhouse_connect.get_client(host='localhost', port=8123)
result = client.query('SELECT page, count() FROM access_logs GROUP BY page')
for row in result.result_rows:
    print(row)

When NOT to Use ClickHouse

This is something I always make a point of clarifying because it comes up often: ClickHouse is not an OLTP database. If your workload involves:

  • Frequent row-level updates or deletes
  • ACID transactions
  • Single-row lookups by primary key

Stick with PostgreSQL or MySQL. ClickHouse shines when you need to aggregate millions of rows fast — analytics dashboards, log analysis, time-series metrics, BI reports.

My current production stack: PostgreSQL for transactional data, ClickHouse for analytics — synced via Kafka or batch ETL every 5 minutes. Each tool does what it’s built for, and neither can replace the other.
