When Do You Really Need Data Streaming?
When I first started out, I used to push everything directly into the database via REST APIs. Everything was fine until my project hit 50,000 concurrent users. Millions of logs, tracking events, and notifications flooded in every second, causing the server to struggle. Latency spiked from 200ms to 10s. The database kept throwing ‘Too many connections’ errors, and the system crashed completely just because of a surge in ‘Likes’.
The problem is: traditional databases aren’t designed for high-intensity, continuous write operations. We need a “black hole” capable of swallowing millions of messages per second, storing them safely, and then distributing them to processing services at their own pace. That’s where Apache Kafka shines.
Choosing Your Weapon: Kafka, RabbitMQ, or Redis Pub/Sub?
Before writing the first line of code, let’s weigh the trade-offs of today’s popular message broker solutions.
1. Redis Pub/Sub
Strengths: Extremely fast since it runs on RAM, near-zero latency.
Weaknesses: Works on a “fire and forget” mechanism. If a Consumer (receiver) loses connection right when a message arrives, that data evaporates forever. Redis isn’t suitable if you need data guarantees.
2. RabbitMQ (Message Queue)
Strengths: Excellent at managing complex routing to ensure messages reach the right destination.
Weaknesses: When the queue builds up to a few million unprocessed messages, performance starts to degrade significantly. It prioritizes message delivery over high-throughput data streams.
3. Apache Kafka (Event Streaming)
Strengths: Kafka writes data to disk as an append-only log, making it incredibly durable. You can easily “replay” data from 3 days ago for debugging. Regarding scalability, giants like Netflix and Uber use Kafka to process trillions of events daily.
Weaknesses: A steeper learning curve and higher operational resource requirements compared to Redis.
Advice: If your system requires absolute reliability, data playback capabilities, and unlimited scalability, go with Kafka.
Quick Deployment with Docker
Installing Kafka manually can be frustrating due to its Java and ZooKeeper dependencies. Using Docker is the fastest way to get started: you can have a working development environment in under 30 seconds.
Create a `docker-compose.yml` file:
```yaml
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```
Run `docker-compose up -d`. Now, your computer has become a real Kafka broker.
Coding Producer and Consumer with Node.js
In the Node.js ecosystem, the kafkajs library is the top contender. It’s lightweight, written entirely in JavaScript, and offers impressive performance.
Step 1: Project Setup
```shell
mkdir kafka-demo && cd kafka-demo
npm init -y
npm install kafkajs
```
Step 2: Write the Producer
A Producer is like a reporter sending news to a newsroom. Here, I’m simulating sending a successful order event.
```javascript
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'order-service', brokers: ['localhost:9092'] });
const producer = kafka.producer();

(async () => {
  await producer.connect();
  console.log('✅ Producer is ready');

  // Publish a successful-order event to the 'orders' topic
  await producer.send({
    topic: 'orders',
    messages: [
      { value: JSON.stringify({ id: 1, item: 'MacBook M3', price: 2500 }) },
    ],
  });

  await producer.disconnect();
})();
```
Step 3: Write the Consumer
The Consumer stands by to process messages as soon as they appear in the Topic.
```javascript
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'inventory-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'inventory-group' });

(async () => {
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders', fromBeginning: true });

  // Process each message as soon as it lands in the topic
  await consumer.run({
    eachMessage: async ({ message }) => {
      const order = JSON.parse(message.value.toString());
      console.log(`📦 Processing inventory for order: ${order.id}`);
    },
  });
})();
```
3 Hard-Won Lessons from Real-World Deployment
After managing systems processing over 500GB of data daily, I realized that getting the code to run is only 30% of the journey. The other 70% lies in operational optimization.
1. Never Use a Single Partition in Production
Kafka parallelizes processing through partitions. If a topic has only one partition, only one consumer in a group can read from it. Even if you scale out to 10 consumer instances, 9 will sit idle. Start with at least 3 or 6 partitions to make scaling out easier later.
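The constraint can be sketched as a toy model of consumer-group assignment. Note that `assignPartitions` is a hypothetical helper for illustration only, not a kafkajs API; real assignment is done by the broker's group coordinator:

```javascript
// Toy model of Kafka's consumer-group assignment: each partition is owned
// by exactly one consumer in a group, so only min(partitions, consumers)
// consumers ever receive work.
function assignPartitions(numPartitions, consumers) {
  const assignment = Object.fromEntries(consumers.map((c) => [c, []]));
  for (let p = 0; p < numPartitions; p++) {
    assignment[consumers[p % consumers.length]].push(p);
  }
  return assignment;
}

// 1 partition, 3 consumers: two consumers sit idle
console.log(assignPartitions(1, ['c1', 'c2', 'c3']));

// 6 partitions, 3 consumers: everyone gets work
console.log(assignPartitions(6, ['c1', 'c2', 'c3']));
```

With 1 partition, `c2` and `c3` get nothing; with 6 partitions, each consumer owns 2.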
2. Handle Idempotency
In distributed systems, a Consumer receiving a duplicate message is inevitable (due to network timeouts or rebalancing).
Solution: Always check the state in the database before processing. For example: If this order_id has already been deducted from inventory, skip it and don’t deduct it again.
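A minimal sketch of that check, using an in-memory `Set` purely for illustration. In a real service the lookup belongs in your database (e.g. a unique constraint or status column on `order_id`), and `handleOrder` is a hypothetical name:

```javascript
// Idempotent handler: remember which order IDs were already processed
// and skip redeliveries, so inventory is never deducted twice.
const processedOrders = new Set();

function handleOrder(order) {
  if (processedOrders.has(order.id)) {
    return 'skipped'; // duplicate delivery: inventory already deducted
  }
  processedOrders.add(order.id);
  // ... deduct inventory exactly once here ...
  return 'processed';
}

console.log(handleOrder({ id: 42 })); // → 'processed' (first delivery)
console.log(handleOrder({ id: 42 })); // → 'skipped' (redelivery after a rebalance)
```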
3. Always Monitor Consumer Lag
Consumer lag tells you how many messages your consumers are behind real-time data. If this number reaches hundreds of thousands, your processing code is too slow or the incoming volume is too high. Don’t wait until the retention policy deletes unread data (or the disk fills up) to discover that a consumer stopped running 2 days ago.
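The arithmetic behind lag is simple: per partition, it is the broker’s latest offset minus the group’s committed offset. In kafkajs the two snapshots would come from the admin client (`fetchTopicOffsets` and `fetchOffsets`); the object shapes below are simplified for illustration:

```javascript
// Consumer lag per partition = latest broker offset - committed group offset.
// kafkajs returns offsets as strings, hence the Number() conversions.
function computeLag(topicOffsets, groupOffsets) {
  return topicOffsets.map(({ partition, offset }) => {
    const committed = groupOffsets.find((g) => g.partition === partition);
    const lag = Number(offset) - Number(committed ? committed.offset : 0);
    return { partition, lag };
  });
}

// Example snapshots (shapes simplified from what kafkajs returns):
const latest = [{ partition: 0, offset: '100' }, { partition: 1, offset: '80' }];
const committed = [{ partition: 0, offset: '90' }, { partition: 1, offset: '80' }];
console.log(computeLag(latest, committed));
// → [ { partition: 0, lag: 10 }, { partition: 1, lag: 0 } ]
```

Wire this into a periodic health check and alert well before lag approaches your retention window.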
Conclusion
Kafka isn’t just a technology; it changes how we think about software: moving from request-response to event-driven architecture. Node.js, with its non-blocking I/O, is a natural fit for Kafka consumers.
If you’re building a small app, don’t overcomplicate things. But if your goal is to serve millions of users with low latency, Kafka is the key you need to master.

