Real Time Data Processing Guide | Pillai Infotech LLP

In This Guide

1. Why Real-Time? Batch vs Stream Processing
2. Apache Kafka — The Event Backbone
3. Stream Processing Engines — Flink, Spark, ksqlDB
4. Stream Processing Patterns
5. Exactly-Once Semantics — The Hardest Problem
6. Reference Architectures
7. Operational Considerations
8. Frequently Asked Questions

When a customer places an order, the fraud detection system can't wait for tonight's batch job. When a sensor reading spikes, the alert can't come tomorrow morning. Real-time data processing has gone from a nice-to-have to a requirement for any system where latency matters — and that's most systems in 2026.

This guide covers the full stack: message brokers (Kafka), stream processors (Flink), processing patterns, delivery guarantees, and battle-tested architectures. We'll skip the theory-heavy parts and focus on what you need to build and run these systems.

1. Why Real-Time? Batch vs Stream Processing

Factor	Batch Processing	Stream Processing
Latency	Minutes to hours	Milliseconds to seconds
Data model	Bounded datasets (files, tables)	Unbounded event streams
Processing	Run periodically (cron, Airflow)	Runs continuously
State	Stateless (recompute from source)	Stateful (maintains running state)
Complexity	Low — familiar SQL/Python	High — windowing, ordering, late data
Cost	Pay for compute when running	Pay for always-on infrastructure
Best for	Reporting, ML training, ETL	Fraud detection, monitoring, real-time dashboards

Don't Force Real-Time: If your stakeholders check a dashboard once a day, a nightly batch job is the right answer — simpler, cheaper, more reliable. Stream processing shines when the business value degrades with latency: fraud that's caught in 100ms vs 1 hour, recommendations that reflect a click from 2 seconds ago vs yesterday.

2. Apache Kafka — The Event Backbone

Kafka is a distributed event streaming platform. It's the backbone of most real-time architectures — not because it processes data, but because it durably stores and distributes events between producers and consumers.

Kafka Architecture — Key Concepts

Producer (Order Service)
    │
    ▼
┌────────────────────────────────────────┐
│  Topic: "orders"                       │
│  ┌───────────┬───────────┬───────────┐ │
│  │Partition 0│Partition 1│Partition 2│ │
│  │ msg1,msg4 │ msg2,msg5 │ msg3,msg6 │ │
│  └───────────┴───────────┴───────────┘ │
│  Retention: 7 days (configurable)      │
└────────────────────────────────────────┘
    │              │              │
    ▼              ▼              ▼
Consumer Group A        Consumer Group B
(Fraud Detection)       (Analytics Pipeline)
 ├─ Consumer 1 ←─ P0    ├─ Consumer 1 ←─ P0, P1
 ├─ Consumer 2 ←─ P1    └─ Consumer 2 ←─ P2
 └─ Consumer 3 ←─ P2

Key insight: Each consumer group tracks its own offset independently.
Group A can be real-time; Group B can be behind by hours — both are fine.
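
To make the two ideas in the diagram concrete (key-based partitioning and independent per-group offsets), here is a minimal in-memory sketch. It is a toy model, not Kafka: the `Topic` class, its methods, and the group names are all hypothetical, and `zlib.crc32` merely stands in for a real partitioner hash (Kafka's Java client uses murmur2 for keyed records).

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy in-memory topic: N append-only partitions plus per-group offsets."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # offsets[group][partition] -> index of the next record that group reads
        self.offsets = defaultdict(lambda: defaultdict(int))

    def partition_for(self, key):
        # Stable hash, so the same key always lands on the same partition
        return zlib.crc32(key.encode()) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, group, partition, max_records=10):
        # Reading advances only this group's offset; records stay in the log
        start = self.offsets[group][partition]
        records = self.partitions[partition][start:start + max_records]
        self.offsets[group][partition] = start + len(records)
        return records


topic = Topic(num_partitions=3)
for i in range(6):
    topic.produce("customer-789", f"order-{i}")

p = topic.partition_for("customer-789")
fraud = topic.poll("fraud-detection", p, max_records=6)  # fully caught up
analytics = topic.poll("analytics", p, max_records=2)    # lagging behind
```

Both groups read the same partition in the same order, but each advances its own offset, so a slow analytics consumer never holds back fraud detection.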

Kafka Producer — Java Example

// Producer with idempotent writes (prevents duplicates caused by producer retries)
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092");
props.put("enable.idempotence", "true");       // De-duplicates retries; not end-to-end exactly-once on its own
props.put("acks", "all");                       // Wait for all in-sync replicas
props.put("key.serializer", StringSerializer.class);
// JsonSerializer is not part of kafka-clients; supply one (e.g. Spring Kafka's or your own)
props.put("value.serializer", JsonSerializer.class);

KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);

// Key determines partition: the same customer_id always goes to the same partition
// This guarantees ordering per customer
OrderEvent event = new OrderEvent("ord-12345", "customer-789", "PLACED", items);
producer.send(new ProducerRecord<>("orders", event.customerId(), event),
    (metadata, exception) -> {
        if (exception != null) log.error("Failed to send", exception);
        else log.info("Sent to partition {} offset {}", metadata.partition(), metadata.offset());
    });
producer.flush();  // send() is asynchronous; flush (or close) before shutting down

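# Python equivalent (confluent-kafka)
import json
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'kafka-1:9092',
                     'enable.idempotence': True})
# order_event: a JSON-serializable dict describing the order
producer.produce('orders', key='customer-789',
                 value=json.dumps(order_event).encode())
producer.flush()  # produce() is asynchronous; block until queued messages are delivered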

Kafka Configuration	Low Latency	High Throughput	High Durability
acks	1	0 or 1	all
batch.size	1 (no batching)	64KB - 1MB	16KB
linger.ms	0	50-200	5
compression.type	none	lz4 or zstd	zstd
replication.factor (topic-level setting)	2	2	3

Kafka Alternatives

Platform	Best For	Advantage Over Kafka
Apache Pulsar	Multi-tenancy, geo-replication	Separate storage/compute, native multi-tenancy
Amazon Kinesis	AWS-native, serverless	Zero ops, auto-scaling
Redpanda	Kafka API-compatible, lower latency	No JVM (C++), lower tail latency, simpler ops
WarpStream	Cost-sensitive Kafka workloads	No inter-zone networking costs, object storage backend

3. Stream Processing Engines — Flink, Spark, ksqlDB

Kafka moves data. Stream processors transform it. Here's how the major engines compare.

Engine	Latency	State	Best For	Learning Curve
Apache Flink	< 10ms	Rich (RocksDB-backed)	Complex event processing, large state	Steep
Spark Structured Streaming	100ms - seconds	Good (checkpoint-based)	Teams already using Spark for batch	Moderate
ksqlDB	< 100ms	Good (Kafka Streams-backed)	SQL-first stream processing	Low (SQL)
Kafka Streams	< 10ms	Good (RocksDB)	Java/Kotlin microservices	Moderate

Apache Flink — Fraud Detection Example

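// Flink DataStream API + CEP library: detect rapid transactions from the same card
// Note: FlinkKafkaConsumer/FlinkKafkaProducer are the legacy connectors;
// newer Flink releases replace them with KafkaSource/KafkaSink
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Transaction> transactions = env
    .addSource(new FlinkKafkaConsumer<>("transactions",
        new TransactionDeserializer(), kafkaProps));

// Reused at every stage so all three matched transactions exceed $100
SimpleCondition<Transaction> overHundred = new SimpleCondition<Transaction>() {
    @Override
    public boolean filter(Transaction t) { return t.getAmount() > 100; }
};

// Pattern: 3 transactions over $100 within 60 seconds.
// No explicit "same card" condition is needed: the stream is keyed by card ID below,
// so each pattern match only ever sees one card's transactions.
Pattern<Transaction, ?> fraudPattern = Pattern.<Transaction>begin("first").where(overHundred)
    .followedBy("second").where(overHundred)
    .followedBy("third").where(overHundred)
    .within(Time.seconds(60));

// CEP runs in event time by default, so the source needs a watermark strategy
// (or call inProcessingTime() on the PatternStream)
PatternStream<Transaction> patternStream = CEP.pattern(
    transactions.keyBy(Transaction::getCardId), fraudPattern);

patternStream.select(new PatternSelectFunction<Transaction, FraudAlert>() {
    @Override
    public FraudAlert select(Map<String, List<Transaction>> pattern) {
        return new FraudAlert(pattern.get("first").get(0).getCardId(),
            "3 transactions > $100 within 60 seconds");
    }
}).addSink(new FlinkKafkaProducer<>("fraud-alerts", ...));

env.execute("Fraud Detection Pipeline");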

ksqlDB — Stream Processing with SQL

-- Create a stream from a Kafka topic
CREATE STREAM orders (
    order_id VARCHAR KEY,
    customer_id VARCHAR,
    amount DECIMAL(10,2),
    status VARCHAR,
    created_at TIMESTAMP
) WITH (
    kafka_topic = 'orders',
    value_format = 'JSON',
    timestamp = 'created_at'
);

-- Real-time aggregation: revenue per customer in tumbling (non-overlapping) 1-hour windows
CREATE TABLE customer_hourly_spend AS
SELECT customer_id,
       SUM(amount) AS total_spent,
       COUNT(*) AS order_count
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY customer_id
HAVING SUM(amount) > 1000
EMIT CHANGES;

-- This automatically produces to a Kafka topic
-- Downstream services consume it for real-time alerts

4. Stream Processing Patterns

Pattern	Description	Use Case
Tumbling Window	Fixed-size, non-overlapping time windows	Hourly revenue totals
Sliding Window	Overlapping windows that slide by an interval	Moving average, rate limiting
Session Window	Gap-based windows (close after inactivity)	User sessions, click streams
Stream-Stream Join	Join two streams within a time window	Match orders with payments
Stream-Table Join	Enrich stream events with reference data	Enrich orders with customer details
CDC (Change Data Capture)	Capture database changes as events	Sync databases, build materialized views
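
The three window types in the table differ only in how an event timestamp is mapped to a window. The sketch below shows that mapping with plain integers as timestamps (seconds). The function names are hypothetical and this is an illustration of the idea, not how any particular engine implements it; for simplicity, sliding windows that would start before time 0 are dropped.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window that contains ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length `size`, advancing by `slide`,
    that contains ts. Windows starting before 0 are dropped for simplicity."""
    last_start = ts - (ts % slide)
    starts = range(last_start, ts - size, -slide)
    return sorted((s, s + size) for s in starts if s >= 0)


def session_windows(timestamps, gap):
    """Group timestamps into sessions; a session closes after `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: new session
    return sessions


one_window = tumbling_window(125, size=60)
overlapping = sliding_windows(125, size=60, slide=30)
sessions = session_windows([1, 2, 3, 20, 21, 50], gap=10)
```

An event belongs to exactly one tumbling window, to `size / slide` sliding windows (two in this example), and to whichever session its neighbours in time put it in.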

CDC with Debezium — Database to Kafka

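// Debezium connector config: captures every INSERT/UPDATE/DELETE from PostgreSQL
// (database.password is also required; it is omitted here and should come from your secrets mechanism)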
{
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db.internal",
        "database.port": "5432",
        "database.dbname": "orders",
        "database.user": "debezium",
        "table.include.list": "public.orders,public.order_items",
        "topic.prefix": "cdc",
        "slot.name": "debezium_orders",
        "plugin.name": "pgoutput"
    }
}

// Result: Every change in orders table → Kafka topic "cdc.public.orders"
// Event payload includes: before (old values), after (new values), op (c = create, u = update, d = delete)
// Downstream consumers get real-time database change events
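
On the consuming side, a materialized view can be kept in sync by applying each change event in order. The sketch below assumes a simplified, flat event shape with just `op`, `before`, and `after` (real Debezium messages carry more fields and may be wrapped in an envelope depending on the converter), and the `apply_change` helper is hypothetical.

```python
def apply_change(view, event, key="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    op = event["op"]
    if op in ("c", "r", "u"):       # create, snapshot read, update -> upsert
        row = event["after"]
        view[row[key]] = row
    elif op == "d":                 # delete -> remove using the old row's key
        view.pop(event["before"][key], None)
    return view


events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "PLACED"}},
    {"op": "u", "before": {"id": 1, "status": "PLACED"},
     "after": {"id": 1, "status": "SHIPPED"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "PLACED"}},
    {"op": "d", "before": {"id": 2, "status": "PLACED"}, "after": None},
]

orders_view = {}
for e in events:
    apply_change(orders_view, e)
```

Because every operation is an upsert or a delete by key, replaying the same event twice leaves the view unchanged, which fits well with at-least-once delivery.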

5. Exactly-Once Semantics — The Hardest Problem

In distributed systems, messages can be lost (at-most-once), duplicated (at-least-once), or processed exactly once. Exactly-once is what everyone wants and the hardest to achieve.

Guarantee	Behavior	Use When	Performance
At-most-once	Fire and forget — may lose messages	Metrics, logs, non-critical events	Fastest
At-least-once	Retry on failure — may duplicate	Most workloads (with idempotent consumers)	Fast
Exactly-once	Each message processed exactly once	Financial transactions, billing	Slowest (transactions)

The Practical Approach: True exactly-once is expensive and slow. In most cases, at-least-once + idempotent consumers is the right tradeoff. Make your consumers handle duplicates gracefully (use idempotency keys, upserts, or deduplication windows) and you get effectively-exactly-once at a fraction of the cost.
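
A minimal sketch of the idempotency-key approach, assuming each event carries a unique `event_id` (a hypothetical field name). The `IdempotentLedger` class is illustrative only; in a real system the seen-key record and the state update would need to be committed atomically, for example in one database transaction.

```python
class IdempotentLedger:
    """Applies each payment event at most once, keyed by event_id."""

    def __init__(self):
        self.balances = {}
        self.seen = set()  # idempotency keys already processed

    def handle(self, event):
        """Apply the event; return False if it was a duplicate delivery."""
        if event["event_id"] in self.seen:
            return False
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.seen.add(event["event_id"])
        return True


ledger = IdempotentLedger()
deliveries = [
    {"event_id": "e1", "account": "acct-1", "amount": 50},
    {"event_id": "e2", "account": "acct-1", "amount": 25},
    {"event_id": "e1", "account": "acct-1", "amount": 50},  # redelivered after a retry
]
results = [ledger.handle(d) for d in deliveries]
```

The broker is free to redeliver `e1`, and the balance still reflects each payment once.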

6. Reference Architectures

Lambda Architecture (Batch + Stream)

                    ┌─── Batch Layer (Spark) ──→ Batch Views ───┐
Raw Events ──→ Kafka ─┤                                           ├──→ Serving Layer
                    └─── Speed Layer (Flink) ──→ Real-Time Views ─┘

Pros: Accurate (batch corrects streaming approximations)
Cons: Maintaining two codepaths, double the complexity

Kappa Architecture (Stream Only)

Raw Events ──→ Kafka ──→ Stream Processor (Flink) ──→ Serving Layer

Reprocessing: Reset consumer offset → replay from Kafka
Pros: Single codebase, simpler ops
Cons: Kafka retention must be long enough for full reprocessing

Our recommendation: Start with Kappa. Reach for Lambda only when you genuinely need batch-level accuracy that streaming can't provide, which is rare in 2026 given Flink's exactly-once guarantees.
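
The reprocessing step in Kappa amounts to running new logic over the same retained log from offset 0. Here is a small sketch of that idea, with a Python list standing in for a Kafka topic and hypothetical function names throughout.

```python
# Stands in for a Kafka topic whose retention covers the full history
event_log = [
    {"customer": "c1", "amount": 40},
    {"customer": "c2", "amount": 15},
    {"customer": "c1", "amount": 60},
]


def build_view(log, fold, from_offset=0):
    """Replay the log from `from_offset`, folding each event into a fresh view."""
    view = {}
    for event in log[from_offset:]:
        fold(view, event)
    return view


def total_spend_v1(view, e):
    view[e["customer"]] = view.get(e["customer"], 0) + e["amount"]


def total_spend_v2(view, e):
    # New business logic: track the order count alongside the total
    total, count = view.get(e["customer"], (0, 0))
    view[e["customer"]] = (total + e["amount"], count + 1)


view_v1 = build_view(event_log, total_spend_v1)
view_v2 = build_view(event_log, total_spend_v2)  # same log, new logic, offset reset to 0
```

In practice the new view is usually built alongside the old one and swapped in once it has caught up, which is only possible if the log still holds the events.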

Event-Driven Microservices

Order Service ──→ "order.placed" ──→ Kafka ──┬──→ Payment Service
                                              ├──→ Inventory Service
                                              ├──→ Notification Service
                                              └──→ Analytics Pipeline

Each service:
- Consumes events it cares about
- Produces events for downstream services
- Owns its own database (no shared state)
- Can be independently scaled and deployed

7. Operational Considerations

Concern	Solution	Tools
Consumer lag monitoring	Alert when lag exceeds threshold	Burrow, Kafka Exporter + Prometheus
Schema evolution	Enforce compatible schema changes	Schema Registry (Avro/Protobuf)
Dead letter queues	Route failed messages for investigation	Dedicated DLQ topics
Backpressure	Slow down producers when consumers lag	Flink built-in, Kafka quotas
Data lineage	Track data flow through pipelines	OpenLineage, DataHub, Marquez
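
Consumer lag, the first row above, is simple arithmetic: for each partition, the log end offset minus the group's committed offset. The sketch below uses hard-coded offsets and hypothetical helper names; tools such as Burrow or Kafka Exporter gather the real numbers from the cluster.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset - committed offset (never negative)."""
    return {
        p: max(end - committed_offsets.get(p, 0), 0)
        for p, end in end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, n in lag.items() if n > threshold)


end_offsets = {0: 1500, 1: 1480, 2: 9000}   # latest offset per partition
committed = {0: 1495, 1: 1480, 2: 2500}     # the group's committed offsets

lag = consumer_lag(end_offsets, committed)
alerts = lagging_partitions(lag, threshold=1000)
```

Alerting per partition rather than on the total is often more useful, since a single stuck partition can hide behind healthy ones.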

What We've Learned Running Kafka in Production: The hardest part isn't setting up Kafka — it's operating it. Partition rebalancing during deploys, consumer group management, and topic retention policies are where teams struggle. Start with a managed service (Confluent Cloud, Amazon MSK, Redpanda Cloud) unless you have dedicated platform engineers. The operational overhead of self-managed Kafka is real — we've seen teams spend 40% of their platform engineering time on Kafka operations alone.

8. Frequently Asked Questions

Do I need Kafka for real-time processing?

Not always. For simple use cases (< 10K events/sec, single consumer), Redis Streams or even PostgreSQL LISTEN/NOTIFY can work. Kafka shines at scale (millions of events/sec), in multi-consumer scenarios, and when you need durable event replay. Don't bring Kafka into a project that could use a simple message queue.

How many Kafka partitions should I use?

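Start with the number of consumers you expect in your largest consumer group, since consumers beyond the partition count sit idle. More partitions mean more parallelism, but also more overhead. A good starting point is 6-12 partitions for most topics, scaling to 30-50 for high-throughput topics. You can increase partitions later but not decrease them, and adding partitions changes which partition a key maps to, so per-key ordering is not preserved across the change.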

Flink or Spark Structured Streaming?

Flink for true real-time (sub-second latency, complex event processing, large stateful operations). Spark Structured Streaming if your team already uses Spark for batch and you can tolerate 100ms+ latency. Flink is the better stream processor; Spark is the better unified batch+stream platform.

How do I handle late-arriving events?

Use watermarks and allowed lateness. Watermarks tell the processor "all events before this timestamp have arrived." Allowed lateness keeps windows open for stragglers. Flink handles this natively — configure a watermark strategy with bounded out-of-orderness (e.g., 30 seconds) and set allowed lateness based on your business requirements.
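
The interaction between the two settings can be sketched in a few lines. This is a simplified model of the idea, not Flink's implementation: timestamps are plain integers (seconds), the watermark is the highest event time seen minus the out-of-orderness bound, windows are tumbling, and the `process` function is hypothetical.

```python
def process(events, window_size, max_out_of_orderness, allowed_lateness):
    """events: (event_time, value) pairs in arrival order.
    Returns (per-window sums keyed by window start, dropped events)."""
    sums, dropped = {}, []
    max_ts = float("-inf")
    for ts, value in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness
        start = ts - ts % window_size
        end = start + window_size
        # A window is discarded once the watermark passes end + allowed lateness
        if watermark >= end + allowed_lateness:
            dropped.append((ts, value))
        else:
            sums[start] = sums.get(start, 0) + value
    return sums, dropped


events = [
    (10, 1),    # on time
    (100, 1),   # advances the watermark to 70, past the end of window [0, 60)
    (50, 1),    # late for [0, 60), but within allowed lateness: still counted
    (200, 1),   # advances the watermark to 170
    (55, 1),    # too late: [0, 60) was discarded at watermark 120
]
sums, dropped = process(events, window_size=60,
                        max_out_of_orderness=30, allowed_lateness=60)
```

Raising the out-of-orderness bound delays every result, while raising allowed lateness keeps results timely but means a window can emit updated values after its first output.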

What's the cost of running a Kafka cluster?

Self-managed: 3-5 brokers at ~$500-1,500/month each (cloud VMs) plus storage. Managed services: Confluent Cloud starts at ~$1/hour for basic clusters; Amazon MSK at ~$0.25/broker-hour. For most teams, managed services cost 20-30% more but save significant engineering time. Factor in the cost of a dedicated Kafka engineer ($150K+ salary) when comparing.
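
To compare the quoted figures on the same monthly basis, the arithmetic looks like this. It assumes roughly 730 hours per month and a 3-broker MSK cluster (both assumptions for illustration), and it covers compute only, excluding storage, networking, and engineering time.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def self_managed_range(min_brokers=3, max_brokers=5, low=500, high=1500):
    """Monthly compute range in dollars for self-managed brokers (storage excluded)."""
    return (min_brokers * low, max_brokers * high)


def hourly_to_monthly(rate_per_hour, units=1):
    """Convert an hourly rate (optionally per unit, e.g. per broker) to a monthly cost."""
    return rate_per_hour * units * HOURS_PER_MONTH


self_managed = self_managed_range()               # (1500, 7500)
confluent_basic = hourly_to_monthly(1.0)          # 730.0
msk_three_brokers = hourly_to_monthly(0.25, 3)    # 547.5
```

These are entry-level list prices from the figures above, so they set a floor rather than predict a real bill.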

