Ideas Engineered for Tomorrow
We Engineer Services & Solutions for Your Business Needs
Home About
Products
Services
Hire
Industries
Consulting
Partners
Articles Careers Contact
Data & Analytics

Real-Time Data Processing: Kafka, Flink, and Stream Architecture

Batch processing means your data is always hours old. Stream processing means it's seconds old. Here's how to build real-time pipelines that actually work in production.

⚡ Database & Data February 8, 2026 14 min read

In This Guide

When a customer places an order, the fraud detection system can't wait for tonight's batch job. When a sensor reading spikes, the alert can't come tomorrow morning. Real-time data processing has gone from a nice-to-have to a requirement for any system where latency matters — and that's most systems in 2026.

This guide covers the full stack: message brokers (Kafka), stream processors (Flink), processing patterns, delivery guarantees, and battle-tested architectures. We'll skip the theory-heavy parts and focus on what you need to build and run these systems.

1. Why Real-Time? Batch vs Stream Processing

Factor Batch Processing Stream Processing
LatencyMinutes to hoursMilliseconds to seconds
Data modelBounded datasets (files, tables)Unbounded event streams
ProcessingRun periodically (cron, Airflow)Runs continuously
StateStateless (recompute from source)Stateful (maintains running state)
ComplexityLow — familiar SQL/PythonHigh — windowing, ordering, late data
CostPay for compute when runningPay for always-on infrastructure
Best forReporting, ML training, ETLFraud detection, monitoring, real-time dashboards
Don't Force Real-Time: If your stakeholders check a dashboard once a day, a nightly batch job is the right answer — simpler, cheaper, more reliable. Stream processing shines when the business value degrades with latency: fraud that's caught in 100ms vs 1 hour, recommendations that reflect a click from 2 seconds ago vs yesterday.

2. Apache Kafka — The Event Backbone

Kafka is a distributed event streaming platform. It's the backbone of most real-time architectures — not because it processes data, but because it durably stores and distributes events between producers and consumers.

Kafka Architecture — Key Concepts

Producer (Order Service)
    │
    ▼
┌──────────────────────────────────┐
│  Topic: "orders"                  │
│  ┌──────────┬──────────┬────────┐│
│  │Partition 0│Partition 1│Partition 2││
│  │ msg1,msg4 │ msg2,msg5 │ msg3,msg6 ││
│  └──────────┴──────────┴────────┘│
│  Retention: 7 days (configurable) │
└──────────────────────────────────┘
    │              │              │
    ▼              ▼              ▼
Consumer Group A        Consumer Group B
(Fraud Detection)       (Analytics Pipeline)
 ├─ Consumer 1 ←─ P0    ├─ Consumer 1 ←─ P0, P1
 ├─ Consumer 2 ←─ P1    └─ Consumer 2 ←─ P2
 └─ Consumer 3 ←─ P2

Key insight: Each consumer group tracks its own offset independently.
Group A can be real-time; Group B can be behind by hours — both are fine.

Kafka Producer — Java Example

// Producer with idempotent writes (prevents duplicates on retry)
Properties props = new Properties();
props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092");
props.put("enable.idempotence", "true");       // Exactly-once producer
props.put("acks", "all");                       // Wait for all replicas
props.put("key.serializer", StringSerializer.class);
props.put("value.serializer", JsonSerializer.class);

KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);

// Key determines partition — same customer_id always goes to same partition
// This guarantees ordering per customer
OrderEvent event = new OrderEvent("ord-12345", "customer-789", "PLACED", items);
producer.send(new ProducerRecord<>("orders", event.customerId(), event),
    (metadata, exception) -> {
        if (exception != null) log.error("Failed to send", exception);
        else log.info("Sent to partition {} offset {}", metadata.partition(), metadata.offset());
    });

// Python equivalent (confluent-kafka)
from confluent_kafka import Producer

producer = Producer({'bootstrap.servers': 'kafka-1:9092',
                     'enable.idempotence': True})
producer.produce('orders', key='customer-789',
                 value=json.dumps(order_event).encode())
Kafka Configuration Low Latency High Throughput High Durability
acks10 or 1all
batch.size1 (no batching)64KB - 1MB16KB
linger.ms050-2005
compressionnonelz4 or zstdzstd
replication.factor223

Kafka Alternatives

Platform Best For Advantage Over Kafka
Apache PulsarMulti-tenancy, geo-replicationSeparate storage/compute, native multi-tenancy
Amazon KinesisAWS-native, serverlessZero ops, auto-scaling
RedpandaKafka API-compatible, lower latencyNo JVM (C++), lower tail latency, simpler ops
WarpStreamCost-sensitive Kafka workloadsNo inter-zone networking costs, object storage backend

3. Stream Processing Engines — Flink, Spark, ksqlDB

Kafka moves data. Stream processors transform it. Here's how the major engines compare.

Engine Latency State Best For Learning Curve
Apache Flink< 10msRich (RocksDB-backed)Complex event processing, large stateSteep
Spark Structured Streaming100ms - secondsGood (checkpoint-based)Teams already using Spark for batchModerate
ksqlDB< 100msGood (Kafka Streams-backed)SQL-first stream processingLow (SQL)
Kafka Streams< 10msGood (RocksDB)Java/Kotlin microservicesModerate

Apache Flink — Fraud Detection Example

// Flink DataStream API — detect rapid transactions from same card
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Transaction> transactions = env
    .addSource(new FlinkKafkaConsumer<>("transactions",
        new TransactionDeserializer(), kafkaProps));

// Pattern: 3+ transactions within 60 seconds from same card
Pattern<Transaction, ?> fraudPattern = Pattern.<Transaction>begin("first")
    .where(new SimpleCondition<Transaction>() {
        public boolean filter(Transaction t) { return t.getAmount() > 100; }
    })
    .followedBy("second").where(/* same card */)
    .followedBy("third").where(/* same card */)
    .within(Time.seconds(60));

PatternStream<Transaction> patternStream = CEP.pattern(
    transactions.keyBy(Transaction::getCardId), fraudPattern);

patternStream.select(new PatternSelectFunction<Transaction, FraudAlert>() {
    public FraudAlert select(Map<String, List<Transaction>> pattern) {
        return new FraudAlert(pattern.get("first").get(0).getCardId(),
            "3 transactions > $100 within 60 seconds");
    }
}).addSink(new FlinkKafkaProducer<>("fraud-alerts", ...));

env.execute("Fraud Detection Pipeline");

ksqlDB — Stream Processing with SQL

-- Create a stream from a Kafka topic
CREATE STREAM orders (
    order_id VARCHAR KEY,
    customer_id VARCHAR,
    amount DECIMAL(10,2),
    status VARCHAR,
    created_at TIMESTAMP
) WITH (
    kafka_topic = 'orders',
    value_format = 'JSON',
    timestamp = 'created_at'
);

-- Real-time aggregation: revenue per customer in sliding 1-hour window
CREATE TABLE customer_hourly_spend AS
SELECT customer_id,
       SUM(amount) AS total_spent,
       COUNT(*) AS order_count
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY customer_id
HAVING SUM(amount) > 1000
EMIT CHANGES;

-- This automatically produces to a Kafka topic
-- Downstream services consume it for real-time alerts

4. Stream Processing Patterns

Pattern Description Use Case
Tumbling WindowFixed-size, non-overlapping time windowsHourly revenue totals
Sliding WindowOverlapping windows that slide by an intervalMoving average, rate limiting
Session WindowGap-based windows (close after inactivity)User sessions, click streams
Stream-Stream JoinJoin two streams within a time windowMatch orders with payments
Stream-Table JoinEnrich stream events with reference dataEnrich orders with customer details
CDC (Change Data Capture)Capture database changes as eventsSync databases, build materialized views

CDC with Debezium — Database to Kafka

// Debezium connector config — captures every INSERT/UPDATE/DELETE from PostgreSQL
{
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db.internal",
        "database.port": "5432",
        "database.dbname": "orders",
        "database.user": "debezium",
        "table.include.list": "public.orders,public.order_items",
        "topic.prefix": "cdc",
        "slot.name": "debezium_orders",
        "plugin.name": "pgoutput"
    }
}

// Result: Every change in orders table → Kafka topic "cdc.public.orders"
// Event payload includes: before (old values), after (new values), operation (c/u/d)
// Downstream consumers get real-time database change events

5. Exactly-Once Semantics — The Hardest Problem

In distributed systems, messages can be lost (at-most-once), duplicated (at-least-once), or processed exactly once. Exactly-once is what everyone wants and the hardest to achieve.

Guarantee Behavior Use When Performance
At-most-onceFire and forget — may lose messagesMetrics, logs, non-critical eventsFastest
At-least-onceRetry on failure — may duplicateMost workloads (with idempotent consumers)Fast
Exactly-onceEach message processed exactly onceFinancial transactions, billingSlowest (transactions)
The Practical Approach: True exactly-once is expensive and slow. In most cases, at-least-once + idempotent consumers is the right tradeoff. Make your consumers handle duplicates gracefully (use idempotency keys, upserts, or deduplication windows) and you get effectively-exactly-once at a fraction of the cost.

6. Reference Architectures

Lambda Architecture (Batch + Stream)

                    ┌─── Batch Layer (Spark) ──→ Batch Views ───┐
Raw Events ──→ Kafka ─┤                                           ├──→ Serving Layer
                    └─── Speed Layer (Flink) ──→ Real-Time Views ─┘

Pros: Accurate (batch corrects streaming approximations)
Cons: Maintaining two codepaths, double the complexity

Kappa Architecture (Stream Only)

Raw Events ──→ Kafka ──→ Stream Processor (Flink) ──→ Serving Layer

Reprocessing: Reset consumer offset → replay from Kafka
Pros: Single codebase, simpler ops
Cons: Kafka retention must be long enough for full reprocessing

Our recommendation: Start with Kappa. Lambda only when you genuinely need batch-level accuracy that streaming can't provide (rare in 2026 with Flink's exactly-once guarantees).

Event-Driven Microservices

Order Service ──→ "order.placed" ──→ Kafka ──┬──→ Payment Service
                                              ├──→ Inventory Service
                                              ├──→ Notification Service
                                              └──→ Analytics Pipeline

Each service:
- Consumes events it cares about
- Produces events for downstream services
- Owns its own database (no shared state)
- Can be independently scaled and deployed

7. Operational Considerations

Concern Solution Tools
Consumer lag monitoringAlert when lag exceeds thresholdBurrow, Kafka Exporter + Prometheus
Schema evolutionEnforce compatible schema changesSchema Registry (Avro/Protobuf)
Dead letter queuesRoute failed messages for investigationDedicated DLQ topics
BackpressureSlow down producers when consumers lagFlink built-in, Kafka quotas
Data lineageTrack data flow through pipelinesOpenLineage, DataHub, Marquez
What We've Learned Running Kafka in Production: The hardest part isn't setting up Kafka — it's operating it. Partition rebalancing during deploys, consumer group management, and topic retention policies are where teams struggle. Start with a managed service (Confluent Cloud, Amazon MSK, Redpanda Cloud) unless you have dedicated platform engineers. The operational overhead of self-managed Kafka is real — we've seen teams spend 40% of their platform engineering time on Kafka operations alone.

Frequently Asked Questions

Do I need Kafka for real-time processing?

Not always. For simple use cases (< 10K events/sec, single consumer), Redis Streams or even a PostgreSQL LISTEN/NOTIFY can work. Kafka shines at scale (millions of events/sec), multi-consumer scenarios, and when you need durable event replay. Don't bring Kafka into a project that could use a simple message queue.

How many Kafka partitions should I use?

Start with the number of consumers you expect in your largest consumer group. More partitions = more parallelism, but also more overhead. A good starting point is 6-12 partitions for most topics, scaling to 30-50 for high-throughput topics. You can increase partitions later but not decrease them.

Flink or Spark Structured Streaming?

Flink for true real-time (sub-second latency, complex event processing, large stateful operations). Spark Streaming if your team already uses Spark for batch and you can tolerate 100ms+ latency. Flink is the better stream processor; Spark is the better unified batch+stream platform.

How do I handle late-arriving events?

Use watermarks and allowed lateness. Watermarks tell the processor "all events before this timestamp have arrived." Allowed lateness keeps windows open for stragglers. Flink handles this natively — configure a watermark strategy with bounded out-of-orderness (e.g., 30 seconds) and set allowed lateness based on your business requirements.

What's the cost of running a Kafka cluster?

Self-managed: 3-5 brokers at ~$500-1,500/month each (cloud VMs) plus storage. Managed services: Confluent Cloud starts at ~$1/hour for basic clusters; Amazon MSK at ~$0.25/broker-hour. For most teams, managed services cost 20-30% more but save significant engineering time. Factor in the cost of a dedicated Kafka engineer ($150K+ salary) when comparing.

Pillai Infotech LLP

We design and build real-time data pipelines — from Kafka deployment to Flink stream processing. Let's architect your streaming infrastructure.

Related Articles

Data Engineering Fundamentals: Building Modern Data Pipelines → Database Scaling Strategies: Sharding, Replication, and Caching → Redis Caching Patterns: Beyond Simple Key-Value →