In This Guide
When a customer places an order, the fraud detection system can't wait for tonight's batch job. When a sensor reading spikes, the alert can't come tomorrow morning. Real-time data processing has gone from a nice-to-have to a requirement for any system where latency matters — and that's most systems in 2026.
This guide covers the full stack: message brokers (Kafka), stream processors (Flink), processing patterns, delivery guarantees, and battle-tested architectures. We'll skip the theory-heavy parts and focus on what you need to build and run these systems.
1. Why Real-Time? Batch vs Stream Processing
| Factor | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data model | Bounded datasets (files, tables) | Unbounded event streams |
| Processing | Run periodically (cron, Airflow) | Runs continuously |
| State | Stateless (recompute from source) | Stateful (maintains running state) |
| Complexity | Low — familiar SQL/Python | High — windowing, ordering, late data |
| Cost | Pay for compute when running | Pay for always-on infrastructure |
| Best for | Reporting, ML training, ETL | Fraud detection, monitoring, real-time dashboards |
2. Apache Kafka — The Event Backbone
Kafka is a distributed event streaming platform. It's the backbone of most real-time architectures — not because it processes data, but because it durably stores and distributes events between producers and consumers.
Kafka Architecture — Key Concepts
Producer (Order Service)
│
▼
┌──────────────────────────────────┐
│ Topic: "orders" │
│ ┌──────────┬──────────┬────────┐│
│ │Partition 0│Partition 1│Partition 2││
│ │ msg1,msg4 │ msg2,msg5 │ msg3,msg6 ││
│ └──────────┴──────────┴────────┘│
│ Retention: 7 days (configurable) │
└──────────────────────────────────┘
│ │ │
▼ ▼ ▼
Consumer Group A Consumer Group B
(Fraud Detection) (Analytics Pipeline)
├─ Consumer 1 ←─ P0 ├─ Consumer 1 ←─ P0, P1
├─ Consumer 2 ←─ P1 └─ Consumer 2 ←─ P2
└─ Consumer 3 ←─ P2
Key insight: Each consumer group tracks its own offset independently.
Group A can be real-time; Group B can be behind by hours — both are fine.
Kafka Producer — Java Example
// Producer with idempotent writes (prevents duplicates on retry)
Properties props = new Properties();
props.put("bootstrap.servers", "kafka-1:9092,kafka-2:9092,kafka-3:9092");
props.put("enable.idempotence", "true"); // Exactly-once producer
props.put("acks", "all"); // Wait for all replicas
props.put("key.serializer", StringSerializer.class);
props.put("value.serializer", JsonSerializer.class);
KafkaProducer<String, OrderEvent> producer = new KafkaProducer<>(props);
// Key determines partition — same customer_id always goes to same partition
// This guarantees ordering per customer
OrderEvent event = new OrderEvent("ord-12345", "customer-789", "PLACED", items);
producer.send(new ProducerRecord<>("orders", event.customerId(), event),
(metadata, exception) -> {
if (exception != null) log.error("Failed to send", exception);
else log.info("Sent to partition {} offset {}", metadata.partition(), metadata.offset());
});
// Python equivalent (confluent-kafka)
from confluent_kafka import Producer
producer = Producer({'bootstrap.servers': 'kafka-1:9092',
'enable.idempotence': True})
producer.produce('orders', key='customer-789',
value=json.dumps(order_event).encode())
| Kafka Configuration | Low Latency | High Throughput | High Durability |
|---|---|---|---|
| acks | 1 | 0 or 1 | all |
| batch.size | 1 (no batching) | 64KB - 1MB | 16KB |
| linger.ms | 0 | 50-200 | 5 |
| compression | none | lz4 or zstd | zstd |
| replication.factor | 2 | 2 | 3 |
Kafka Alternatives
| Platform | Best For | Advantage Over Kafka |
|---|---|---|
| Apache Pulsar | Multi-tenancy, geo-replication | Separate storage/compute, native multi-tenancy |
| Amazon Kinesis | AWS-native, serverless | Zero ops, auto-scaling |
| Redpanda | Kafka API-compatible, lower latency | No JVM (C++), lower tail latency, simpler ops |
| WarpStream | Cost-sensitive Kafka workloads | No inter-zone networking costs, object storage backend |
3. Stream Processing Engines — Flink, Spark, ksqlDB
Kafka moves data. Stream processors transform it. Here's how the major engines compare.
| Engine | Latency | State | Best For | Learning Curve |
|---|---|---|---|---|
| Apache Flink | < 10ms | Rich (RocksDB-backed) | Complex event processing, large state | Steep |
| Spark Structured Streaming | 100ms - seconds | Good (checkpoint-based) | Teams already using Spark for batch | Moderate |
| ksqlDB | < 100ms | Good (Kafka Streams-backed) | SQL-first stream processing | Low (SQL) |
| Kafka Streams | < 10ms | Good (RocksDB) | Java/Kotlin microservices | Moderate |
Apache Flink — Fraud Detection Example
// Flink DataStream API — detect rapid transactions from same card
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Transaction> transactions = env
.addSource(new FlinkKafkaConsumer<>("transactions",
new TransactionDeserializer(), kafkaProps));
// Pattern: 3+ transactions within 60 seconds from same card
Pattern<Transaction, ?> fraudPattern = Pattern.<Transaction>begin("first")
.where(new SimpleCondition<Transaction>() {
public boolean filter(Transaction t) { return t.getAmount() > 100; }
})
.followedBy("second").where(/* same card */)
.followedBy("third").where(/* same card */)
.within(Time.seconds(60));
PatternStream<Transaction> patternStream = CEP.pattern(
transactions.keyBy(Transaction::getCardId), fraudPattern);
patternStream.select(new PatternSelectFunction<Transaction, FraudAlert>() {
public FraudAlert select(Map<String, List<Transaction>> pattern) {
return new FraudAlert(pattern.get("first").get(0).getCardId(),
"3 transactions > $100 within 60 seconds");
}
}).addSink(new FlinkKafkaProducer<>("fraud-alerts", ...));
env.execute("Fraud Detection Pipeline");
ksqlDB — Stream Processing with SQL
-- Create a stream from a Kafka topic
CREATE STREAM orders (
order_id VARCHAR KEY,
customer_id VARCHAR,
amount DECIMAL(10,2),
status VARCHAR,
created_at TIMESTAMP
) WITH (
kafka_topic = 'orders',
value_format = 'JSON',
timestamp = 'created_at'
);
-- Real-time aggregation: revenue per customer in sliding 1-hour window
CREATE TABLE customer_hourly_spend AS
SELECT customer_id,
SUM(amount) AS total_spent,
COUNT(*) AS order_count
FROM orders
WINDOW TUMBLING (SIZE 1 HOUR)
GROUP BY customer_id
HAVING SUM(amount) > 1000
EMIT CHANGES;
-- This automatically produces to a Kafka topic
-- Downstream services consume it for real-time alerts
4. Stream Processing Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Tumbling Window | Fixed-size, non-overlapping time windows | Hourly revenue totals |
| Sliding Window | Overlapping windows that slide by an interval | Moving average, rate limiting |
| Session Window | Gap-based windows (close after inactivity) | User sessions, click streams |
| Stream-Stream Join | Join two streams within a time window | Match orders with payments |
| Stream-Table Join | Enrich stream events with reference data | Enrich orders with customer details |
| CDC (Change Data Capture) | Capture database changes as events | Sync databases, build materialized views |
CDC with Debezium — Database to Kafka
// Debezium connector config — captures every INSERT/UPDATE/DELETE from PostgreSQL
{
"name": "orders-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "orders-db.internal",
"database.port": "5432",
"database.dbname": "orders",
"database.user": "debezium",
"table.include.list": "public.orders,public.order_items",
"topic.prefix": "cdc",
"slot.name": "debezium_orders",
"plugin.name": "pgoutput"
}
}
// Result: Every change in orders table → Kafka topic "cdc.public.orders"
// Event payload includes: before (old values), after (new values), operation (c/u/d)
// Downstream consumers get real-time database change events
5. Exactly-Once Semantics — The Hardest Problem
In distributed systems, messages can be lost (at-most-once), duplicated (at-least-once), or processed exactly once. Exactly-once is what everyone wants and the hardest to achieve.
| Guarantee | Behavior | Use When | Performance |
|---|---|---|---|
| At-most-once | Fire and forget — may lose messages | Metrics, logs, non-critical events | Fastest |
| At-least-once | Retry on failure — may duplicate | Most workloads (with idempotent consumers) | Fast |
| Exactly-once | Each message processed exactly once | Financial transactions, billing | Slowest (transactions) |
6. Reference Architectures
Lambda Architecture (Batch + Stream)
┌─── Batch Layer (Spark) ──→ Batch Views ───┐
Raw Events ──→ Kafka ─┤ ├──→ Serving Layer
└─── Speed Layer (Flink) ──→ Real-Time Views ─┘
Pros: Accurate (batch corrects streaming approximations)
Cons: Maintaining two codepaths, double the complexity
Kappa Architecture (Stream Only)
Raw Events ──→ Kafka ──→ Stream Processor (Flink) ──→ Serving Layer
Reprocessing: Reset consumer offset → replay from Kafka
Pros: Single codebase, simpler ops
Cons: Kafka retention must be long enough for full reprocessing
Our recommendation: Start with Kappa. Lambda only when you genuinely need batch-level accuracy that streaming can't provide (rare in 2026 with Flink's exactly-once guarantees).
Event-Driven Microservices
Order Service ──→ "order.placed" ──→ Kafka ──┬──→ Payment Service
├──→ Inventory Service
├──→ Notification Service
└──→ Analytics Pipeline
Each service:
- Consumes events it cares about
- Produces events for downstream services
- Owns its own database (no shared state)
- Can be independently scaled and deployed
7. Operational Considerations
| Concern | Solution | Tools |
|---|---|---|
| Consumer lag monitoring | Alert when lag exceeds threshold | Burrow, Kafka Exporter + Prometheus |
| Schema evolution | Enforce compatible schema changes | Schema Registry (Avro/Protobuf) |
| Dead letter queues | Route failed messages for investigation | Dedicated DLQ topics |
| Backpressure | Slow down producers when consumers lag | Flink built-in, Kafka quotas |
| Data lineage | Track data flow through pipelines | OpenLineage, DataHub, Marquez |
Frequently Asked Questions
Do I need Kafka for real-time processing?
Not always. For simple use cases (< 10K events/sec, single consumer), Redis Streams or even a PostgreSQL LISTEN/NOTIFY can work. Kafka shines at scale (millions of events/sec), multi-consumer scenarios, and when you need durable event replay. Don't bring Kafka into a project that could use a simple message queue.
How many Kafka partitions should I use?
Start with the number of consumers you expect in your largest consumer group. More partitions = more parallelism, but also more overhead. A good starting point is 6-12 partitions for most topics, scaling to 30-50 for high-throughput topics. You can increase partitions later but not decrease them.
Flink or Spark Structured Streaming?
Flink for true real-time (sub-second latency, complex event processing, large stateful operations). Spark Streaming if your team already uses Spark for batch and you can tolerate 100ms+ latency. Flink is the better stream processor; Spark is the better unified batch+stream platform.
How do I handle late-arriving events?
Use watermarks and allowed lateness. Watermarks tell the processor "all events before this timestamp have arrived." Allowed lateness keeps windows open for stragglers. Flink handles this natively — configure a watermark strategy with bounded out-of-orderness (e.g., 30 seconds) and set allowed lateness based on your business requirements.
What's the cost of running a Kafka cluster?
Self-managed: 3-5 brokers at ~$500-1,500/month each (cloud VMs) plus storage. Managed services: Confluent Cloud starts at ~$1/hour for basic clusters; Amazon MSK at ~$0.25/broker-hour. For most teams, managed services cost 20-30% more but save significant engineering time. Factor in the cost of a dedicated Kafka engineer ($150K+ salary) when comparing.
Pillai Infotech LLP
We design and build real-time data pipelines — from Kafka deployment to Flink stream processing. Let's architect your streaming infrastructure.