In This Guide
- 1. Why Event-Driven? The Coupling Problem
- 2. Core EDA Patterns
- 3. Message Brokers Compared
- 4. Event Sourcing — When the Event IS the Data
- 5. CQRS — Separate Reads from Writes
- 6. Eventual Consistency and the Saga Pattern
- 7. Practical Implementation with Kafka
- 8. Pitfalls That Sink EDA Projects
- 9. When NOT to Use Event-Driven Architecture
- 10. Frequently Asked Questions
Event-driven architecture decouples services by having them communicate through events instead of direct API calls. A service publishes an event ("order placed"), and any interested service reacts to it — the publisher doesn't know or care who's listening. We moved one of our client's order processing system from synchronous REST calls to event-driven with Kafka, and what used to be a cascading failure across 6 services when payments went down became a 30-second delay with automatic retry. The system stayed up. But getting there required rethinking how we handle data consistency, debugging, and error recovery.
1. Why Event-Driven? The Coupling Problem
In a traditional request-response system, Service A calls Service B, which calls Service C. If C is down, B times out, which makes A time out, and your user sees an error. This is temporal coupling — services must be available at the same time.
Event-driven architecture breaks this chain. Service A publishes an event and moves on. Services B and C process it whenever they're ready — in milliseconds or in minutes. This isn't just about fault tolerance. It changes how teams work. The payments team can deploy independently because they only consume events — they don't need the orders team to update an API contract.
| Dimension | Request-Response | Event-Driven |
|---|---|---|
| Temporal coupling | All services must be up simultaneously | Services process at their own pace |
| Knowledge coupling | Caller knows the callee's API | Publisher doesn't know consumers |
| Failure propagation | Cascading failures across the chain | Isolated — failed consumers retry independently |
| Scaling | Scale each service in the call chain | Scale consumers independently per topic |
| Adding features | Modify existing services, coordinate deploys | Add new consumer — zero changes to publisher |
| Debugging | Follow the call stack | Correlate across distributed logs (harder) |
The debugging row is real. We've spent hours tracing why an order's email didn't send, only to discover the email consumer was 3 minutes behind because a batch job had flooded the topic. With request-response, you'd have gotten an error immediately. That's the trade-off.
2. Core EDA Patterns
Event Notification
The simplest pattern. A service publishes a thin event saying "something happened" — just an ID and event type. Consumers fetch details themselves if needed.
// Event notification — thin payload
{
"event": "order.placed",
"orderId": "ord_8x92k",
"timestamp": "2026-01-22T10:30:00Z",
"version": 1
}
// Consumer fetches details if it needs them
// GET /api/orders/ord_8x92k
Pro: Small messages, publisher doesn't need to know what consumers want. Con: Creates runtime coupling — consumer must call back to the publisher to get details.
Event-Carried State Transfer
The event carries all the data consumers need. No callbacks. This is what we use for most of our systems.
// Event-carried state — fat payload
{
"event": "order.placed",
"orderId": "ord_8x92k",
"timestamp": "2026-01-22T10:30:00Z",
"version": 1,
"data": {
"customerId": "cust_4fn2",
"email": "buyer@example.com",
"items": [
{ "sku": "WIDGET-01", "qty": 3, "price": 29.99 }
],
"total": 89.97,
"currency": "USD",
"shippingAddress": { "city": "Mumbai", "zip": "400001" }
}
}
Pro: True decoupling — consumers never call back. Con: Larger messages, and consumers might cache stale data. We mitigate this by always including a version number and using idempotency keys.
Domain Events vs Integration Events
This distinction matters more than most teams realize:
- Domain events — internal to a bounded context. "InventoryReserved" is meaningful inside the warehouse service. Fine-grained, can change freely.
- Integration events — cross service boundaries. "OrderShipped" goes to billing, notifications, analytics. These are your API contract — version them carefully.
We keep domain events on an internal bus (in-process or Redis Streams) and only publish integration events to Kafka. This prevents internal refactoring from breaking downstream consumers.
3. Message Brokers Compared
The broker you choose shapes your architecture more than you'd expect. Here's what actually matters after running each in production:
| Broker | Best For | Throughput | Retention | Ops Complexity |
|---|---|---|---|---|
| Apache Kafka | Event streaming, replay, high volume | 1M+ msgs/sec | Configurable (days/forever) | High (ZooKeeper/KRaft) |
| RabbitMQ | Task queues, routing, RPC | 50K msgs/sec | Until consumed (no replay) | Low-Medium |
| Amazon SQS/SNS | AWS-native, serverless | Nearly unlimited | 14 days max (SQS) | Zero (managed) |
| Redis Streams | Lightweight, already-have-Redis | 200K msgs/sec | Memory-limited | Low |
| NATS JetStream | Edge, IoT, lightweight streaming | 500K+ msgs/sec | Configurable | Low |
| Amazon EventBridge | Event routing, schema registry | High (managed) | Replay up to 24hrs | Zero (managed) |
4. Event Sourcing — When the Event IS the Data
Most applications store current state: "Account balance is $500." Event sourcing stores every change that led to that state: "Deposited $1000, Withdrew $300, Deposited $100, Withdrew $300." The current state is derived by replaying events.
This sounds academic until you need it. A fintech client needed to answer "Why does this account show $347.23?" — with a traditional database, you'd look at the balance column and shrug. With event sourcing, you replay the event stream and see every deposit, withdrawal, fee, and correction that produced that number.
Traditional DB (stores current state):
┌─────────────────────────────────────┐
│ accounts table │
│ id: acc_001 │ balance: 500.00 │
└─────────────────────────────────────┘
How did we get here? No idea.
Event Store (stores every change):
┌─────────────────────────────────────────────────┐
│ Stream: account-acc_001 │
├──────┬─────────────────┬────────────┬───────────┤
│ Seq │ Event Type │ Amount │ Balance │
├──────┼─────────────────┼────────────┼───────────┤
│ 1 │ AccountOpened │ +1000.00 │ 1000.00 │
│ 2 │ FundsWithdrawn │ -300.00 │ 700.00 │
│ 3 │ FundsDeposited │ +100.00 │ 800.00 │
│ 4 │ FundsWithdrawn │ -300.00 │ 500.00 │
└──────┴─────────────────┴────────────┴───────────┘
Complete audit trail. Replay to any point in time.
Event Sourcing Implementation (Node.js/TypeScript)
// Event types
type AccountEvent =
| { type: 'AccountOpened'; balance: number }
| { type: 'FundsDeposited'; amount: number }
| { type: 'FundsWithdrawn'; amount: number }
| { type: 'AccountFrozen'; reason: string };
// State derived from events — never stored directly
interface AccountState {
id: string;
balance: number;
status: 'active' | 'frozen';
version: number;
}
// Pure function: fold events into state
function applyEvent(state: AccountState, event: AccountEvent): AccountState {
switch (event.type) {
case 'AccountOpened':
return { ...state, balance: event.balance, status: 'active' };
case 'FundsDeposited':
return { ...state, balance: state.balance + event.amount };
case 'FundsWithdrawn':
return { ...state, balance: state.balance - event.amount };
case 'AccountFrozen':
return { ...state, status: 'frozen' };
}
}
// Rebuild state by replaying all events
function rebuildAccount(events: AccountEvent[]): AccountState {
const initial: AccountState = { id: '', balance: 0, status: 'active', version: 0 };
return events.reduce((state, event, idx) => ({
...applyEvent(state, event),
version: idx + 1
}), initial);
}
When Event Sourcing Makes Sense
| Use It When | Skip It When |
|---|---|
| Full audit trail is a requirement (finance, healthcare) | Simple CRUD with no history needs |
| You need to reconstruct past states ("what was the balance on March 3?") | Current state is all that matters |
| Multiple read models from the same data (reporting, search, analytics) | Single read pattern |
| Domain experts think in events ("order placed", "payment received") | Your team hasn't worked with eventual consistency before |
5. CQRS — Separate Reads from Writes
Command Query Responsibility Segregation means using different models for reading and writing data. Commands change state (PlaceOrder, CancelSubscription). Queries read state (GetOrderDetails, ListActiveSubscriptions). In a traditional app, both use the same database and the same model. CQRS splits them.
CQRS Architecture:
┌──────────┐ ┌──────────────┐
│ Commands │──────────▶│ Write Model │
│ (POST/PUT)│ │ (normalized) │
└──────────┘ └──────┬───────┘
│ Events
▼
┌──────────────┐
│ Event Bus │
└──────┬───────┘
│
▼
┌──────────┐ ┌──────────────┐
│ Queries │◀─────────── │ Read Model │
│ (GET) │ │ (denormalized)│
└──────────┘ └──────────────┘
Write: PostgreSQL (normalized, ACID)
Read: Elasticsearch (search), Redis (dashboard), DynamoDB (API)
Why bother? Because reads and writes have fundamentally different needs. Your write model should be normalized and enforce business rules. Your read model should be denormalized and optimized for the exact queries your UI needs. An e-commerce product page needs: product details, reviews, pricing, inventory, related products. Joining 6 tables on every page load is painful. With CQRS, you pre-compute a single document with everything the page needs.
We use CQRS without event sourcing on most projects. They pair well together, but they're independent patterns. You can have CQRS with a regular database — just use separate tables or databases for reads and writes, with a projection that keeps the read side updated.
6. Eventual Consistency and the Saga Pattern
In distributed event-driven systems, you lose ACID transactions across services. You place an order (Order Service), reserve inventory (Inventory Service), charge the card (Payment Service). What if payment fails after inventory is reserved? You need a way to coordinate rollback across services.
Saga Pattern: Choreography vs Orchestration
| Choreography | Orchestration | |
|---|---|---|
| How it works | Each service listens for events and reacts | A central orchestrator tells each service what to do |
| Coupling | Low — services only know about events | Higher — orchestrator knows all participants |
| Visibility | Hard to see the full flow | Clear — saga state machine in one place |
| Best for | 3-4 steps, independent teams | 5+ steps, complex compensation logic |
| Our pick | Simple flows, early-stage systems | Order processing, payments, anything with money |
Orchestrated Saga Example (Order Flow)
Order Saga — Happy Path:
OrderSaga ──▶ InventoryService: ReserveItems
◀── ItemsReserved
──▶ PaymentService: ChargeCard
◀── PaymentSucceeded
──▶ ShippingService: CreateShipment
◀── ShipmentCreated
──▶ OrderService: ConfirmOrder
Order Saga — Payment Failed (compensating actions):
OrderSaga ──▶ InventoryService: ReserveItems
◀── ItemsReserved
──▶ PaymentService: ChargeCard
◀── PaymentFailed
──▶ InventoryService: ReleaseItems (compensate)
◀── ItemsReleased
──▶ OrderService: RejectOrder
// Saga orchestrator (simplified TypeScript)
class OrderSaga {
private steps = [
{
action: (ctx) => this.inventory.reserve(ctx.items),
compensate: (ctx) => this.inventory.release(ctx.items)
},
{
action: (ctx) => this.payment.charge(ctx.total, ctx.paymentMethod),
compensate: (ctx) => this.payment.refund(ctx.paymentId)
},
{
action: (ctx) => this.shipping.create(ctx.orderId, ctx.address),
compensate: (ctx) => this.shipping.cancel(ctx.shipmentId)
}
];
async execute(ctx: OrderContext) {
const completed: number[] = [];
for (let i = 0; i < this.steps.length; i++) {
try {
await this.steps[i].action(ctx);
completed.push(i);
} catch (error) {
// Compensate in reverse order
for (const j of completed.reverse()) {
await this.steps[j].compensate(ctx);
}
throw new SagaFailedError(i, error);
}
}
}
}
7. Practical Implementation with Kafka
Here's a real implementation pattern we use for order processing. This handles idempotency, dead letter queues, and schema evolution — the three things most tutorials skip.
Producer: Publishing Events with Outbox Pattern
Never publish events directly to Kafka from your application. If the DB write succeeds but the Kafka publish fails, you've got an inconsistent state. Use the outbox pattern: write the event to a database table in the same transaction as the state change, then a separate process publishes from the outbox to Kafka.
-- Outbox table (PostgreSQL)
CREATE TABLE event_outbox (
id BIGSERIAL PRIMARY KEY,
aggregate_type VARCHAR(100) NOT NULL,
aggregate_id VARCHAR(100) NOT NULL,
event_type VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
published_at TIMESTAMPTZ NULL
);
-- In the same transaction as the order insert:
BEGIN;
INSERT INTO orders (id, customer_id, total, status)
VALUES ('ord_8x92k', 'cust_4fn2', 89.97, 'placed');
INSERT INTO event_outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('Order', 'ord_8x92k', 'order.placed', '{
"orderId": "ord_8x92k",
"customerId": "cust_4fn2",
"total": 89.97,
"items": [{"sku": "WIDGET-01", "qty": 3}]
}'::jsonb);
COMMIT;
Consumer: Idempotent Processing
Consumers WILL receive duplicate events. Kafka guarantees at-least-once delivery, not exactly-once. Your consumer must be idempotent — processing the same event twice should produce the same result.
// Idempotent consumer (Node.js + KafkaJS)
import { Kafka } from 'kafkajs';
const kafka = new Kafka({ brokers: ['kafka:9092'] });
const consumer = kafka.consumer({ groupId: 'email-service' });
await consumer.subscribe({ topic: 'order-events', fromBeginning: false });
await consumer.run({
eachMessage: async ({ topic, partition, message }) => {
const event = JSON.parse(message.value.toString());
const eventId = message.headers['event-id']?.toString();
// Idempotency check — skip if already processed
const existing = await db.query(
'SELECT 1 FROM processed_events WHERE event_id = $1', [eventId]
);
if (existing.rows.length > 0) return; // duplicate, skip
try {
await db.query('BEGIN');
// Process the event
if (event.type === 'order.placed') {
await sendOrderConfirmationEmail(event.data);
}
// Record as processed (same transaction)
await db.query(
'INSERT INTO processed_events (event_id, processed_at) VALUES ($1, NOW())',
[eventId]
);
await db.query('COMMIT');
} catch (error) {
await db.query('ROLLBACK');
// Don't commit offset — message will be retried
throw error;
}
}
});
Dead Letter Queue (DLQ)
After N retries, move failed events to a DLQ for manual inspection. Don't let a single bad event block the entire partition.
// DLQ wrapper
async function processWithDLQ(event, handler, maxRetries = 3) {
const retryCount = parseInt(event.headers['retry-count'] || '0');
try {
await handler(event);
} catch (error) {
if (retryCount >= maxRetries) {
// Send to dead letter topic
await producer.send({
topic: `${event.topic}.dlq`,
messages: [{
key: event.key,
value: event.value,
headers: {
...event.headers,
'error-message': error.message,
'failed-at': new Date().toISOString()
}
}]
});
console.error(`Event sent to DLQ after ${maxRetries} retries`, event);
} else {
// Re-publish with incremented retry count
await producer.send({
topic: event.topic,
messages: [{
key: event.key,
value: event.value,
headers: { ...event.headers, 'retry-count': String(retryCount + 1) }
}]
});
}
}
}
8. Pitfalls That Sink EDA Projects
We've built event-driven systems for multiple clients. Here are the mistakes that hurt most:
| Pitfall | What Happens | Prevention |
|---|---|---|
| No schema registry | Producer changes event format, consumers break silently | Use Confluent Schema Registry or AWS Glue Schema Registry |
| No correlation IDs | Can't trace a request across 8 services | Generate correlation ID at entry point, propagate in every event header |
| Ordering assumptions | "OrderPaid" arrives before "OrderPlaced" | Use partition keys (order ID) for ordering guarantees within a partition |
| No dead letter queue | Poison message blocks the partition forever | DLQ after 3-5 retries with exponential backoff |
| Event explosion | 100+ event types, nobody knows what triggers what | Event catalog + AsyncAPI spec. Review new events like you review API changes. |
| Testing in isolation | Unit tests pass, integration fails at 2am | Contract tests (Pact) + end-to-end event flow tests with Testcontainers |
| No backpressure | Fast producer overwhelms slow consumer, lag grows unbounded | Monitor consumer lag, auto-scale consumer groups, set rate limits on producers |
9. When NOT to Use Event-Driven Architecture
EDA adds complexity. It's justified when you get decoupling, scalability, and resilience in return. It's not justified when a simple REST call would do the job.
- Small team, single service — If 3 developers maintain one monolith, events between internal modules add complexity with no decoupling benefit. Use function calls.
- Strong consistency required — Banking transfers between accounts within the same bank should be ACID transactions, not eventual consistency. Don't fight the requirement.
- Simple CRUD apps — A content management system that creates, reads, updates, and deletes articles doesn't need Kafka. PostgreSQL triggers or simple webhooks work fine.
- Real-time request-response — User clicks "checkout" and needs immediate confirmation? That's a synchronous flow. You can publish events after the synchronous confirmation, but the primary path should be direct.
- Team doesn't understand eventual consistency — If your team has never debugged a distributed system, introducing EDA under deadline pressure will end badly. Train first, adopt gradually.
10. Frequently Asked Questions
What's the difference between event-driven architecture and message-driven architecture?
Event-driven systems publish facts about what happened ("OrderPlaced") — producers don't care who consumes them. Message-driven systems send commands to specific recipients ("ProcessPayment") — the sender expects a particular receiver to act. Most real systems use both: events for notifications, messages for commands. Kafka is event-oriented, RabbitMQ is message-oriented, though both can do either.
How do I handle event schema evolution without breaking consumers?
Use a schema registry (Confluent or AWS Glue) that enforces backward compatibility. Add new fields as optional — never remove or rename existing ones. Version your event types (order.placed.v2) and run old and new consumers in parallel during migration. We use Avro schemas for strict contracts and JSON Schema for lighter use cases.
Is Kafka overkill for a startup?
Probably, yes. If you're processing under 10,000 events per second, Redis Streams, Amazon SQS, or even PostgreSQL LISTEN/NOTIFY can handle it. Kafka shines at massive scale, stream processing, and event replay — features most startups don't need in year one. Start simple, migrate when you hit actual limits. We've seen too many seed-stage companies spend months on Kafka infrastructure instead of shipping features.
How do I debug issues in an event-driven system?
Three essentials: correlation IDs in every event header (trace a request across all services), centralized logging with structured JSON (we use ELK stack), and distributed tracing (OpenTelemetry with Jaeger). Monitor consumer lag religiously — if a consumer falls behind, you'll get stale data and complaints before you notice. Kafka Manager or Confluent Control Center make lag visible.
Can I use event-driven architecture with a monolith?
Absolutely — and this is often the smartest starting point. Use an in-process event bus (like MediatR in .NET or a simple EventEmitter in Node.js) within your monolith. Modules publish and subscribe to events without HTTP calls. When you eventually extract a module into its own service, you replace the in-process bus with Kafka or RabbitMQ. The event contracts stay the same. We call this "monolith with event seams."
We build distributed systems for clients ranging from fintech startups to enterprise logistics platforms. Our event-driven implementations process millions of events daily across Kafka, RabbitMQ, and AWS-native stacks. This guide reflects patterns we've validated in production — not theoretical ideals.