Distributed Systems Fundamentals | Pillai Infotech LLP

Distributed Systems Fundamentals Every Developer Should Know

Your single-server app works perfectly. Then you add a second server and everything breaks. Distributed systems are hard because the network is unreliable, clocks drift, and processes crash — this guide teaches you to design for that reality.

💻 Architecture January 17, 2026 14 min read

In This Guide

1. The Eight Fallacies of Distributed Computing
2. CAP Theorem — The Impossible Triangle
3. Consistency Models
4. Replication Strategies
5. Partitioning and Sharding
6. Consensus — Getting Nodes to Agree
7. Failure Detection and Recovery
8. Time and Ordering in Distributed Systems
9. Frequently Asked Questions

A distributed system is any system where components on different networked computers communicate and coordinate to achieve a common goal. The moment you add a second database server, a cache layer, or a message queue — you're running a distributed system. And distributed systems fail in ways that single-server systems never do. We've debugged production incidents where two database replicas disagreed on the latest order total, where a "successful" API call never reached the payment processor, and where a clock skew of 30 seconds caused transactions to replay. This guide covers the fundamentals that help you anticipate and handle these failures.

1. The Eight Fallacies of Distributed Computing

L. Peter Deutsch and colleagues at Sun Microsystems compiled the first seven around 1994, and James Gosling later added the eighth. They're still behind many distributed system bugs 30 years later.

Fallacy	The Assumption	Reality
1	The network is reliable	Packets get dropped, connections reset, switches fail. Always implement retries and timeouts.
2	Latency is zero	Cross-region: 50-200ms. Even same-datacenter: 0.5-2ms. Every network call adds up.
3	Bandwidth is infinite	Links have finite capacity. Streaming 1 GB/s (8 Gbit/s) between services exceeds a 1 Gbit/s link and nearly fills a 10 Gbit/s one.
4	The network is secure	Internal traffic can be intercepted. Use mTLS between services.
5	Topology doesn't change	Services move, IPs change, containers reschedule. Use service discovery.
6	There is one administrator	Multiple teams, multiple clouds, multiple trust domains.
7	Transport cost is zero	Serialization, deserialization, and data transfer cost CPU and bandwidth.
8	The network is homogeneous	Different OS, languages, protocols, and versions across services.
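
The advice in row 1 (retries and timeouts) can be sketched as follows. This is a minimal illustration, not a production client: `TransientNetworkError`, `call_with_retries`, and `make_flaky` are hypothetical names introduced here, and the sketch assumes the wrapped operation enforces its own timeout and is safe to repeat (idempotent).

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error standing in for a dropped packet or reset connection."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(); retry on TransientNetworkError with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


def make_flaky(failures_before_success):
    """Build a fake remote call that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientNetworkError("simulated dropped connection")
        return "ok"

    return operation, state
```

Retrying only helps when the operation is idempotent. Retrying a non-idempotent call, such as a payment, can apply it twice if the first attempt succeeded but its response was lost.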

2. CAP Theorem — The Impossible Triangle

The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency (every read returns the most recent write), Availability (every request gets a response), and Partition tolerance (the system works despite network failures between nodes).

Since network partitions are inevitable in any distributed system, the real choice is between CP (sacrifice availability during partitions) and AP (sacrifice consistency during partitions).

Choice	During Partition	Use Case	Example
CP	Returns error rather than stale data	Financial transactions, inventory counts	PostgreSQL, MongoDB (default), etcd, ZooKeeper
AP	Returns potentially stale data but stays available	Social feeds, shopping carts, DNS	Cassandra, DynamoDB, CouchDB, DNS

The nuance most articles miss: CAP isn't an all-or-nothing choice. Different parts of your system can make different trade-offs. Your payment processing can be CP (never process a payment twice) while your product catalog is AP (showing a slightly stale price for 5 seconds is fine). We design systems with per-feature consistency requirements, not a single global choice.

3. Consistency Models

Model	Guarantee	Performance	Used By
Strong (Linearizable)	Every read sees the latest write, globally	Slowest (needs coordination)	etcd, Spanner, CockroachDB
Sequential	All nodes see operations in the same order	Moderate	ZooKeeper
Causal	Causally related operations are ordered; concurrent operations may differ	Good	MongoDB (causal sessions)
Eventual	All replicas converge eventually (seconds to minutes)	Fastest	DynamoDB (default reads), Cassandra, DNS

Eventual consistency is not "broken consistency." It's a deliberate trade-off: you accept that reads might be slightly stale in exchange for much higher availability and performance. Amazon's shopping cart uses eventual consistency — if two tabs add items simultaneously, both items eventually appear. The alternative (strong consistency) would mean slower add-to-cart operations across global regions.
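
The two-tabs scenario can be modeled with a toy example. This is a sketch under a strong simplifying assumption: the cart is an add-only set that merges by union. It is not a description of Amazon's actual implementation, and `CartReplica` is a hypothetical name.

```python
class CartReplica:
    """One replica of a shopping cart, modeled as a grow-only set of item IDs."""

    def __init__(self):
        self.items = set()

    def add(self, item_id):
        # Local write: accepted immediately, with no coordination between replicas.
        self.items.add(item_id)

    def merge(self, other):
        # Set union is commutative, associative, and idempotent, so replicas
        # converge regardless of the order or number of times they sync.
        self.items |= other.items


tab_one = CartReplica()
tab_two = CartReplica()
tab_one.add("book")
tab_two.add("headphones")

# Before syncing, each replica is stale with respect to the other.
stale_view = set(tab_one.items)

# After exchanging state in both directions, both replicas hold both items.
tab_one.merge(tab_two)
tab_two.merge(tab_one)
```

Union only works because items are never removed in this model. Supporting removal needs extra bookkeeping (for example, tombstones), which is where most of the real complexity lives.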

4. Replication Strategies

Replication copies data across multiple nodes for fault tolerance and read scaling.

Strategy	How It Works	Trade-off
Single-leader	One node accepts writes, replicates to followers	Simple but leader is bottleneck and SPOF. PostgreSQL, MySQL default.
Multi-leader	Multiple nodes accept writes, sync with each other	Write conflicts need resolution. Good for multi-region. Galera Cluster, CouchDB.
Leaderless	Any node accepts reads/writes, quorum for consistency	Highly available, complex conflict resolution. Cassandra, Riak, Amazon's original Dynamo design.

        
Quorum reads/writes (leaderless replication):

  N = 3 replicas
  W = 2 (write succeeds when 2 of 3 nodes acknowledge)
  R = 2 (read queries 2 of 3 nodes, takes newest)
  W + R > N → 2 + 2 > 3 → every read quorum includes at least one node that has the latest acknowledged write

Write "balance = 500":

  Node A: balance = 500  (ack)
  Node B: balance = 500  (ack) ← write succeeds (W=2 met)
  Node C: balance = 300  (not yet replicated)

Read (queries 2 nodes):

  Node A: balance = 500  (timestamp: T2)
  Node C: balance = 300  (timestamp: T1)
  Result: 500 (latest timestamp wins)
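
The quorum walkthrough above can be simulated in a few lines. This is a hedged sketch: `Replica`, `quorum_write`, and `quorum_read` are hypothetical names, and it compares integer version numbers rather than wall-clock timestamps, on the assumption that versions are assigned by some reliable counter.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: int
    version: int  # hypothetical monotonically increasing write version


class Replica:
    def __init__(self, value, version):
        self.stored = Versioned(value, version)

    def write(self, value, version):
        # Keep only the newest version this replica has seen.
        if version > self.stored.version:
            self.stored = Versioned(value, version)

    def read(self):
        return self.stored


def quorum_write(replicas, reachable, value, version, w):
    """Write to the reachable replicas; report success only if at least w acknowledge."""
    acks = 0
    for name in reachable:
        replicas[name].write(value, version)
        acks += 1
    return acks >= w


def quorum_read(replicas, contacted, r):
    """Read from r replicas and return the value with the highest version."""
    if len(contacted) < r:
        raise RuntimeError("not enough replicas reachable for a read quorum")
    responses = [replicas[name].read() for name in contacted[:r]]
    return max(responses, key=lambda v: v.version).value


# All three replicas start at balance = 300 (version 1).
replicas = {name: Replica(300, 1) for name in "ABC"}

# The write reaches A and B only; C is lagging.
write_ok = quorum_write(replicas, ["A", "B"], 500, 2, w=2)

# A read that contacts A and C still sees 500, because A overlaps the write quorum.
latest = quorum_read(replicas, ["A", "C"], r=2)
```

Any two of the three replicas include A or B, so every read quorum of size 2 sees version 2. The sketch ignores failure handling: a write that reaches fewer than W nodes is left partially applied.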

5. Partitioning and Sharding

When data outgrows a single node, you split it across multiple nodes. Each partition (shard) holds a subset of the data. The challenge: choosing a partition key that distributes data evenly and supports your query patterns.

        
Consistent hashing, a widely used partitioning approach:

  Ring: 0 ─────── 2^32

  Nodes placed on ring by hash(node_id):
    Node A at position 1000
    Node B at position 4000
    Node C at position 7000

  Key "user_123" hashes to position 2500
  → Walk clockwise → lands on Node B

When Node B fails:

  Only keys between A(1000) and B(4000) redistribute to C
  Nodes A and C keep their existing keys
  (Minimal redistribution vs. hash(key) % N which reshuffles everything)
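
A minimal ring lookup might look like the sketch below. `HashRing` and `default_hash` are hypothetical names; the demo injects fixed positions to mirror the example above, while `default_hash` shows one possible real hash (MD5, chosen here only for even spread). The sketch assumes no two nodes hash to the same position.

```python
import bisect
import hashlib


def default_hash(key):
    """Map a string to a position on a 2**32 ring using MD5 (for spread, not security)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, hash_fn=default_hash):
        self._hash = hash_fn
        self._positions = []  # sorted ring positions
        self._owners = {}     # position -> node name

    def add_node(self, node):
        pos = self._hash(node)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node

    def remove_node(self, node):
        pos = self._hash(node)
        self._positions.remove(pos)
        del self._owners[pos]

    def lookup(self, key):
        """Return the first node at or clockwise after the key's position."""
        if not self._positions:
            raise LookupError("ring has no nodes")
        idx = bisect.bisect_left(self._positions, self._hash(key))
        # Wrap around to the first node when the key sits past the last node.
        return self._owners[self._positions[idx % len(self._positions)]]


# Fixed positions mirroring the example; the extra keys are made up for the demo.
fixed = {"A": 1000, "B": 4000, "C": 7000,
         "user_123": 2500, "user_456": 5500, "user_789": 9000}
ring = HashRing(hash_fn=fixed.__getitem__)
for node in "ABC":
    ring.add_node(node)

keys = ("user_123", "user_456", "user_789")
before = {k: ring.lookup(k) for k in keys}
ring.remove_node("B")
after = {k: ring.lookup(k) for k in keys}
```

Only `user_123`, the key that B owned, changes owner when B is removed. Production systems usually place many virtual nodes per physical node on the ring so that load spreads more evenly.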

For more on database sharding strategies, see our System Design Guide.

6. Consensus — Getting Nodes to Agree

Consensus algorithms let distributed nodes agree on a value even when some nodes fail. This is fundamental for leader election, distributed locks, and configuration management.

Algorithm	Used By	Tolerance	Key Idea
Raft	etcd, Consul, CockroachDB	Crash failures (not Byzantine)	Leader-based. Designed to be understandable. Terms + log replication.
Paxos	Google Chubby, Spanner	Crash failures	Proposer/acceptor/learner roles. Theoretically elegant, hard to implement.
PBFT	Permissioned blockchains, some databases	Byzantine faults (malicious nodes)	Tolerates f malicious nodes out of N ≥ 3f + 1 (fewer than one third). High message overhead.
ZAB	ZooKeeper	Crash failures	Optimized for primary-backup ordering. Total order broadcast.

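You'll rarely implement consensus yourself. Instead, you'll use systems built on it: etcd for distributed configuration and locks, ZooKeeper for leader election. (Redis Redlock is often used for distributed locks, but it relies on timing assumptions rather than a consensus protocol.) The key insight: consensus requires a majority of nodes to be available (3 nodes tolerate 1 failure, 5 tolerate 2). Run an odd number of nodes: a fourth node raises the required majority from 2 to 3, so a 4-node cluster still tolerates only 1 failure.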
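
The majority arithmetic is easy to check directly. The function names below are hypothetical, and the sketch covers crash-fault majority quorums only (not Byzantine protocols such as PBFT).

```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """How many nodes can crash while a majority is still available."""
    return nodes - quorum_size(nodes)


# Cluster size -> (majority needed, failures tolerated)
table = {n: (quorum_size(n), tolerated_failures(n)) for n in range(3, 8)}
```

Here `table` maps 3 to (2, 1), 4 to (3, 1), 5 to (3, 2), 6 to (4, 2), and 7 to (4, 3). Each even size tolerates no more failures than the odd size just below it, while adding one more node that must be coordinated.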

7. Failure Detection and Recovery

In a distributed system, failure detection is fundamentally uncertain. If a node doesn't respond, it could be crashed, slow, or the network between you could be partitioned. You can't tell the difference.

Pattern	How It Works	Use Case
Heartbeat	Nodes send periodic "I'm alive" messages	Leader election (miss 3 heartbeats = assume dead)
Phi Accrual	Probabilistic — estimates failure likelihood based on heartbeat history	Cassandra, Akka. Adaptive to network conditions.
Gossip protocol	Nodes randomly share state with peers; failures propagate like rumors	Membership discovery in large clusters (100+ nodes)
Circuit breaker	After repeated failures, stop calling the service for a cooldown period; fail fast or serve a fallback (such as a cached response)	Service-to-service calls. Prevent cascading failures.
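
As one illustration of the last row, a circuit breaker can be sketched as a small state machine. `CircuitBreaker` and `CircuitOpenError` are hypothetical names, the thresholds are arbitrary, and the sketch is single-threaded (no locking). The injectable `clock` exists only to make the example testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; downstream presumed unhealthy")
            # Cooldown elapsed: half-open, so let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Demo with a fake clock so no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing_call():
    raise ConnectionError("downstream unavailable")


for _ in range(2):
    try:
        breaker.call(failing_call)
    except ConnectionError:
        pass

try:
    breaker.call(lambda: "never runs")
    rejected = False
except CircuitOpenError:
    rejected = True  # the breaker failed fast without calling downstream

now[0] = 11.0  # cooldown elapsed
recovered = breaker.call(lambda: "ok")
```

A caller would typically catch `CircuitOpenError` and serve a fallback, such as a cached response, instead of propagating the error.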

8. Time and Ordering in Distributed Systems

Physical clocks on different machines drift. You can't rely on timestamps to determine which event happened first across machines. A clock skew of even 100ms can cause "impossible" bugs: an order appears to be paid before it was placed.

Approach	How It Works	Used By / Notes
NTP (Network Time Protocol)	Syncs clocks across network. Accuracy: 1-50ms typically.	Best effort. Good enough for logging, not for ordering.
Lamport timestamps	Logical counter incremented on each event and advanced past any timestamp received in a message. Respects causal order: if A happened before B, A's timestamp is smaller.	Lightweight ordering. Can't tell concurrent events from ordered ones.
Vector clocks	Each node maintains a counter per node. Detects concurrent events.	Amazon's original Dynamo design, Riak. Conflict detection.
TrueTime (Google)	GPS + atomic clocks. Returns [earliest, latest] time interval.	Google Spanner. Enables global strong consistency.
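
The two logical-clock rows can be made concrete with a short sketch. All names here (`LamportClock`, `vc_increment`, `vc_merge`, `vc_compare`) are hypothetical, and vector clocks are represented as plain dictionaries from node name to counter.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sender_time) + 1
        return self.time


def vc_increment(clock, node):
    """Return a copy of a vector clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(a, b):
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def vc_compare(a, b):
    """Classify two vector clocks as 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Lamport: the receiver's timestamp always ends up larger than the sender's.
sender, receiver = LamportClock(), LamportClock()
sent_at = sender.tick()
received_at = receiver.receive(sent_at)

# Vector clocks: two nodes write independently, with no messages exchanged.
write_x = vc_increment({}, "node_x")
write_y = vc_increment({}, "node_y")
relation = vc_compare(write_x, write_y)

# node_y then receives x's write and records a new event that has seen both.
seen_both = vc_increment(vc_merge(write_y, write_x), "node_y")
```

The independent writes compare as "concurrent", which is the signal a datastore uses to flag a conflict. A Lamport clock would give both writes timestamp 1 and could not distinguish that case from an ordered one.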

Practical advice: Use NTP for logging and human-readable timestamps. Use monotonic clocks (not wall clocks) for measuring durations. For cross-service ordering, use Kafka partition ordering (events on the same partition key are totally ordered) or database sequence numbers. Don't build your own distributed ordering system unless you really, truly need it.
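
For the duration advice, Python's standard library exposes a monotonic clock as `time.monotonic()`. The `timed` helper below is a hypothetical wrapper written for this illustration.

```python
import time


def timed(operation):
    """Run operation() and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = operation()
    elapsed = time.monotonic() - start
    return result, elapsed


result, elapsed = timed(lambda: sum(range(100_000)))
```

`time.monotonic()` never goes backwards, so `elapsed` cannot be negative. A duration computed from `time.time()` can be, if NTP or an operator steps the wall clock during the measurement.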

9. Frequently Asked Questions

Do I need to understand distributed systems if I'm building a simple web app?

If your app uses a separate database server, a Redis cache, or a message queue — yes, you're running a distributed system. Understanding failure modes (network partitions, replication lag, split brain) helps you avoid bugs that only appear in production under load. You don't need to implement Raft, but you should understand why your read replica might return stale data.

What's the difference between replication and sharding?

Replication copies the same data to multiple nodes for fault tolerance and read scaling. Sharding splits different data across different nodes for write scaling and storage capacity. Most production systems use both: data is sharded across nodes, and each shard is replicated for redundancy. Think of replication as backup copies and sharding as dividing the workload.

How do I handle distributed transactions across microservices?

Avoid two-phase commit (2PC) across microservices: it adds latency, and participants that have voted to commit stay blocked until a failed coordinator recovers. Use the Saga pattern instead: each service executes its local transaction and publishes an event. If a step fails, compensating transactions undo previous steps. Our Event-Driven Architecture guide covers saga choreography and orchestration in detail.
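
The compensation logic can be sketched in a single process. This is an illustration only: `run_saga` and `SagaFailed` are hypothetical names, the order, payment, and shipping steps are made up, and a real saga would span services that communicate through events or an orchestrator rather than local function calls.

```python
class SagaFailed(Exception):
    """Raised after a step fails and earlier steps have been compensated."""


def run_saga(steps):
    """Run (action, compensation) pairs in order; on failure, undo completed steps in reverse."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for undo in reversed(completed):
                undo()  # in a real system, compensations must be idempotent and retried
            raise SagaFailed(str(exc)) from exc
        completed.append(compensation)


log = []


def fail_shipping():
    raise RuntimeError("no courier available")


steps = [
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("payment charged"), lambda: log.append("payment refunded")),
    (fail_shipping, lambda: log.append("shipment cancelled")),
]

try:
    run_saga(steps)
    saga_failed = False
except SagaFailed:
    saga_failed = True
```

After the shipping step fails, `log` shows the payment refunded and then the order cancelled, in reverse order of execution. Unlike a database rollback, the intermediate states were visible to other services while the saga ran.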

What books should I read to learn distributed systems?

Start with "Designing Data-Intensive Applications" by Martin Kleppmann — it's the single best resource, covering everything from replication to consensus to stream processing. Follow with "Understanding Distributed Systems" by Roberto Vitillo for a more practical, shorter introduction. For deep theory, "Distributed Systems" by Maarten van Steen and Andrew Tanenbaum is the academic standard.

Pillai Infotech Engineering Team

We've debugged distributed system failures across PostgreSQL replication lag, Kafka consumer group rebalances, and Redis cluster split-brain scenarios. These fundamentals aren't academic for us — they're the concepts we apply daily when designing systems that need to stay up.
