In This Guide
- 1. The Eight Fallacies of Distributed Computing
- 2. CAP Theorem — The Impossible Triangle
- 3. Consistency Models
- 4. Replication Strategies
- 5. Partitioning and Sharding
- 6. Consensus — Getting Nodes to Agree
- 7. Failure Detection and Recovery
- 8. Time and Ordering in Distributed Systems
- 9. Frequently Asked Questions
A distributed system is any system where components on different networked computers communicate and coordinate to achieve a common goal. The moment you add a second database server, a cache layer, or a message queue — you're running a distributed system. And distributed systems fail in ways that single-server systems never do. We've debugged production incidents where two database replicas disagreed on the latest order total, where a "successful" API call never reached the payment processor, and where a clock skew of 30 seconds caused transactions to replay. This guide covers the fundamentals that help you anticipate and handle these failures.
1. The Eight Fallacies of Distributed Computing
L Peter Deutsch and colleagues at Sun Microsystems formulated the first seven in 1994, and James Gosling added the eighth a few years later. They're still behind many distributed system bugs 30 years on.
| # | The Assumption | Reality |
|---|---|---|
| 1 | The network is reliable | Packets get dropped, connections reset, switches fail. Always implement retries and timeouts. |
| 2 | Latency is zero | Cross-region: 50-200ms. Even same-datacenter: 0.5-2ms. Every network call adds up. |
| 3 | Bandwidth is infinite | Streaming 1 GB/s between services is 8 Gbit/s, enough to nearly fill a 10 Gbit/s link on its own. |
| 4 | The network is secure | Internal traffic can be intercepted. Use mTLS between services. |
| 5 | Topology doesn't change | Services move, IPs change, containers reschedule. Use service discovery. |
| 6 | There is one administrator | Multiple teams, multiple clouds, multiple trust domains. |
| 7 | Transport cost is zero | Serialization, deserialization, and data transfer cost CPU and bandwidth. |
| 8 | The network is homogeneous | Different OS, languages, protocols, and versions across services. |
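The advice for fallacy 1 (retries and timeouts) can be sketched as follows. This is a minimal illustration, not a production client: `TransientNetworkError`, `call_with_retries`, and `flaky_call` are hypothetical names, and the per-call timeout is assumed to live inside the `operation` you pass in.

```python
import random
import time


class TransientNetworkError(Exception):
    """Stand-in for a timeout, reset connection, or dropped packet."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))


# Usage: a hypothetical flaky call that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientNetworkError("simulated dropped packet")
    return "ok"


# The sleep function is stubbed out so the example runs instantly.
result = call_with_retries(flaky_call, sleep=lambda seconds: None)
print(result, attempts["count"])  # ok 3
```

Retrying is only safe when the operation is idempotent. A retried "charge card" call that actually succeeded the first time is exactly the kind of bug the fallacies describe.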
2. CAP Theorem — The Impossible Triangle
The CAP theorem states that a distributed system can provide at most two of three guarantees: Consistency (every read returns the most recent write), Availability (every request gets a response), and Partition tolerance (the system works despite network failures between nodes).
Since network partitions are inevitable in any distributed system, the real choice is between CP (sacrifice availability during partitions) and AP (sacrifice consistency during partitions).
| Choice | During Partition | Use Case | Example |
|---|---|---|---|
| CP | Returns error rather than stale data | Financial transactions, inventory counts | PostgreSQL, MongoDB (default), etcd, ZooKeeper |
| AP | Returns potentially stale data but stays available | Social feeds, shopping carts, DNS | Cassandra, DynamoDB, CouchDB, DNS |
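The CP/AP difference during a partition can be shown with a toy model. This sketch assumes a hypothetical `Replica` class that only tracks how many peers it can reach; real systems decide this through their replication protocol, not a counter.

```python
class PartitionedError(Exception):
    """Raised by a CP replica that cannot reach a majority of the cluster."""


class Replica:
    def __init__(self, mode, cluster_size):
        self.mode = mode                      # "CP" or "AP"
        self.cluster_size = cluster_size
        self.reachable_peers = cluster_size - 1
        self.value = None

    def has_majority(self):
        # Count this node plus the peers it can still reach.
        return (1 + self.reachable_peers) > self.cluster_size // 2

    def read(self):
        if self.mode == "CP" and not self.has_majority():
            # CP: refuse rather than risk returning stale data.
            raise PartitionedError("cannot confirm latest value")
        # AP (or healthy CP): answer from local state, possibly stale.
        return self.value


cp = Replica("CP", cluster_size=3)
ap = Replica("AP", cluster_size=3)
cp.value = ap.value = "stock=5"

# Simulate a partition that isolates each replica from both of its peers.
cp.reachable_peers = ap.reachable_peers = 0

print(ap.read())  # stock=5 (available, but may be stale)
try:
    cp.read()
except PartitionedError as exc:
    print("CP replica refused:", exc)
```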
3. Consistency Models
| Model | Guarantee | Performance | Used By |
|---|---|---|---|
| Strong (Linearizable) | Every read sees the latest write, globally | Slowest (needs coordination) | etcd, Spanner, CockroachDB |
| Sequential | All nodes see operations in the same order | Moderate | ZooKeeper |
| Causal | Causally related operations are ordered; concurrent operations may differ | Good | MongoDB (causal sessions) |
| Eventual | All replicas converge eventually (seconds to minutes) | Fastest | DynamoDB (default reads), Cassandra (at low consistency levels), DNS |
Eventual consistency is not "broken consistency." It's a deliberate trade-off: you accept that reads might be slightly stale in exchange for much higher availability and performance. Amazon's shopping cart uses eventual consistency — if two tabs add items simultaneously, both items eventually appear. The alternative (strong consistency) would mean slower add-to-cart operations across global regions.
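The shopping cart behaviour described above can be sketched as a merge by set union. This is an illustrative add-only model with hypothetical names (`merge_carts`, `cart_eu`, `cart_us`), not Amazon's implementation; supporting removals would need extra bookkeeping such as tombstones.

```python
def merge_carts(replica_a, replica_b):
    """Merge two add-only cart replicas by set union.

    Union is commutative, associative, and idempotent, so replicas converge
    no matter in which order (or how often) they exchange state.
    """
    return replica_a | replica_b


# Two tabs talk to different replicas at the same moment.
cart_eu = {"book"}
cart_us = {"book"}
cart_eu.add("headphones")   # tab 1
cart_us.add("charger")      # tab 2

# Later, the replicas exchange state and each merges what it received.
merged_eu = merge_carts(cart_eu, cart_us)
merged_us = merge_carts(cart_us, cart_eu)

print(sorted(merged_eu))  # ['book', 'charger', 'headphones']
```

Until the exchange happens, a read from either replica shows only one of the new items. That window is the staleness you accept in return for fast local writes.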
4. Replication Strategies
Replication copies data across multiple nodes for fault tolerance and read scaling.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Single-leader | One node accepts writes, replicates to followers | Simple, but the leader is a write bottleneck and a single point of failure until a follower is promoted. PostgreSQL, MySQL default. |
| Multi-leader | Multiple nodes accept writes, sync with each other | Write conflicts need resolution. Good for multi-region. Galera Cluster, CouchDB. |
| Leaderless | Any node accepts reads/writes, quorum for consistency | Highly available, complex conflict resolution. Cassandra, Riak, Amazon's original Dynamo. |
Quorum reads/writes (leaderless replication):
N = 3 replicas
W = 2 (write succeeds when 2 of 3 nodes acknowledge)
R = 2 (read queries 2 of 3 nodes, takes newest)
W + R > N → 2 + 2 > 3 → every read set overlaps every write set, so at least one queried node holds the latest acknowledged write
Write "balance = 500":
  Node A: balance = 500 (ack)
  Node B: balance = 500 (ack) ← write succeeds (W=2 met)
  Node C: balance = 300 (not yet replicated)
Read (queries 2 nodes):
  Node A: balance = 500 (timestamp: T2)
  Node C: balance = 300 (timestamp: T1)
  Result: 500 (latest timestamp wins)
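The quorum walkthrough above can be reproduced in a few lines. This is a single-process sketch under simplifying assumptions: `Node`, `quorum_write`, and `quorum_read` are hypothetical helpers, an integer `version` stands in for the T1/T2 timestamps, and concurrent writers are not modelled.

```python
from dataclasses import dataclass


@dataclass
class Versioned:
    value: int
    version: int  # stand-in for the T1/T2 timestamps in the walkthrough


class Node:
    def __init__(self, name, value, version):
        self.name = name
        self.data = Versioned(value, version)
        self.up = True

    def write(self, value, version):
        if not self.up:
            return False  # no acknowledgement from an unreachable node
        if version > self.data.version:
            self.data = Versioned(value, version)
        return True


def quorum_write(nodes, value, version, w):
    """Send the write to every node; succeed once at least w acknowledge."""
    acks = sum(node.write(value, version) for node in nodes)
    return acks >= w


def quorum_read(nodes, r):
    """Query r reachable nodes and return the value with the newest version."""
    responses = [node.data for node in nodes if node.up][:r]
    if len(responses) < r:
        raise RuntimeError("read quorum not reached")
    return max(responses, key=lambda item: item.version).value


a, b, c = Node("A", 300, 1), Node("B", 300, 1), Node("C", 300, 1)

c.up = False  # Node C misses the write
ok = quorum_write([a, b, c], value=500, version=2, w=2)
c.up = True   # Node C is reachable again but still holds balance = 300

print(ok, quorum_read([a, c], r=2))  # True 500
```

Any two of the three nodes include at least one of A or B, which is why the read returns 500 even though Node C is stale.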
5. Partitioning and Sharding
When data outgrows a single node, you split it across multiple nodes. Each partition (shard) holds a subset of the data. The challenge: choosing a partition key that distributes data evenly and supports your query patterns.
Consistent hashing, a common partitioning approach:
Ring: 0 ─────── 2^32
Nodes placed on ring by hash(node_id):
  Node A at position 1000
  Node B at position 4000
  Node C at position 7000
Key "user_123" hashes to position 2500
  → Walk clockwise → lands on Node B
When Node B fails:
  Only keys between A(1000) and B(4000) redistribute to C
  Nodes A and C keep their existing keys
  (Minimal redistribution vs. hash(key) % N, which remaps most keys when N changes)
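The ring lookup above can be sketched with a sorted list and a binary search. This is a minimal illustration: `HashRing` and `ring_position` are hypothetical names, node positions are passed in explicitly so the numbers match the walkthrough, and production rings usually add many virtual nodes per server to even out load.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32


def ring_position(key):
    """Map a key to a position in [0, 2**32) using the first 4 digest bytes."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, node_positions):
        # node_positions: {node_name: position on the ring}
        self._entries = sorted((pos, name) for name, pos in node_positions.items())

    def remove(self, name):
        self._entries = [(p, n) for p, n in self._entries if n != name]

    def owner_of_position(self, position):
        """Walk clockwise: first node at or after position, wrapping to the start."""
        positions = [p for p, _ in self._entries]
        index = bisect.bisect_left(positions, position)
        return self._entries[index % len(self._entries)][1]

    def owner(self, key):
        return self.owner_of_position(ring_position(key))


ring = HashRing({"A": 1000, "B": 4000, "C": 7000})
sample_positions = (500, 2500, 5000, 8000)

before = {pos: ring.owner_of_position(pos) for pos in sample_positions}
ring.remove("B")  # Node B fails
after = {pos: ring.owner_of_position(pos) for pos in sample_positions}

moved = [pos for pos in sample_positions if before[pos] != after[pos]]
print(before[2500], after[2500], moved)  # B C [2500]
```

Only the position that Node B owned changes hands. Positions 500, 5000, and 8000 stay where they were.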
For more on database sharding strategies, see our System Design Guide.
6. Consensus — Getting Nodes to Agree
Consensus algorithms let distributed nodes agree on a value even when some nodes fail. This is fundamental for leader election, distributed locks, and configuration management.
| Algorithm | Used By | Tolerance | Key Idea |
|---|---|---|---|
| Raft | etcd, Consul, CockroachDB | Crash failures (not Byzantine) | Leader-based. Designed to be understandable. Terms + log replication. |
| Paxos | Google Chubby, Spanner | Crash failures | Proposer/acceptor/learner roles. Theoretically elegant, hard to implement. |
| PBFT | Permissioned blockchains, some databases | Byzantine faults (malicious nodes) | Tolerates f malicious nodes out of N ≥ 3f + 1, i.e. fewer than one third. High message overhead. |
| ZAB | ZooKeeper | Crash failures | Optimized for primary-backup ordering. Total order broadcast. |
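You'll rarely implement consensus yourself. Instead, you'll use systems built on it: etcd for distributed configuration and locks, ZooKeeper for leader election, Consul for service discovery. (Redis Redlock is a popular lock recipe, but it is not built on a consensus protocol, and its safety under clock drift and process pauses is disputed.) The key insight: consensus requires a majority of nodes to be available (3 nodes tolerate 1 failure, 5 tolerate 2). Prefer an odd number of nodes: a fourth node raises the majority to 3 without tolerating any more failures than three nodes do.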
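The majority arithmetic behind that rule is short enough to check directly. The helper names `quorum_size` and `tolerated_failures` are illustrative, not part of any library.

```python
def quorum_size(cluster_size):
    """Smallest group that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes can be lost while a majority remains."""
    return cluster_size - quorum_size(cluster_size)


for n in range(3, 8):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
# 7 4 3
```

Going from 3 to 4 nodes, or from 5 to 6, adds a machine that can fail without adding any tolerance for failure.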
7. Failure Detection and Recovery
In a distributed system, failure detection is fundamentally uncertain. If a node doesn't respond, it could be crashed, slow, or the network between you could be partitioned. You can't tell the difference.
| Pattern | How It Works | Use Case |
|---|---|---|
| Heartbeat | Nodes send periodic "I'm alive" messages | Leader election (miss 3 heartbeats = assume dead) |
| Phi Accrual | Probabilistic — estimates failure likelihood based on heartbeat history | Cassandra, Akka. Adaptive to network conditions. |
| Gossip protocol | Nodes randomly share state with peers; failures propagate like rumors | Membership discovery in large clusters (100+ nodes) |
| Circuit breaker | Stop calling a failing service for a cool-down period; fail fast with an error or a cached fallback | Service-to-service calls. Prevent cascading failures. |
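The circuit breaker row can be sketched as a small state machine. This is a simplified illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`): it counts consecutive failures, opens after a threshold, and lets one trial call through once the reset timeout has passed. It is not thread-safe, and the clock is injectable only so the example runs without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a downstream service that is marked unhealthy."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; downstream marked unhealthy")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock for the demo
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("downstream unavailable")


outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except CircuitOpenError:
        outcomes.append("rejected")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # move past the reset timeout
outcomes.append(breaker.call(lambda: "ok"))
print(outcomes)  # ['failed', 'failed', 'rejected', 'ok']
```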
8. Time and Ordering in Distributed Systems
Physical clocks on different machines drift. You can't rely on timestamps to determine which event happened first across machines. A clock skew of even 100ms can cause "impossible" bugs: an order appears to be paid before it was placed.
| Approach | How It Works | Used By / Notes |
|---|---|---|
| NTP (Network Time Protocol) | Syncs clocks across network. Accuracy: 1-50ms typically. | Best effort. Good enough for logging, not for ordering. |
| Lamport timestamps | Logical counter incremented on each event and carried on every message. Ordering is consistent with causality. | Lightweight ordering. Cannot tell whether two events were concurrent. |
| Vector clocks | Each node maintains a counter per node. Detects concurrent events. | Amazon's original Dynamo, Riak (version vectors). Conflict detection. |
| TrueTime (Google) | GPS + atomic clocks. Returns [earliest, latest] time interval. | Google Spanner. Enables global strong consistency. |
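The two logical-clock rows in the table can be sketched side by side. This is a minimal in-memory illustration with hypothetical names (`LamportClock`, `vc_increment`, `vc_merge`, `vc_compare`); vector clocks are plain dictionaries mapping node name to counter.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, sent_time):
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sent_time) + 1
        return self.time


def vc_increment(clock, node):
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_merge(local, received):
    """Element-wise maximum of two vector clocks."""
    return {n: max(local.get(n, 0), received.get(n, 0))
            for n in local.keys() | received.keys()}


def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Lamport: a receive is always stamped later than the matching send.
sender, receiver = LamportClock(), LamportClock()
sent_at = sender.tick()
received_at = receiver.receive(sent_at)

# Vector clocks: nodes A and B both update the same base version
# without seeing each other's write.
base = vc_increment({}, "A")
write_a = vc_increment(base, "A")
write_b = vc_increment(base, "B")

print(sent_at, received_at, vc_compare(base, write_a), vc_compare(write_a, write_b))
# 1 2 before concurrent
```

A "concurrent" result is the signal that the application, or a merge rule, has to resolve a conflict. Lamport timestamps alone cannot produce that signal.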
9. Frequently Asked Questions
Do I need to understand distributed systems if I'm building a simple web app?
If your app uses a separate database server, a Redis cache, or a message queue — yes, you're running a distributed system. Understanding failure modes (network partitions, replication lag, split brain) helps you avoid bugs that only appear in production under load. You don't need to implement Raft, but you should understand why your read replica might return stale data.
What's the difference between replication and sharding?
Replication copies the same data to multiple nodes for fault tolerance and read scaling. Sharding splits different data across different nodes for write scaling and storage capacity. Most production systems use both: data is sharded across nodes, and each shard is replicated for redundancy. Think of replication as keeping redundant live copies and sharding as dividing the workload.
How do I handle distributed transactions across microservices?
Avoid two-phase commit (2PC) across services where you can: it adds latency, and participants that have voted to commit stay blocked, holding locks, if the coordinator fails. The Saga pattern is the usual alternative: each service executes its local transaction and publishes an event. If a step fails, compensating transactions undo previous steps. Sagas give up isolation, so other requests can observe intermediate states. Our Event-Driven Architecture guide covers saga choreography and orchestration in detail.
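The compensation logic of a saga can be sketched as an in-process orchestrator. This is an illustration only: `run_saga` and the step functions are hypothetical, each step is a plain function standing in for a call to another service, and a real orchestrator would persist its progress so it can resume after a crash.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    On the first failure, run the compensations of the completed steps
    in reverse order. Returns (succeeded, log).
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


state = {"reserved": 0, "charged": 0}


def reserve_inventory():
    state["reserved"] += 1


def release_inventory():
    state["reserved"] -= 1


def charge_payment():
    raise RuntimeError("card declined")  # simulated failure in step 2


def refund_payment():
    state["charged"] -= 1


def ship_order():
    pass


def cancel_shipment():
    pass


ok, log = run_saga([
    ("reserve_inventory", reserve_inventory, release_inventory),
    ("charge_payment", charge_payment, refund_payment),
    ("ship_order", ship_order, cancel_shipment),
])
print(ok, log)
# False ['done:reserve_inventory', 'failed:charge_payment', 'compensated:reserve_inventory']
```

Between the reservation and its compensation, other requests could see the inventory as reserved. That visible intermediate state is the isolation a saga trades away.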
What books should I read to learn distributed systems?
Start with "Designing Data-Intensive Applications" by Martin Kleppmann — it's the single best resource, covering everything from replication to consensus to stream processing. Follow with "Understanding Distributed Systems" by Roberto Vitillo for a more practical, shorter introduction. For deep theory, "Distributed Systems" by Maarten van Steen and Andrew Tanenbaum is the academic standard.
We've debugged distributed system failures across PostgreSQL replication lag, Kafka consumer group rebalances, and Redis cluster split-brain scenarios. These fundamentals aren't academic for us — they're the concepts we apply daily when designing systems that need to stay up.