In This Guide
- 1. The Fundamentals That Actually Matter
- 2. Scaling — Vertical vs Horizontal
- 3. Load Balancing Strategies
- 4. Caching — The Biggest Performance Win
- 5. Database Selection and Sharding
- 6. Message Queues and Async Processing
- 7. CDN and Edge Architecture
- 8. Design Walkthrough: URL Shortener at Scale
- 9. Common System Design Mistakes
- 10. Frequently Asked Questions
System design is the art of making architecture decisions under uncertainty. You're choosing databases before you know your query patterns, picking queue systems before you understand your throughput, and designing APIs before the product team finalizes requirements. We've designed systems for clients that scale from zero to millions of requests — and the biggest lesson is that premature optimization kills more projects than lack of optimization. This guide covers the building blocks you actually need, when to use them, and when to wait.
1. The Fundamentals That Actually Matter
Before diving into specific components, understand the constraints every distributed system faces:
| Principle | What It Means | Real Impact |
|---|---|---|
| CAP Theorem | Pick 2 of 3: Consistency, Availability, Partition tolerance | Network partitions WILL happen. Choose CP (banking) or AP (social media feeds) |
| Latency numbers | L1 cache: 1ns, RAM: 100ns, SSD: 100µs, Network: 150ms | A disk read is 100,000x slower than L1 cache. Cache everything you can. |
| Back-of-envelope math | Estimate before building: QPS, storage, bandwidth | 1M DAU × 10 req/user × 1KB = 10GB/day traffic. Can one server handle it? Yes. |
| Single point of failure | Every component that isn't redundant is a SPOF | One database, one load balancer, one DNS provider = downtime when any fails |
| Stateless > Stateful | Servers that hold no session state are trivially scalable | Move state to external stores (Redis, DB). Any server can handle any request. |
2. Scaling — Vertical vs Horizontal
Vertical scaling (bigger server) is simpler and should be your first move. A single modern server with 64 cores, 256GB RAM, and NVMe SSDs handles more traffic than most startups will ever see. Don't add complexity until you need it.
| Vertical (Scale Up) | Horizontal (Scale Out) | |
|---|---|---|
| How | More CPU, RAM, disk on one machine | More machines behind a load balancer |
| Complexity | Zero code changes | Requires stateless design, distributed systems knowledge |
| Limit | Hardware ceiling (~256 cores, ~24TB RAM) | Practically unlimited |
| Cost curve | Exponential (2x specs = 3-4x cost) | Linear (2x servers = 2x cost) |
| When to use | < 10K RPS, databases, caches | > 10K RPS, stateless app servers, read replicas |
Scaling progression (what we recommend to clients):
Stage 1: Single Server (0-1K users)
App + DB + Cache on one machine
Cost: $50-200/mo
Stage 2: Separate DB (1K-10K users)
App Server <--> Database Server + Redis
Cost: $200-500/mo
Stage 3: Load Balanced (10K-100K users)
LB --> [App1, App2, App3] --> DB Primary + Read Replica + Redis
Cost: $500-2K/mo
Stage 4: Full Distribution (100K+ users)
CDN --> LB --> [App Pool] --> [DB Cluster + Sharding] + [Cache Cluster]
--> [Message Queue] --> [Worker Pool]
Cost: $2K-20K/mo
3. Load Balancing Strategies
A load balancer distributes incoming requests across multiple servers. The algorithm you choose affects performance, session handling, and failure behavior.
| Algorithm | How It Works | Best For | Watch Out |
|---|---|---|---|
| Round Robin | Request 1 → Server A, Request 2 → Server B, ... | Uniform request cost, stateless servers | Ignores server health and load |
| Least Connections | Send to server with fewest active connections | Variable request duration (websockets, uploads) | New servers get hammered initially |
| Weighted Round Robin | Bigger servers get proportionally more requests | Mixed hardware, gradual canary deploys | Weights need manual tuning |
| IP Hash | Same client IP always goes to same server | Sticky sessions without cookies | Uneven distribution behind NAT/proxy |
| Consistent Hashing | Hash key determines server, minimal redistribution on changes | Cache clusters, database sharding | Need virtual nodes for even distribution |
# Nginx load balancer config — our standard setup
upstream app_servers {
least_conn; # or: round_robin, ip_hash
server app1.internal:3000 weight=3; # 8-core, gets 3x traffic
server app2.internal:3000 weight=2; # 4-core
server app3.internal:3000 weight=2; # 4-core
server app4.internal:3000 backup; # only when others are down
# Health checks
keepalive 32;
}
server {
listen 443 ssl http2;
location / {
proxy_pass http://app_servers;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
proxy_next_upstream error timeout http_503;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
}
}
4. Caching — The Biggest Performance Win
Caching is the single most effective way to improve system performance. A cache hit returns data in microseconds instead of milliseconds (database) or seconds (external API). We've reduced P99 latency from 800ms to 12ms for a client's product catalog by adding a two-layer cache.
| Cache Layer | Latency | Tool | What to Cache |
|---|---|---|---|
| Browser | 0ms (already loaded) | Cache-Control headers | Static assets, API responses |
| CDN | 5-50ms | CloudFlare, Fastly | Images, CSS/JS, public API responses |
| Application (in-process) | ~1µs | Node-cache, Guava, in-memory map | Config, feature flags, hot data |
| Distributed cache | 0.5-2ms | Redis, Memcached | Sessions, DB query results, computed data |
| Database query cache | 0.1-1ms | MySQL query cache, pg_stat | Repeated identical queries |
Cache Invalidation Strategies
Phil Karlton famously said there are only two hard things in computer science: cache invalidation and naming things. He was right about the first one.
| Strategy | How It Works | Trade-off |
|---|---|---|
| Cache-Aside | App checks cache first, falls back to DB, writes to cache | Cache miss = extra latency. Most common pattern — we use this by default. |
| Write-Through | Every write goes to cache AND DB simultaneously | Always consistent, but writes are slower |
| Write-Behind | Write to cache immediately, batch-write to DB later | Fast writes but risk data loss if cache crashes |
| TTL-based | Data expires after N seconds, re-fetched on next request | Simple but stale data for up to TTL duration |
// Cache-aside pattern (Node.js + Redis)
async function getProduct(id: string): Promise<Product> {
// 1. Check cache first
const cached = await redis.get(`product:${id}`);
if (cached) return JSON.parse(cached);
// 2. Cache miss — query database
const product = await db.query('SELECT * FROM products WHERE id = $1', [id]);
if (!product) throw new NotFoundError('Product', id);
// 3. Write to cache with TTL
await redis.setex(`product:${id}`, 3600, JSON.stringify(product));
return product;
}
// Invalidate on update
async function updateProduct(id: string, data: Partial<Product>): Promise<void> {
await db.query('UPDATE products SET ... WHERE id = $1', [id]);
await redis.del(`product:${id}`); // delete, don't update
}
5. Database Selection and Sharding
The database you choose limits your architecture more than any other decision. Here's how we think about it:
| Need | Database | Why |
|---|---|---|
| Transactions + relationships | PostgreSQL | ACID, JSON support, extensions. Our default choice for 90% of projects. |
| Flexible schema + documents | MongoDB | Schema-less, horizontal scaling. Good for content, catalogs, logs. |
| Key-value + caching | Redis | Sub-millisecond reads. Sessions, rate limiting, leaderboards. |
| Full-text search | Elasticsearch | Inverted index, relevance scoring, faceted search. |
| Time series data | TimescaleDB / InfluxDB | Optimized for write-heavy, time-ordered data. Metrics, IoT, logs. |
| Graph relationships | Neo4j | Traversal queries. Social networks, recommendations, fraud detection. |
| Massive write throughput | Cassandra / ScyllaDB | Write-optimized, multi-region. Messaging, activity feeds. |
Database Sharding
Sharding splits your database across multiple servers. Each shard holds a subset of the data. It's necessary when a single database can't handle your write volume or storage needs — but it adds enormous complexity. Don't shard until you've exhausted read replicas, query optimization, and caching.
| Shard Key Strategy | Example | Risk |
|---|---|---|
| Hash-based | hash(user_id) % num_shards | Even distribution, but re-sharding requires data migration |
| Range-based | Users A-M → Shard 1, N-Z → Shard 2 | Hot spots if distribution is uneven |
| Geography-based | US users → US shard, EU → EU shard | Cross-region queries are slow. Good for data residency. |
| Tenant-based | Each company gets its own shard | Large tenants become hot shards. Works well for multi-tenant SaaS. |
6. Message Queues and Async Processing
Message queues decouple producers from consumers. The user clicks "send email" — your API returns 200ms later while the actual email is processed asynchronously. This pattern turns a 3-second synchronous operation into a 200ms API response plus a background task.
Without queue (synchronous):
User → API → Generate PDF → Send Email → Log Analytics → Response
Total: 3200ms (user waits the whole time)
With queue (asynchronous):
User → API → Enqueue Tasks → Response (200ms)
↓
Queue → Worker 1: Generate PDF (1500ms)
→ Worker 2: Send Email (800ms)
→ Worker 3: Log Analytics (200ms)
When to use a message queue:
- Long-running tasks (PDF generation, image processing, report building)
- Third-party API calls that might be slow or fail (email, SMS, payment webhooks)
- Smoothing traffic spikes (queue absorbs burst, workers process at constant rate)
- Decoupling services so they can scale independently
For broker comparison, see our Event-Driven Architecture guide which covers Kafka, RabbitMQ, SQS, and Redis Streams in detail.
7. CDN and Edge Architecture
A CDN (Content Delivery Network) caches your content on servers worldwide. A user in Mumbai gets your site from a Mumbai edge server (5ms) instead of your US origin server (150ms). For a global audience, this is the single biggest performance improvement you can make.
| What to Cache at Edge | TTL | Invalidation |
|---|---|---|
| Static assets (JS, CSS, images) | 1 year (with content hash in filename) | New filename on deploy — no purge needed |
| Public API responses | 30-300 seconds | Purge API on data change |
| HTML pages | 60-3600 seconds (depends on freshness needs) | Stale-while-revalidate for best UX |
| User-specific data | DON'T cache at CDN | Use Cache-Control: private |
8. Design Walkthrough: URL Shortener at Scale
Let's apply these principles to a real system. We'll design a URL shortener that handles 100M URLs and 1B redirects per month.
Back-of-Envelope Math
- Writes: 100M URLs/month = ~40 URLs/second
- Reads: 1B redirects/month = ~400 reads/second (10:1 read-write ratio)
- Storage: 100M URLs × 500 bytes avg = 50GB/year
- Bandwidth: 400 RPS × 500 bytes = 200KB/s (trivial)
This is modest. A single PostgreSQL instance handles this easily. The bottleneck is redirect latency — every millisecond of delay is a bad user experience.
URL Shortener Architecture:
┌────────────┐
Client ───▶│ CloudFlare │ (CDN cache for popular URLs)
└─────┬──────┘
│ cache miss
┌─────▼──────┐
│ Load Balancer│
└─────┬──────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐
│ App 1│ │ App 2│ │ App 3│ (stateless API servers)
└──┬───┘ └──┬───┘ └──┬───┘
│ │ │
└─────────┼─────────┘
│
┌───────┴───────┐
▼ ▼
┌──────┐ ┌──────────┐
│ Redis │ │ PostgreSQL│
│(cache)│ │ (source) │
└──────┘ └──────────┘
Redirect flow: CDN (80% hit) → Redis (15%) → PostgreSQL (5%)
P50 latency: 5ms (CDN hit), P99: 25ms
Key Design Decisions
- Short code generation: Base62 encoding of auto-increment ID. 6 characters = 56B combinations. No collision checking needed.
- Cache hot URLs at CDN: The top 20% of URLs account for 80% of traffic. CDN handles most redirects without hitting your servers.
- Redis for warm URLs: Cache the most recent 10M URLs in Redis (5GB). Sub-millisecond lookups.
- Analytics via Kafka: Log every redirect to Kafka. A separate consumer aggregates click stats. Never slow down the redirect path for analytics.
9. Common System Design Mistakes
| Mistake | Why It Happens | What to Do Instead |
|---|---|---|
| Premature microservices | "Netflix does it" | Start with a monolith. Extract when you hit specific scaling or team bottlenecks. |
| Sharding too early | Fear of future scale | Read replicas + caching handle 10x more than you think. Shard when you must. |
| No caching strategy | "We'll add caching later" | Design cache keys from day one. Retrofitting is 10x harder. |
| Synchronous everything | Simpler to reason about | Any operation over 200ms should be async. Users shouldn't wait for emails to send. |
| Ignoring failure modes | "It probably won't fail" | Every external call fails eventually. Implement timeouts, retries, circuit breakers. |
| Single database for everything | Simplicity | Use the right DB for each job: PostgreSQL for transactions, Redis for cache, ES for search. |
10. Frequently Asked Questions
How many users can a single server handle?
More than most people think. A well-optimized Node.js or Go server on a 4-core machine handles 5,000-10,000 requests per second. With proper caching and database optimization, that's 50,000-100,000 daily active users. Don't over-engineer until you're actually hitting limits — measure first.
When should I switch from a monolith to microservices?
When specific problems demand it: teams are blocking each other on deployments, one module needs different scaling than others, or the codebase is too large for any developer to understand. Don't switch because of hype. Our article on Microservices vs Monolith covers the decision framework in detail.
SQL or NoSQL — how do I decide?
Default to PostgreSQL. It handles 90% of use cases with transactions, JSON support, full-text search, and excellent performance. Use NoSQL when you have specific needs: DynamoDB for serverless at scale, MongoDB for truly schema-less data, Redis for caching. The "SQL can't scale" myth died years ago — Instagram runs on PostgreSQL.
How do I handle real-time features like notifications or chat?
WebSockets for persistent connections (chat, live collaboration), Server-Sent Events for one-way updates (notifications, live feeds), and long-polling as a fallback. We use Socket.io or native WebSockets with Redis Pub/Sub for horizontal scaling. For push notifications, Firebase Cloud Messaging or AWS SNS handles the device delivery.
What's the cheapest way to run a scalable system?
One VPS ($20/month) with PostgreSQL, Redis, and Nginx handles more than most startups need. Add CloudFlare's free CDN tier. Use managed databases only when you can't afford DBA time. Serverless (Lambda/Cloud Functions) is cheapest at very low traffic but gets expensive at scale. We've run production systems serving 50K DAU on $100/month infrastructure.
We design and build scalable systems for startups and enterprises — from zero-to-one MVPs to platforms handling millions of requests. This guide reflects real architecture decisions we've made, including the ones that didn't work out the first time.