We started running chaos experiments after a client's payment service went down for 45 minutes because a single Redis node ran out of memory. The service was "designed for high availability" with replicas, health checks, and auto-scaling. None of it worked as expected when the actual failure happened. That's the gap chaos engineering closes — the distance between what you think your system does during failure and what it actually does.
What Chaos Engineering Actually Is (Not Just "Kill Random Stuff")
Chaos engineering is disciplined experimentation on distributed systems. The key word is disciplined. You're not randomly pulling plugs. You're forming a hypothesis about how your system handles a specific failure, running a controlled experiment, and observing whether reality matches your expectations.
The scientific method, applied to infrastructure:
- Define steady state — What does "healthy" look like? (e.g., p99 latency under 200ms, error rate below 0.1%)
- Hypothesize — "If Redis primary fails, the replica promotes within 10 seconds and the app continues serving with no errors"
- Inject failure — Kill the Redis primary
- Observe — Did the failover happen? How long did it take? What happened to in-flight requests?
- Learn — Either your hypothesis was correct (great, you have evidence) or it wasn't (great, you found a bug before your customers did)
Both outcomes are wins. That's the mindset shift that makes chaos engineering work.
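The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `run_experiment`, `ExperimentResult`, and the three callables are hypothetical names, and you would supply your own steady-state check, failure injection, and rollback.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    ran: bool              # False if the system was unhealthy before injection
    hypothesis_held: bool  # True if the steady state survived the failure


def run_experiment(
    steady_state_ok: Callable[[], bool],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
) -> ExperimentResult:
    """Verify steady state, inject the failure, observe, and always roll back."""
    # Refuse to start if the system is already outside its steady state.
    if not steady_state_ok():
        return ExperimentResult(ran=False, hypothesis_held=False)
    try:
        inject_failure()
        # Observe: does the steady-state check still pass under failure?
        held = steady_state_ok()
    finally:
        # Stop the experiment no matter what happened above.
        rollback()
    return ExperimentResult(ran=True, hypothesis_held=held)
```

A result with `hypothesis_held=False` is the "found a bug before your customers did" outcome; a result with `ran=False` means the experiment should be rescheduled rather than interpreted.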
The Steady-State Hypothesis: Getting It Right
The steady-state hypothesis is where most teams mess up. "The system should be fine" is not a hypothesis. You need measurable, observable criteria.
Good vs. Bad Hypotheses
| Bad Hypothesis | Good Hypothesis | Why It's Better |
|---|---|---|
| "The system handles node failure" | "When 1 of 3 API pods is killed, p99 latency stays under 300ms and error rate stays below 0.5% during the 60-second recovery window" | Specific numbers, specific timeframe, measurable via monitoring |
| "Database failover works" | "When RDS primary fails, the replica promotes within 30 seconds. The app retries failed queries automatically. No user-facing errors after the first 5 seconds" | Defines acceptable impact window, tests retry logic, not just DB failover |
| "The service degrades gracefully" | "When the recommendation service returns errors, the product page renders within 2 seconds showing a default product list instead of personalized recommendations" | Specific fallback behavior, not vague "graceful degradation" |
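One way to keep a hypothesis measurable is to write its thresholds down as data and evaluate them against samples collected during the experiment. The sketch below assumes you have a list of request latencies in milliseconds and an error count for the observation window; `SteadyState`, `p99`, and `hypothesis_holds` are illustrative names, and the nearest-rank percentile is one of several reasonable definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    max_p99_ms: float      # e.g. 300 for "p99 latency stays under 300ms"
    max_error_rate: float  # e.g. 0.005 for "error rate stays below 0.5%"


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of the observed latencies."""
    if not latencies_ms:
        raise ValueError("no samples collected during the experiment")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based rank
    return ordered[rank - 1]


def hypothesis_holds(state: SteadyState, latencies_ms: list[float], errors: int) -> bool:
    """True only if both the latency and the error-rate criteria are met."""
    error_rate = errors / len(latencies_ms)
    return p99(latencies_ms) < state.max_p99_ms and error_rate < state.max_error_rate
```

For example, `hypothesis_holds(SteadyState(300, 0.005), samples, errors=0)` encodes the first "good" hypothesis in the table, given samples from the recovery window.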
Types of Chaos Experiments
Start with the simple ones and work your way up. Seriously — we've seen teams jump straight to multi-AZ failure and learn nothing because they couldn't even handle a single pod restart.
Level 1: Infrastructure Failures
- Pod/container kill — Does the deployment recover? How long until the replacement pod is healthy?
- Node drain/termination — Do pods reschedule? Does the Kubernetes scheduler spread them correctly?
- Disk fill — Does the app handle full disk gracefully? Do alerts fire?
- CPU/memory stress — Do resource limits protect other pods on the node? Does auto-scaling kick in?
Level 2: Network Failures
- Network latency injection — Add 500ms latency to database calls. Do timeouts and circuit breakers work?
- Packet loss — 5% packet loss on inter-service calls. How do retries behave? Do they cascade?
- DNS failure — Corrupt DNS resolution. Does the app cache DNS appropriately? Or does one DNS blip cascade to every service?
- Network partition — Split the network between services. Does the system detect the split? How does it recover when the partition heals?
Level 3: Application Failures
- Dependency outage — Kill a downstream service entirely. Does the caller degrade gracefully or cascade-fail?
- Slow dependency — Make a dependency respond in 10 seconds instead of 100ms. This is worse than a hard failure because the caller's thread pool fills up waiting
- Data corruption — Return malformed responses from a dependency. Does the caller validate input or blindly trust it?
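The behaviour these experiments probe for can be sketched as a caller that bounds its wait and falls back to a default. This is an assumption-laden illustration: `recommendations_with_fallback`, `DEFAULT_PRODUCTS`, and the `fetch` callable are hypothetical, and a real service would more likely set the timeout on its HTTP or RPC client.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable

# Hypothetical fallback shown when personalized results are unavailable.
DEFAULT_PRODUCTS = ["bestseller-1", "bestseller-2"]

_pool = ThreadPoolExecutor(max_workers=4)


def recommendations_with_fallback(
    fetch: Callable[[], list[str]], timeout_s: float
) -> list[str]:
    """Return fetch()'s result, or the default list if it is slow or fails."""
    future = _pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Slow dependency: stop waiting and serve the default list.
        future.cancel()  # only has an effect if the call has not started yet
        return list(DEFAULT_PRODUCTS)
    except Exception:
        # Hard failure: same degraded response, returned immediately.
        return list(DEFAULT_PRODUCTS)
```

Note that the timeout only bounds how long the caller waits. The worker thread stays busy until the slow call returns, so the pool can still fill up unless the dependency client enforces its own timeout, which is exactly what a slow-dependency experiment should reveal.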
Level 4: Multi-Component Failures
- AZ failure simulation — Drain all nodes in one availability zone. Does the system stay up with reduced capacity?
- Cascading failure — Kill a service that triggers retry storms in its callers. Do circuit breakers prevent the cascade?
- Clock skew — Advance the clock on some nodes. Do JWT tokens expire unexpectedly? Do cron jobs fire twice?
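For readers who have not built one, the circuit breaker that the cascading-failure experiment exercises can be sketched as below. This is a simplified, single-threaded illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`); production libraries add thread safety, rolling failure windows, and stricter half-open handling.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int,
        reset_timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                # Open: fail fast instead of adding load to a struggling dependency.
                raise CircuitOpenError("dependency calls are suspended")
            # Half-open: the reset timeout elapsed, so let a trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The two constructor arguments are the knobs a chaos experiment should challenge: a `failure_threshold` that is too high opens the circuit after the cascade has started, and a `reset_timeout_s` that is too short closes it before the dependency has recovered.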
Tools: What We've Used and What We Recommend
| Tool | Best For | Platform | Our Take |
|---|---|---|---|
| Litmus Chaos | Kubernetes-native chaos | K8s | Our default for K8s environments. ChaosHub has pre-built experiments. Integrates with GitOps via CRDs |
| Chaos Mesh | Comprehensive K8s chaos | K8s | More experiment types than Litmus (especially network chaos). Dashboard is nicer. CNCF incubating project |
| Gremlin | Enterprise chaos (any infra) | Any | Best UI, best for non-K8s. But expensive ($5K+/year). Worth it for teams that need guardrails and compliance |
| AWS FIS | AWS-native failures | AWS | Best for simulating AWS-specific failures (EC2 spot interruption, AZ outage). Limited to AWS |
| Toxiproxy | Network-level chaos for testing | Any | Simple, lightweight TCP proxy that injects latency/errors. Perfect for integration tests and local development |
| Chaos Monkey (Netflix) | Random instance termination | AWS/K8s | The original, but dated. Litmus and Chaos Mesh have surpassed it in capability |
Running Game Days: The Practical Guide
A game day is a scheduled chaos experiment session where the team intentionally breaks things and practices incident response. Think of it as a fire drill for your infrastructure.
Before the Game Day
- Pick 3-5 experiments — Ranked by impact and likelihood. Don't try to test everything
- Define blast radius controls — What's the maximum impact you're willing to accept? At what point do you abort?
- Ensure observability — Can you actually see what's happening? Dashboards prepped, alerts tuned, logs accessible
- Notify stakeholders — Customer support, PMs, anyone who might see impact. "We're running chaos experiments from 2-4 PM, you might see brief latency spikes"
- Prepare rollback — Know exactly how to stop the experiment instantly
During the Game Day
- One person runs the experiment, one person monitors dashboards, one person takes notes
- Start with the mildest experiment. Build confidence before escalating
- If anything unexpected happens, stop immediately. You've already learned something valuable
- Time-box each experiment: 15-minute max
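The abort threshold and the time-box can be enforced by a small watchdog rather than by someone watching the clock. The sketch below is one possible shape, assuming you can read an error rate from your monitoring and have a function that stops the experiment; `watch_experiment` and all of its parameters are hypothetical.

```python
import time
from typing import Callable


def watch_experiment(
    read_error_rate: Callable[[], float],
    stop_experiment: Callable[[], None],
    abort_error_rate: float,
    max_duration_s: float = 15 * 60,  # the 15-minute time-box
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a metric until the time-box ends or the abort threshold is crossed."""
    deadline = clock() + max_duration_s
    try:
        while clock() < deadline:
            if read_error_rate() > abort_error_rate:
                return "aborted"
            sleep(poll_interval_s)
        return "time-box reached"
    finally:
        # The experiment is stopped on every exit path, including exceptions.
        stop_experiment()
```

Passing `clock` and `sleep` as parameters is a design choice that lets the watchdog itself be tested without waiting in real time.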
After the Game Day
- Write up findings within 24 hours (memory fades fast)
- For each experiment: hypothesis, actual result, gap analysis, action items
- Prioritize fixes and assign owners. Chaos without follow-up is just breaking things
- Share results broadly. Game day findings often reveal issues relevant to multiple teams
Chaos in Kubernetes: A Practical Example
Here's a real Litmus chaos experiment we run for clients' K8s deployments.
```yaml
# litmus-pod-delete.yaml
# Experiment: Kill a random pod from the user-service deployment
# Hypothesis: the Deployment replaces the killed pod within 30s, p99 stays under 500ms
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: user-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=user-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'     # Run chaos for 30 seconds in total
            - name: CHAOS_INTERVAL
              value: '10'     # Delete pods every 10 seconds
            - name: FORCE
              value: 'false'  # Graceful termination (test shutdown hooks)
            - name: PODS_AFFECTED_PERC
              value: '30'     # Target 30% of pods (1 of 3)
        probe:
          - name: check-latency
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: http://prometheus:9090
              # Aggregate buckets across pods so the query returns one service-wide p99
              query: 'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="user-service"}[1m])))'
              comparator:
                type: float
                criteria: '<='
                value: '0.5'  # p99 must stay at or under 500ms (0.5s)
```
The probe is the key part: it continuously checks whether p99 latency stays within bounds during the experiment. If the value exceeds 500ms, the probe fails, the experiment is marked as failed, and you know you have a resilience gap to fix.
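To make the check concrete, here is a rough Python equivalent of what such a probe evaluates, written against the Prometheus HTTP API's instant-query endpoint. It is a sketch for understanding or for ad hoc checks, not how Litmus implements its probe; `instant_query_url`, `probe_passes`, and `check_latency` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(endpoint: str, query: str) -> str:
    """Build the URL for a Prometheus instant query."""
    return f"{endpoint.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def probe_passes(payload: dict, max_value: float) -> bool:
    """True only if the query returned samples and every one is <= max_value."""
    if payload.get("status") != "success":
        return False
    # Each instant-vector sample looks like {"metric": {...}, "value": [ts, "0.42"]}.
    values = [float(sample["value"][1]) for sample in payload["data"]["result"]]
    # An empty result or a NaN sample (no traffic) counts as a failed check.
    return bool(values) and all(v <= max_value for v in values)


def check_latency(endpoint: str, query: str, max_value: float) -> bool:
    """Run the query once against a live Prometheus and apply the comparator."""
    with urlopen(instant_query_url(endpoint, query), timeout=5) as resp:
        return probe_passes(json.load(resp), max_value)
```

Treating an empty result as a failure is a deliberate choice here: a probe that passes because no data arrived would hide exactly the kind of gap the experiment is meant to expose.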
Chaos Maturity Model
Where is your organization on this spectrum? Be honest — most are at Level 1 or 2.
| Level | Description | What You're Doing | Next Step |
|---|---|---|---|
| 0 — Reactive | You learn about failures from incidents | Nothing proactive. Post-mortems after outages | Run one manual chaos experiment in staging |
| 1 — Ad Hoc | Occasional manual experiments | Someone kills a pod occasionally, watches what happens | Write proper hypotheses, use a chaos tool, document results |
| 2 — Structured | Regular game days with proper process | Monthly game days, defined experiments, follow-up actions | Run experiments in production (with blast radius controls) |
| 3 — Automated | Chaos experiments run continuously in CI/CD | Every deploy triggers resilience tests. Failures block releases | Expand scope: multi-AZ, dependency chains, data-layer chaos |
| 4 — Cultural | Resilience is a design requirement, not an afterthought | Teams design for failure from day one. Chaos experiments are in every service's definition of done | Share learnings publicly, contribute to chaos engineering community |
What We Learned Breaking Our Own Systems
We run chaos experiments on our own cloud infrastructure and client environments. Some surprising findings:
- Circuit breakers often aren't configured correctly. We found that 40% of the circuit breakers we tested either had thresholds too high (triggered after the system was already cascading) or reset too quickly (closed before the dependency actually recovered)
- Health checks lie. A pod can pass its health check while being completely unable to serve traffic — because the health check endpoint doesn't exercise the same code path as real requests. Test health checks as part of your chaos experiments
- Retry storms were the most common trigger of cascading failures in our experiments. When service A fails and services B through G all retry simultaneously, the combined load overwhelms service A's recovery. Exponential backoff with jitter isn't optional; it's a requirement
- DNS failures cause the widest blast radius. When we simulated DNS issues, more services broke than with any other failure type. And most teams had zero DNS resilience
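Since exponential backoff with jitter comes up in almost every set of findings, here is a minimal sketch of the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. The function names and default values are illustrative assumptions, not taken from any particular client library.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float, cap_s: float, rng: random.Random) -> float:
    """Full jitter: a uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def retry(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_s: float = 0.1,
    cap_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn until it succeeds, sleeping a jittered delay between attempts."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, base_s, cap_s, rng))
    raise ValueError("max_attempts must be at least 1")
```

The randomness is what breaks up the storm: without it, every caller that failed at the same moment retries at the same moment, however long the backoff is.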
Frequently Asked Questions
Should we run chaos experiments in production?
Eventually, yes — production is the only environment that truly represents your system. But start in staging until you've built confidence and tooling. Move to production only when you have proper blast radius controls, automated rollback, and observability to detect impact immediately.
How do we convince management that breaking things on purpose is a good idea?
Frame it as risk reduction, not destruction. Show the cost of your last outage (revenue lost, engineering hours, customer trust). Then explain that a 30-minute planned experiment costs far less than discovering the same issue during a real outage. Start with staging to build credibility.
What's the minimum we need before starting chaos engineering?
Observability (you need to see what's happening), a staging environment, and basic deployment automation. If you can't monitor your system or deploy changes quickly, fix those first. Chaos without observability is just breaking things blindly.
How often should we run chaos experiments?
Monthly game days for manual experiments. Continuously in CI/CD for automated resilience tests. After every major architecture change. The cadence matters less than consistency — a monthly game day you actually do beats a weekly one you keep postponing.