
Chaos Engineering: Breaking Things on Purpose (Before They Break You)

Your distributed system will fail. The question is whether you discover the failure mode at 3 PM during a planned experiment or at 3 AM during a customer-facing outage.

September 29, 2025 13 min read

We started running chaos experiments after a client's payment service went down for 45 minutes because a single Redis node ran out of memory. The service was "designed for high availability" with replicas, health checks, and auto-scaling. None of it worked as expected when the actual failure happened. That's the gap chaos engineering closes — the distance between what you think your system does during failure and what it actually does.

What Chaos Engineering Actually Is (Not Just "Kill Random Stuff")

Chaos engineering is disciplined experimentation on distributed systems. The key word is disciplined. You're not randomly pulling plugs. You're forming a hypothesis about how your system handles a specific failure, running a controlled experiment, and observing whether reality matches your expectations.

The scientific method, applied to infrastructure:

  1. Define steady state — What does "healthy" look like? (e.g., p99 latency under 200ms, error rate below 0.1%)
  2. Hypothesize — "If Redis primary fails, the replica promotes within 10 seconds and the app continues serving with no errors"
  3. Inject failure — Kill the Redis primary
  4. Observe — Did the failover happen? How long did it take? What happened to in-flight requests?
  5. Learn — Either your hypothesis was correct (great, you have evidence) or it wasn't (great, you found a bug before your customers did)

Both outcomes are wins. That's the mindset shift that makes chaos engineering work.
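
The five steps above can be sketched as a small harness. This is a minimal illustration, not a replacement for a chaos tool: `SteadyState`, `run_experiment`, and the metric keys `p99_ms` and `error_rate` are hypothetical names chosen for this example, and the callables for reading metrics, injecting the fault, and rolling back are assumed to be supplied by you.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class SteadyState:
    """Measurable definition of 'healthy' (thresholds are illustrative)."""
    max_p99_ms: float
    max_error_rate: float

    def holds(self, m: Metrics) -> bool:
        return m["p99_ms"] <= self.max_p99_ms and m["error_rate"] <= self.max_error_rate


def run_experiment(steady: SteadyState,
                   read_metrics: Callable[[], Metrics],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    # Step 1: confirm the system is healthy before touching anything.
    if not steady.holds(read_metrics()):
        return "aborted: system not in steady state"
    try:
        inject()                    # Step 3: inject the failure
        observed = read_metrics()   # Step 4: observe (a real run would sample over a window)
    finally:
        rollback()                  # Always stop the fault, even if observation raised
    # Step 5: compare reality with the hypothesis.
    return "hypothesis held" if steady.holds(observed) else "hypothesis refuted"
```

Refusing to start when the system is already unhealthy is deliberate: an experiment run on a degraded system cannot tell you whether the injected fault or the pre-existing problem caused what you observed.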

The Steady-State Hypothesis: Getting It Right

The steady-state hypothesis is where most teams mess up. "The system should be fine" is not a hypothesis. You need measurable, observable criteria.

Good vs. Bad Hypotheses

| Bad Hypothesis | Good Hypothesis | Why It's Better |
|---|---|---|
| "The system handles node failure" | "When 1 of 3 API pods is killed, p99 latency stays under 300ms and error rate stays below 0.5% for the 60 seconds during recovery" | Specific numbers, specific timeframe, measurable via monitoring |
| "Database failover works" | "When RDS primary fails, the replica promotes within 30 seconds. The app retries failed queries automatically. No user-facing errors after the first 5 seconds" | Defines acceptable impact window, tests retry logic, not just DB failover |
| "The service degrades gracefully" | "When the recommendation service returns errors, the product page renders within 2 seconds showing a default product list instead of personalized recommendations" | Specific fallback behavior, not vague "graceful degradation" |

How we write hypotheses: Start from the SLA. If you promise 99.9% uptime, your system can be down about 8.8 hours per year (roughly 44 minutes per month). A single failure should consume no more than a few minutes of that budget. Work backwards from there to define acceptable impact duration and magnitude.
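
The SLA arithmetic is easy to check in code. The sketch below assumes a 365-day year; the function names are made up for this example. At 99.9% availability the budget works out to 8.76 hours per year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Allowed downtime over the period for a given availability target."""
    return period_hours * (1 - availability_pct / 100)


def incidents_affordable(availability_pct: float, minutes_per_incident: float) -> int:
    """How many incidents of a given length fit inside the yearly budget."""
    budget_minutes = downtime_budget_hours(availability_pct) * 60
    return int(budget_minutes // minutes_per_incident)
```

Working backwards this way makes the trade-off explicit: if each failure is allowed 5 minutes of impact, a 99.9% target leaves room for about a hundred such failures a year, and a single 45-minute outage spends roughly a month's worth of budget.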

Types of Chaos Experiments

Start with the simple ones and work your way up. Seriously — we've seen teams jump straight to multi-AZ failure and learn nothing because they couldn't even handle a single pod restart.

Level 1: Infrastructure Failures

  • Pod/container kill — Does the deployment recover? How long until the replacement pod is healthy?
  • Node drain/termination — Do pods reschedule? Does the Kubernetes scheduler spread them correctly?
  • Disk fill — Does the app handle full disk gracefully? Do alerts fire?
  • CPU/memory stress — Do resource limits protect other pods on the node? Does auto-scaling kick in?

Level 2: Network Failures

  • Network latency injection — Add 500ms latency to database calls. Do timeouts and circuit breakers work?
  • Packet loss — 5% packet loss on inter-service calls. How do retries behave? Do they cascade?
  • DNS failure — Corrupt DNS resolution. Does the app cache DNS appropriately? Or does one DNS blip cascade to every service?
  • Network partition — Split the network between services. Does the system detect the split? How does it recover when the partition heals?

Level 3: Application Failures

  • Dependency outage — Kill a downstream service entirely. Does the caller degrade gracefully or cascade-fail?
  • Slow dependency — Make a dependency respond in 10 seconds instead of 100ms. This is worse than a hard failure because the caller's thread pool fills up waiting
  • Data corruption — Return malformed responses from a dependency. Does the caller validate input or blindly trust it?
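
The slow-dependency case can be made concrete with Little's law: the number of requests in flight equals the arrival rate multiplied by how long each request holds a thread. The traffic figure (50 requests per second) and pool size (200 threads) used in the note below are illustrative assumptions, not measurements from the source.

```python
def threads_in_use(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate * time in system."""
    return requests_per_second * latency_seconds


def pool_saturated(requests_per_second: float, latency_seconds: float,
                   pool_size: int) -> bool:
    """True when steady-state demand meets or exceeds the caller's thread pool."""
    return threads_in_use(requests_per_second, latency_seconds) >= pool_size
```

At 50 requests per second, a 100ms dependency keeps about 5 threads busy. The same dependency at 10 seconds would need 500, so a 200-thread pool is exhausted and every request queues, including those that never touch the slow dependency. A hard failure returns immediately and frees the thread, which is why a timeout shorter than the slow response is the usual first defense.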

Level 4: Multi-Component Failures

  • AZ failure simulation — Drain all nodes in one availability zone. Does the system stay up with reduced capacity?
  • Cascading failure — Kill a service that triggers retry storms in its callers. Do circuit breakers prevent the cascade?
  • Clock skew — Advance the clock on some nodes. Do JWT tokens expire unexpectedly? Do cron jobs fire twice?
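
Several of the experiments above ask whether circuit breakers work. As a reference point for what is being tested, here is a minimal circuit breaker sketch. The class, its defaults (5 failures, 30 second reset), and the injectable `clock` parameter are assumptions made for this example; production libraries add per-window failure rates, half-open call limits, and thread safety.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Reset timeout elapsed: half-open, let this one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The two parameters map directly onto what a chaos experiment should probe: a `failure_threshold` that is too high lets the cascade start before the breaker trips, and a `reset_timeout_s` that is too short sends traffic back before the dependency has recovered.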

Tools: What We've Used and What We Recommend

| Tool | Best For | Platform | Our Take |
|---|---|---|---|
| Litmus Chaos | Kubernetes-native chaos | K8s | Our default for K8s environments. ChaosHub has pre-built experiments. Integrates with GitOps via CRDs |
| Chaos Mesh | Comprehensive K8s chaos | K8s | More experiment types than Litmus (especially network chaos). Dashboard is nicer. CNCF incubating project |
| Gremlin | Enterprise chaos (any infra) | Any | Best UI, best for non-K8s. But expensive ($5K+/year). Worth it for teams that need guardrails and compliance |
| AWS FIS | AWS-native failures | AWS | Best for simulating AWS-specific failures (EC2 spot interruption, AZ outage). Limited to AWS |
| Toxiproxy | Network-level chaos for testing | Any | Simple, lightweight TCP proxy that injects latency/errors. Perfect for integration tests and local development |
| Chaos Monkey (Netflix) | Random instance termination | AWS/K8s | The original, but dated. Litmus and Chaos Mesh have surpassed it in capability |

Running Game Days: The Practical Guide

A game day is a scheduled chaos experiment session where the team intentionally breaks things and practices incident response. Think of it as a fire drill for your infrastructure.

Before the Game Day

  1. Pick 3-5 experiments — Ranked by impact and likelihood. Don't try to test everything
  2. Define blast radius controls — What's the maximum impact you're willing to accept? At what point do you abort?
  3. Ensure observability — Can you actually see what's happening? Dashboards prepped, alerts tuned, logs accessible
  4. Notify stakeholders — Customer support, PMs, anyone who might see impact. "We're running chaos experiments from 2-4 PM, you might see brief latency spikes"
  5. Prepare rollback — Know exactly how to stop the experiment instantly

During the Game Day

  • One person runs the experiment, one person monitors dashboards, one person takes notes
  • Start with the mildest experiment. Build confidence before escalating
  • If anything unexpected happens, stop immediately. You've already learned something valuable
  • Time-box each experiment: 15-minute max

After the Game Day

  • Write up findings within 24 hours (memory fades fast)
  • For each experiment: hypothesis, actual result, gap analysis, action items
  • Prioritize fixes and assign owners. Chaos without follow-up is just breaking things
  • Share results broadly. Game day findings often reveal issues relevant to multiple teams

Chaos in Kubernetes: A Practical Example

Here's a real Litmus chaos experiment we run for clients' K8s deployments.

```yaml
# litmus-pod-delete.yaml
# Experiment: Kill a random pod from the user-service deployment
# Hypothesis: the Deployment's ReplicaSet replaces the pod within 30s, p99 stays at or under 500ms
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: user-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=user-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: '30'      # Run chaos for 30 seconds in total
            - name: CHAOS_INTERVAL
              value: '10'      # Delete pods every 10 seconds
            - name: FORCE
              value: 'false'   # Graceful termination (test shutdown hooks)
            - name: PODS_AFFECTED_PERC
              value: '30'      # Kill 30% of pods (1 of 3)
        probe:
          - name: check-latency
            type: promProbe
            mode: Continuous
            promProbe/inputs:
              endpoint: http://prometheus:9090
              query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="user-service"}[1m]))
              comparator:
                type: float
                criteria: '<='
                value: '0.5'   # p99 must stay at or under 500ms (query result is in seconds)
```

The probe is the key part — it automatically checks whether p99 latency stayed within bounds during the experiment. If it breached 500ms, the experiment is marked as failed and you know you have a resilience gap to fix.
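
To show what that check amounts to, here is a rough Python equivalent of the comparison. This is not Litmus's implementation. It assumes the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and its JSON response shape; the function names are invented for this sketch.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url: str, promql: str, timeout_s: float = 5.0) -> dict:
    """Run an instant query against the Prometheus HTTP API and return the parsed JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def extract_scalar(prom_response: dict) -> float:
    """Pull the first sample value out of an instant-query response."""
    if prom_response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = prom_response["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each instant-vector sample looks like {"metric": {...}, "value": [timestamp, "number-as-string"]}.
    return float(result[0]["value"][1])


def probe_passes(p99_seconds: float, threshold_seconds: float = 0.5) -> bool:
    """Mirror of the '<=' comparator: pass only if p99 is at or under the threshold."""
    return p99_seconds <= threshold_seconds
```

One detail worth knowing: `histogram_quantile` can return `NaN` when the histogram has no observations in the window, and `NaN <= 0.5` is false, so in this sketch a service that stopped receiving traffic during the experiment fails the check instead of passing it silently.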

Chaos Maturity Model

Where is your organization on this spectrum? Be honest — most are at Level 1 or 2.

| Level | Description | What You're Doing | Next Step |
|---|---|---|---|
| 0: Reactive | You learn about failures from incidents | Nothing proactive. Post-mortems after outages | Run one manual chaos experiment in staging |
| 1: Ad Hoc | Occasional manual experiments | Someone kills a pod occasionally, watches what happens | Write proper hypotheses, use a chaos tool, document results |
| 2: Structured | Regular game days with proper process | Monthly game days, defined experiments, follow-up actions | Run experiments in production (with blast radius controls) |
| 3: Automated | Chaos experiments run continuously in CI/CD | Every deploy triggers resilience tests. Failures block releases | Expand scope: multi-AZ, dependency chains, data-layer chaos |
| 4: Cultural | Resilience is a design requirement, not an afterthought | Teams design for failure from day one. Chaos experiments are in every service's definition of done | Share learnings publicly, contribute to chaos engineering community |

What We Learned Breaking Our Own Systems

We run chaos experiments on our own cloud infrastructure and client environments. Some surprising findings:

  • Circuit breakers often aren't configured correctly. We found that 40% of the circuit breakers we tested either had thresholds too high (triggered after the system was already cascading) or reset too quickly (closed before the dependency actually recovered)
  • Health checks lie. A pod can pass its health check while being completely unable to serve traffic — because the health check endpoint doesn't exercise the same code path as real requests. Test health checks as part of your chaos experiments
  • Retry storms are the #1 cause of cascading failures. Service A fails, services B through G all retry simultaneously, overwhelming service A's recovery. Exponential backoff with jitter isn't optional — it's a requirement
  • DNS failures cause the widest blast radius. When we simulated DNS issues, more services broke than with any other failure type. And most teams had zero DNS resilience
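
Since exponential backoff with jitter comes up in the retry-storm finding, here is a minimal sketch of the "full jitter" variant: each retry waits a random time between zero and an exponentially growing, capped ceiling. The base delay, cap, and attempt count are illustrative defaults, and `retry` is a hypothetical helper written for this example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 10.0,
                  rng=random) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)]. attempt starts at 0."""
    return rng.uniform(0, min(cap_s, base_s * 2 ** attempt))


def retry(fn: Callable[[], T], max_attempts: int = 5,
          sleep: Callable[[float], None] = time.sleep, **backoff_kwargs) -> T:
    """Call fn until it succeeds or max_attempts is reached, sleeping between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, **backoff_kwargs))
    raise ValueError("max_attempts must be at least 1")
```

The randomness is the point. Without it, services B through G all retry at the same instants (100ms, 200ms, 400ms, and so on) and hit the recovering service in synchronized waves. Jitter spreads those retries across the interval.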

Frequently Asked Questions

Should we run chaos experiments in production?

Eventually, yes — production is the only environment that truly represents your system. But start in staging until you've built confidence and tooling. Move to production only when you have proper blast radius controls, automated rollback, and observability to detect impact immediately.

How do we convince management that breaking things on purpose is a good idea?

Frame it as risk reduction, not destruction. Show the cost of your last outage (revenue lost, engineering hours, customer trust). Then explain that a 30-minute planned experiment, run with the team ready and a rollback prepared, costs a small fraction of what the same issue costs when it is discovered during a real outage. Start with staging to build credibility.

What's the minimum we need before starting chaos engineering?

Observability (you need to see what's happening), a staging environment, and basic deployment automation. If you can't monitor your system or deploy changes quickly, fix those first. Chaos without observability is just breaking things blindly.

How often should we run chaos experiments?

Monthly game days for manual experiments. Continuously in CI/CD for automated resilience tests. After every major architecture change. The cadence matters less than consistency — a monthly game day you actually do beats a weekly one you keep postponing.

Pillai Infotech Engineering Team

We run chaos experiments on our own infrastructure and help clients build resilience practices. Our favourite finding: the system you think is fault-tolerant rarely is — until you've tested it.
