Chaos Engineering for Beginners: How to Break Your System Before Production Does


On This Page
What Chaos Engineering Is (and Is Not)
The Four-Step Experiment Loop
Step 1: Define Steady State
Step 2: Form a Hypothesis
Step 3: Inject Failure
Step 4: Observe and Learn
Five Experiments Every Team Should Start With
Experiment 1: Kill a Service Instance
Experiment 2: Inject Network Latency
Experiment 3: Fill the Disk
Experiment 4: DNS Failure
Experiment 5: Kill the Cache
Tools for Chaos Engineering
For Kubernetes Environments
For Cloud Environments
For Simple Starts
GameDays: Making Chaos Engineering a Team Practice
The Safety Checklist
How This Shows Up in System Design Interviews
Common Mistakes
Conclusion: Key Takeaways
What This Blog Covers
- What chaos engineering is and why it matters
- The four-step experiment loop
- Five experiments every team should start with
- Tools for running chaos experiments
- How to discuss chaos engineering in system design interviews
It is 3 AM. Your payment service is down.
The on-call engineer checks the dashboard: the database is healthy, the application servers are running, CPU and memory look fine.
Everything looks green. But the service is not responding.
After 45 minutes of investigation, the engineer discovers that the connection pool between the payment service and the database is exhausted.
A downstream service that normally responds in 10ms started responding in 5 seconds due to a slow query, and the payment service's threads are all waiting for responses.
The connection pool filled up, new requests cannot get a connection, and the entire service is effectively dead while every individual component reports healthy.
This failure mode was completely predictable.
But no one tested for it because traditional testing only validates that things work when everything is working. It does not validate what happens when one thing is slow, one thing is unreachable, or one thing returns unexpected errors.
Chaos engineering fills this gap. It is the practice of deliberately injecting failures into a system to discover weaknesses before they cause real outages.
Instead of waiting for 3 AM to find out that your connection pool does not handle slow downstream services, you simulate a slow downstream service at 2 PM on a Tuesday, observe the behavior, and fix the vulnerability before it becomes an incident.
This guide covers the methodology, the experiments you should start with, the tools available, and how chaos engineering shows up in system design interviews.
What Chaos Engineering Is (and Is Not)
Chaos engineering is not randomly breaking things to see what happens. It is a disciplined, scientific approach to building confidence in your system's resilience.
The discipline was pioneered by Netflix in 2010 when they built Chaos Monkey, a tool that randomly terminated production server instances during business hours.
The reasoning was simple: if Netflix's streaming service could survive random server failures, it could survive the real failures that inevitably happen in cloud environments.
The key distinction is between chaos engineering and breaking things.
Breaking things is uncontrolled destruction with no hypothesis and no measurement.
Chaos engineering is a controlled experiment with a specific hypothesis ("our system should survive the loss of one database replica"), a controlled injection (terminate one replica), careful observation (monitor error rate, latency, and user impact), and a concrete outcome (either the hypothesis holds, or you found a weakness to fix).
For understanding the resilience patterns that chaos engineering validates, A Beginner's Guide to Retry, Circuit Breaker, and Timeout Patterns covers the patterns you are testing when you run chaos experiments.
The Four-Step Experiment Loop
Every chaos experiment follows the same four-step process. This is the scientific method applied to system reliability.
Step 1: Define Steady State
Before you can test what happens during a failure, you need to define what "normal" looks like.
Steady state is measured by the metrics that indicate your system is healthy: P99 latency under 200ms, error rate under 0.1%, throughput at 5,000 requests per second, all health checks passing.
Document these baseline metrics before running any experiment.
If you do not know what normal looks like, you cannot detect when something is abnormal.
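To make this concrete, steady state can be captured in code as a set of baseline limits that an experiment is checked against. A minimal Python sketch (the metric names and thresholds here are illustrative, not prescriptions):

```python
# Steady state: the metrics that define "normal" for this system.
# Names and limits are illustrative placeholders for your own baselines.
BASELINE = {
    "p99_latency_ms": 200,  # P99 latency must stay under 200ms
    "error_rate": 0.001,    # error rate must stay under 0.1%
}

def within_steady_state(current: dict) -> bool:
    """Return True if every observed metric is within its baseline limit."""
    return all(current[name] <= limit for name, limit in BASELINE.items())
```

During an experiment, feeding live metrics into `within_steady_state` tells you immediately whether the system has left its documented normal.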
Step 2: Form a Hypothesis
A chaos experiment starts with a hypothesis about what should happen during a failure.
Good hypotheses are specific and testable.
"When we terminate one of three application servers, the load balancer should route traffic to the remaining two. P99 latency should increase by no more than 50%, and error rate should stay below 1%."
"When we inject 500ms of network latency between the API service and the database, the circuit breaker should open within 10 seconds and the API should serve cached responses. User-facing error rate should remain below 0.5%."
"When we kill the Redis cache, the system should fall back to database reads. Latency should increase but the system should remain functional. No data should be lost."
A bad hypothesis is vague: "The system should handle failures gracefully."
What does "gracefully" mean?
What failure?
What metrics define success?
Without specifics, you cannot determine whether the experiment passed or failed.
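A specific hypothesis can even be encoded as data, so that pass/fail is mechanical rather than a judgment call. A hedged Python sketch using the first hypothesis above (the field names and numbers are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """A testable chaos hypothesis: a description plus explicit pass criteria."""
    description: str
    max_p99_latency_ms: float  # latency ceiling allowed during the experiment
    max_error_rate: float      # error-rate ceiling allowed during the experiment

def evaluate(h: Hypothesis, observed_p99_ms: float, observed_error_rate: float) -> bool:
    """Confirmed only if observed metrics stayed within the hypothesis bounds."""
    return (observed_p99_ms <= h.max_p99_latency_ms
            and observed_error_rate <= h.max_error_rate)

kill_one_server = Hypothesis(
    description="Terminate 1 of 3 app servers; latency rises <= 50%, errors < 1%",
    max_p99_latency_ms=300,  # 200ms baseline plus the allowed 50% increase
    max_error_rate=0.01,
)
```

The vague hypothesis ("handle failures gracefully") cannot be written this way, which is exactly the problem with it.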
Step 3: Inject Failure
Run the experiment by introducing the specific failure described in your hypothesis. This is where chaos engineering tools come in (covered in the tools section below).
The critical safety rules during injection:
- Limit the blast radius: affect one server, not all servers.
- Have an abort mechanism: a kill switch that stops the experiment immediately.
- Inform the team: everyone on-call should know an experiment is running.
- Start in staging before moving to production.
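The abort mechanism is worth sketching. One minimal shape, in Python, is an experiment runner that wraps the injection with a hard time limit, a kill switch, and a guaranteed rollback (the `inject` and `rollback` callables are placeholders for your actual injection and cleanup steps):

```python
import threading
import time

abort = threading.Event()  # the kill switch: calling abort.set() stops the experiment

def run_experiment(inject, rollback, duration_s: float, poll_s: float = 0.1) -> str:
    """Run one failure injection with a time limit and an abort kill switch.

    The rollback always runs, whether the experiment completes, is aborted,
    or raises - the system must return to normal no matter what.
    """
    inject()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if abort.is_set():  # kill switch pressed: stop immediately
                return "aborted"
            time.sleep(poll_s)
        return "completed"
    finally:
        rollback()  # always undo the injection
```

The point of the sketch is the shape, not the details: a single call (`abort.set()`) ends the experiment, and cleanup is unconditional.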
Step 4: Observe and Learn
Compare the system's actual behavior during the experiment to your hypothesis.
Three outcomes are possible.
- Hypothesis confirmed: The system behaved as expected. You have increased confidence in your resilience. Document the result and move on to the next experiment.
- Hypothesis falsified: The system did not behave as expected. You found a weakness. This is the most valuable outcome. Fix the weakness, then re-run the experiment to verify the fix.
- Unexpected behavior: Something happened that you did not predict at all. The payment service crashed, but the monitoring system also went down because it depended on the same database. This is the category of discovery that only chaos engineering can provide, because you could not have written a test for a failure mode you did not know existed.
Five Experiments Every Team Should Start With
These five experiments cover the most common failure modes in distributed systems.
Start with experiment 1 and progress in order.
Experiment 1: Kill a Service Instance
What you test: Does the load balancer correctly route traffic away from a dead instance? Does the system maintain availability with reduced capacity?
How to run it: Terminate one application server instance. Monitor traffic distribution, error rate, and latency for the next 5 minutes.
What you expect: Traffic redistributes to healthy instances within seconds. Error rate stays below 1%. Latency increases proportionally to the reduced capacity.
Common failure found: The load balancer's health check interval is too long (30 seconds), causing requests to be routed to the dead instance for up to 30 seconds after it crashes. Fix: reduce health check interval to 5 seconds with a 2-check failure threshold.
Experiment 2: Inject Network Latency
What you test: Do timeouts, circuit breakers, and retry policies work correctly when a downstream service becomes slow?
How to run it: Add 500ms to 2 seconds of latency to all traffic between two services. Monitor the calling service's behavior.
What you expect: The circuit breaker opens after detecting sustained latency. The calling service returns degraded responses (cached data, partial results) instead of timing out.
Common failure found: No circuit breaker is configured. The calling service waits for the full timeout (often 30 seconds) on every request, consuming thread pool resources and eventually becoming unresponsive itself. This is how a slow service causes a cascading failure. For understanding this cascading failure pattern in depth, System Design Interview: How to Prevent Cascading Failures explains the mechanics.
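The missing piece in that common failure is a circuit breaker. A deliberately minimal Python sketch of the idea (production breakers also add a half-open state that retries the dependency after a cooldown, which is omitted here):

```python
class CircuitBreaker:
    """Minimal circuit breaker: opens after N consecutive failures.

    While open, calls fail fast to the fallback instead of tying up a
    thread waiting on a slow or dead dependency.
    """

    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.state = "closed"

    def call(self, func, fallback):
        if self.state == "open":
            return fallback()  # fail fast: serve the degraded response
        try:
            result = func()
            self.consecutive_failures = 0  # success resets the failure count
            return result
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.state = "open"  # stop calling the failing dependency
            return fallback()
```

A latency-injection experiment validates exactly this behavior: the breaker should open within your hypothesis window, and callers should receive the fallback rather than timeouts.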
Experiment 3: Fill the Disk
What you test: Does the application handle disk exhaustion gracefully? Do alerts fire before the disk is completely full?
How to run it: Write large files to fill the disk to 95% capacity on one server. Monitor application behavior and alerting.
What you expect: Monitoring alerts fire at 80% disk usage. The application logs an error when writes fail but does not crash. Other services are not affected.
Common failure found: The application crashes with an unhandled exception when a log file write fails. No alert exists for disk usage, so the team only learns about it when the application goes down. Fix: handle write errors gracefully, add disk usage alerting at 80%, and implement log rotation.
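Both halves of that fix are small. A Python sketch of graceful write handling plus a disk-usage check for alerting (the 80% threshold matches the expectation above; the paths are examples):

```python
import logging
import shutil

ALERT_THRESHOLD = 0.80  # alert when the disk is 80% full

def disk_usage_fraction(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def append_log_line(path: str, line: str) -> bool:
    """Write a log line; on failure (e.g. disk full), record the error and
    carry on instead of letting the exception crash the process."""
    try:
        with open(path, "a") as f:
            f.write(line + "\n")
        return True
    except OSError as exc:
        logging.error("log write failed, continuing without it: %s", exc)
        return False
```

An alerting loop would simply fire when `disk_usage_fraction()` crosses `ALERT_THRESHOLD`, well before writes start failing.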
Experiment 4: DNS Failure
What you test: What happens when DNS resolution fails for a downstream service? Does the application use cached DNS entries? Does it retry with backoff?
How to run it: Block DNS resolution for one downstream service for 30 seconds. Monitor the calling service's behavior.
What you expect: The application uses cached DNS entries for the duration of the failure. If no cached entries exist, it retries with exponential backoff and returns a degraded response after exhausting retries.
Common failure found: The application has no DNS caching and makes a fresh DNS lookup for every request. When DNS fails, every request fails immediately. Fix: configure local DNS caching with a reasonable TTL and implement retry logic for DNS resolution failures.
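That fix can be sketched as a small TTL cache in front of the resolver, with a stale-entry fallback when DNS is down. A hedged Python sketch (the `_lookup` parameter exists only to make the sketch testable; a real implementation would also bound the cache and handle concurrency):

```python
import socket
import time

_cache = {}  # hostname -> (ip, expiry timestamp)
TTL_SECONDS = 60.0

def resolve(hostname, now=None, _lookup=socket.gethostbyname):
    """Resolve with a local TTL cache; fall back to a stale cached entry
    if the live DNS lookup fails, so a DNS outage degrades instead of
    failing every request."""
    now = time.monotonic() if now is None else now
    cached = _cache.get(hostname)
    if cached and cached[1] > now:
        return cached[0]  # fresh cache hit: no DNS query at all
    try:
        ip = _lookup(hostname)
        _cache[hostname] = (ip, now + TTL_SECONDS)
        return ip
    except OSError:
        if cached:
            return cached[0]  # DNS down: serve the stale entry
        raise  # nothing cached to fall back to
```

With this in place, the 30-second DNS blackout in the experiment should be invisible to callers for any hostname resolved within the cache window.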
Experiment 5: Kill the Cache
What you test: What happens when Redis or Memcached goes down? Does the application fall back to database reads? Does the database handle the sudden load increase?
How to run it: Terminate the Redis instance. Monitor application latency, database load, and error rate.
What you expect: The application detects the cache failure and falls back to direct database reads. Latency increases but the system remains functional. No data is lost.
Common failure found: The application treats cache failures as hard errors and returns 500 to all users, even though the data exists in the database. Or the database is overwhelmed by the sudden traffic that the cache was absorbing, leading to a cascading failure. For understanding the broader set of reliability techniques to prevent these cascading issues, 8 Techniques for Building Reliable Distributed Systems covers the resilience patterns.
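The correct fallback behavior is a cache-aside read that treats the cache as optional rather than as a hard dependency. A minimal Python sketch (`cache_get` and `db_get` stand in for your real Redis and database clients, and the exception type will depend on your client library):

```python
def read_user(user_id, cache_get, db_get):
    """Cache-aside read that survives a cache outage.

    If the cache errors out, fall back to the database instead of
    failing the request: slower, but the user still gets an answer.
    """
    try:
        cached = cache_get(user_id)
        if cached is not None:
            return cached
    except ConnectionError:
        pass  # cache down: degrade to direct DB reads
    return db_get(user_id)
```

Note that this sketch addresses only the first failure mode; protecting the database from the sudden load the cache was absorbing still requires rate limiting or load shedding on the database path.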
Tools for Chaos Engineering
For Kubernetes Environments
Chaos Mesh is a CNCF project that runs natively on Kubernetes. It can inject pod failures, network delays, DNS errors, I/O faults, and kernel-level chaos. Configuration is done through Kubernetes custom resources, making it easy to integrate with GitOps workflows.
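To give a feel for the configuration style, here is a hedged sketch of a Chaos Mesh PodChaos resource that kills one pod matching a label selector (the names, namespaces, and labels are illustrative; check the Chaos Mesh documentation for the fields your version supports):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-one-payment-pod     # example name
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                      # blast radius: exactly one matching pod
  selector:
    namespaces:
      - default
    labelSelectors:
      app: payment-service       # example label
```

Because experiments are plain Kubernetes resources, they can be version-controlled and applied through the same GitOps workflow as the rest of your manifests.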
LitmusChaos is another CNCF project with a large library of pre-built experiments. It includes a central control plane for managing experiments across clusters and has strong integration with CI/CD pipelines.
For Cloud Environments
AWS Fault Injection Simulator (FIS) integrates directly with AWS services. It can stop EC2 instances, throttle API calls, inject latency into network traffic, and simulate AZ failures. The tight AWS integration makes it the easiest option for AWS-heavy teams.
Gremlin is a commercial platform that works across cloud providers and on-premises infrastructure. It provides a SaaS interface with pre-built attack types, safety controls, and team collaboration features. It is the most beginner-friendly option for teams new to chaos engineering.
For Simple Starts
tc (traffic control) is a Linux command-line tool that can add network latency, packet loss, and bandwidth limits. No installation required. It is the simplest way to start experimenting with network chaos on a single server.
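For example, the following commands (run as root; `eth0` is a placeholder for your actual interface) add, verify, and then remove 500ms of latency using the netem qdisc:

```shell
# Add 500ms of latency to all outbound traffic on eth0 (requires root).
tc qdisc add dev eth0 root netem delay 500ms

# Verify the qdisc is in place.
tc qdisc show dev eth0

# Remove it to end the experiment.
tc qdisc del dev eth0 root netem
```

The delete command doubles as the abort mechanism, which is exactly the property the safety checklist below asks for.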
kill and docker stop are the simplest chaos tools. Terminate a process or container and watch what happens. No framework needed. This is a perfectly valid way to run your first chaos experiment.
GameDays: Making Chaos Engineering a Team Practice
A GameDay is a structured event where the team runs chaos experiments together, typically lasting 2 to 4 hours. It is the most effective way to build chaos engineering culture and discover cross-team failure modes.
A typical GameDay follows this structure.
The team gathers (in person or virtually) and reviews the day's objectives: which systems are being tested, what hypotheses are being validated, and who is responsible for what.
The coordinator runs the experiments one at a time, with observers monitoring dashboards and taking notes. After each experiment, the team debriefs: what happened, did it match the hypothesis, and what needs to be fixed.
GameDays are valuable because they expose failure modes that individual engineers miss.
The database team might know that failover takes 30 seconds, but the application team might not know that their connection pool does not retry after a connection drop.
The cross-team observation reveals gaps that exist between teams, not within them.
Run GameDays quarterly at minimum. Schedule them during business hours, not at night.
The goal is to build confidence and discover issues, not to create heroic midnight debugging sessions. Document every finding and track fixes as engineering tickets with deadlines.
The Safety Checklist
Before running any chaos experiment, verify these conditions.
- Observability is in place: You cannot learn from an experiment you cannot observe. Ensure you have dashboards showing latency, error rate, and throughput for every service involved. Distributed tracing should be operational so you can follow the impact of the failure through the service chain.
- Blast radius is limited: Start with a single instance, a single service, or a single availability zone. Never inject failures across your entire system in your first experiments. Expand the blast radius only after you have confidence in your safety mechanisms.
- Abort mechanism exists: Every experiment needs a kill switch that immediately stops the injection and returns the system to normal. This should be a single button or command, not a multi-step procedure.
- The team is informed: Everyone on-call should know that a chaos experiment is running, what the expected behavior is, and who to contact if something unexpected happens. Surprise chaos experiments at 3 AM are not chaos engineering. They are chaos.
- Start in staging: Run experiments in a non-production environment first. Move to production only when you are confident in your safety mechanisms and when you have validated the experiment's behavior in staging. For understanding the broader framework for preventing the failures that chaos engineering uncovers, System Design Interview Guide: How to Prevent Cascading Failures covers the architectural patterns.
How This Shows Up in System Design Interviews
Chaos engineering comes up when interviewers ask about reliability, fault tolerance, or how you would ensure a system handles failures. Mentioning it proactively signals operational maturity.
Here is how to present it:
"For reliability, I would implement chaos engineering practices. Starting in staging, I would run experiments that kill individual service instances to verify load balancer failover, inject network latency between services to validate circuit breakers and timeouts, and terminate the cache to ensure graceful fallback to the database. Each experiment has a specific hypothesis, for example: 'When we add 500ms of latency to the inventory service, the order service's circuit breaker should open within 10 seconds and return cached inventory data.' I would use Chaos Mesh on Kubernetes and run these experiments weekly, expanding to production once the team has confidence in the safety mechanisms."
That answer covers the methodology (experiments with hypotheses), specific experiment types (instance kill, latency injection, cache failure), the expected behavior (circuit breaker, graceful fallback), the tooling (Chaos Mesh), and the cadence (weekly, staging first).
For beginners, the System Design Fundamentals course covers the underlying concepts.
Common Mistakes
- Running chaos experiments without observability: If you inject a failure but cannot see how the system responds (because you lack dashboards, tracing, or alerting), the experiment teaches you nothing. Observability comes first. Chaos engineering comes second.
- Starting in production on day one: Your first chaos experiment should be in a staging environment where the consequences of unexpected behavior are limited. Move to production after you have validated your safety mechanisms and built team confidence.
- No hypothesis: "Let's kill a server and see what happens" is not chaos engineering. It is just breaking things. Every experiment needs a specific, testable hypothesis that defines what success looks like. Without a hypothesis, you cannot determine whether the experiment passed or failed.
- Injecting too many failures at once: Kill one instance, not five. Inject latency on one network path, not all paths. If you inject multiple failures simultaneously, you cannot determine which failure caused which behavior. Isolate variables, just like any scientific experiment.
- Not fixing what you find: The most expensive chaos engineering mistake is running experiments, discovering weaknesses, and not fixing them. A chaos experiment that reveals a missing circuit breaker is valuable only if you actually implement the circuit breaker. Track findings as engineering tickets and prioritize them alongside feature work.
- Treating chaos engineering as a one-time activity: Systems change. New services are added, configurations are updated, and dependencies evolve. An experiment that passed three months ago might fail today because a timeout was changed or a new dependency was introduced. Run experiments regularly, at least monthly, and re-run them after significant deployments. The most mature teams integrate chaos experiments into their CI/CD pipeline, running a basic suite of experiments on every release to catch regressions automatically. This continuous validation is what separates teams that are resilient by design from teams that are resilient by luck.
Conclusion: Key Takeaways
- Chaos engineering is controlled, scientific experimentation, not random destruction. Every experiment has a hypothesis, a controlled injection, careful observation, and a concrete outcome.
- Follow the four-step loop: define steady state, form hypothesis, inject failure, observe and learn. This process is repeatable and works for any failure mode.
- Start with five foundational experiments. Kill an instance, inject latency, fill the disk, break DNS, and kill the cache. These cover the most common failure modes in distributed systems.
- Safety is non-negotiable. Limit blast radius, have an abort mechanism, inform the team, and start in staging. Chaos engineering that causes uncontrolled outages is a failure of the practice, not a success.
- Observability must come before chaos engineering. You cannot learn from experiments you cannot observe. Ensure dashboards, tracing, and alerting are operational before injecting failures.
- In interviews, describe specific experiments with specific hypotheses. Naming the failure injection, the expected behavior, and the tooling demonstrates practical experience with reliability engineering.