Chaos Engineering @ Velocity
Talk given by @Ana_M_Medina from Gremlin.
What is Chaos Engineering?
Thoughtful, planned experiments designed to reveal the weaknesses in our systems.
The idea is to inject something harmful, in a controlled way, in order to build immunity.
Why do it?
- Downtime is really expensive
- Our dependencies will fail
- Pager fatigue
- Reveal weak points in your systems
- Dashboard and metrics (New Relic, Kibana)
- What is our cost per hour of outage?
These experiments follow four steps:
- Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
- Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
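The four steps above can be sketched roughly as follows. This is a minimal illustration, not something shown in the talk: `run_experiment`, `measure`, `inject_fault`, `rollback`, and the `tolerance` threshold are all hypothetical names and values, and a real setup would read the steady-state metric from a monitoring system.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass
class ExperimentResult:
    control_mean: float
    experiment_mean: float
    hypothesis_holds: bool


def run_experiment(
    measure: Callable[[str], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    samples: int = 20,
    tolerance: float = 0.05,
) -> ExperimentResult:
    """Compare a steady-state metric between a control and an experimental group.

    measure(group) returns one reading of the steady-state metric (step 1),
    for example a success rate, for the group "control" or "experiment".
    The hypothesis (step 2) is that both groups stay within `tolerance`.
    """
    # Step 3: introduce the real-world event into the experimental group.
    inject_fault()
    try:
        control = [measure("control") for _ in range(samples)]
        experiment = [measure("experiment") for _ in range(samples)]
    finally:
        # Always undo the fault, even if measuring raised an error.
        rollback()

    control_mean = mean(control)
    experiment_mean = mean(experiment)
    # Step 4: the hypothesis is disproved if the two groups diverge.
    holds = abs(control_mean - experiment_mean) <= tolerance
    return ExperimentResult(control_mean, experiment_mean, holds)
```

If `hypothesis_holds` comes back false, the experiment has found a weakness worth fixing. The rollback in `finally` is a design choice in this sketch so that a failed measurement never leaves the fault in place.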
Don’t approach it with a random strategy. Instead, approach it like a scientific experiment: thoughtful and planned.
What experiments can you run?
- Reproduce outage conditions
- Unpredictable circumstances
- Large traffic spikes
- Race conditions
- Datacenter failure
- Time travel: forcing system clocks out of sync
- Network errors
- CPU overloads
However, make sure to restrict the blast radius of each experiment.
After each experiment, repeat it to confirm that we are resilient to that failure.
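One way to picture restricting the blast radius is to cap how much of the fleet an experiment may touch. The sketch below is an assumption for illustration only: `pick_targets`, `escalate`, the host names, and the fraction values are hypothetical, and widening the radius over repeated runs is one possible way to apply the advice, not something stated in the talk.

```python
import random
from typing import Callable, List, Optional, Sequence


def pick_targets(
    hosts: Sequence[str],
    max_fraction: float = 0.1,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a random subset of hosts sized by max_fraction of the fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    if not hosts:
        return []
    # At least one host so the experiment can run at all; for larger
    # fleets the count is the fraction of the fleet, rounded down.
    count = max(1, int(len(hosts) * max_fraction))
    rng = random.Random(seed)
    return rng.sample(list(hosts), count)


def escalate(
    hosts: Sequence[str],
    run_on: Callable[[List[str]], bool],
    fractions: Sequence[float] = (0.05, 0.25, 1.0),
) -> Optional[float]:
    """Repeat an experiment with a growing blast radius.

    run_on(targets) runs the experiment and returns True if the system
    stayed healthy. Returns the first fraction at which it did not, or
    None if every run passed.
    """
    for fraction in fractions:
        targets = pick_targets(hosts, fraction)
        if not run_on(targets):
            return fraction  # stop widening: a weakness was found here
    return None
```

Starting small means a surprising failure affects only a handful of hosts, and the same experiment can be rerun at the same radius once a fix is in place.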
Tools that could be used:
- Chaos Monkey: https://github.com/Netflix/chaosmonkey
- Simian Army: https://github.com/Netflix/SimianArmy
- Litmus: github.com/openebs/Litmus
- PowerfulSeal: https://github.com/bloomberg/powerfulseal
An interesting link shared during the talk was github.com/danluu/post-mortems, a repo that collects post-mortems of outages at large companies.