Chaos Engineering @ Velocity

Wednesday, October 31, 2018

Talk given by @Ana_M_Medina from Gremlin.

What is Chaos Engineering?

Thoughtful, planned experiments designed to reveal the weaknesses in our systems.

The idea is to inject something harmful in order to build an immunity.

These experiments follow four steps:

1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
2. Hypothesize that this steady state will continue in both the control group and the experimental group.
3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.
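The four steps can be sketched as a small simulation. This is only an illustration and was not shown in the talk: the function names, the choice of request success rate as the steady-state metric, and the tolerance value are all assumptions, and the "system" here is a random number generator rather than a real service.

```python
import random


def simulate_requests(n, failure_rate, rng):
    """Simulate n requests; return 1 for each success and 0 for each failure."""
    return [0 if rng.random() < failure_rate else 1 for _ in range(n)]


def success_rate(results):
    """Step 1: the steady-state metric (assumed here to be request success rate)."""
    return sum(results) / len(results)


def run_experiment(baseline_failure_rate, injected_failure_rate,
                   tolerance=0.02, n=10_000, seed=0):
    """Run a simulated experiment and report whether steady state held.

    Step 2 (the hypothesis): the success rate of the experimental group
    stays within `tolerance` of the control group.
    """
    rng = random.Random(seed)

    # Control group: no fault injected.
    control = simulate_requests(n, baseline_failure_rate, rng)

    # Step 3: the experimental group sees an extra, injected failure rate
    # (a stand-in for crashed servers or severed network connections).
    experimental = simulate_requests(
        n, baseline_failure_rate + injected_failure_rate, rng
    )

    # Step 4: try to disprove the hypothesis by comparing the two groups.
    difference = abs(success_rate(control) - success_rate(experimental))
    return {
        "control": success_rate(control),
        "experimental": success_rate(experimental),
        "difference": difference,
        "hypothesis_holds": difference <= tolerance,
    }


if __name__ == "__main__":
    print(run_experiment(baseline_failure_rate=0.01, injected_failure_rate=0.20))
```

In a real experiment the two groups would be live traffic or hosts, and the metric would come from monitoring rather than from a simulation.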

Don’t approach it with a random strategy. Instead, approach it like a scientific experiment: thoughtful and planned.

However, make sure to restrict the blast radius of each experiment.
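One way to picture a restricted blast radius is to cap the share of hosts an experiment may touch. The sketch below is an assumption for illustration, not something from the talk or from any particular tool: `pick_targets`, the host names, and the `max_fraction` parameter are all hypothetical.

```python
import math
import random


def pick_targets(hosts, max_fraction, seed=None):
    """Choose a random subset of hosts, capped at max_fraction of the fleet.

    At least one host is selected when the fleet is non-empty, so that the
    experiment can still run on a very small fleet.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in the range (0, 1]")
    if not hosts:
        return []
    rng = random.Random(seed)
    limit = max(1, math.floor(len(hosts) * max_fraction))
    return rng.sample(hosts, limit)


if __name__ == "__main__":
    fleet = [f"host-{i:02d}" for i in range(20)]  # hypothetical host names
    # Start small: at most a quarter of the fleet is exposed to the fault.
    print(pick_targets(fleet, max_fraction=0.25, seed=42))
```

Starting with a small fraction and widening it only after the system has held up keeps a failed experiment from becoming a real outage.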

And after each experiment, repeat it to confirm that we are now resilient to that failure.

Tools that could be used:

An interesting link shared during the talk was github.com/danluu/post-mortems, a repo collecting post-mortems of outages at large companies.