Chaos Engineering @ Velocity

Talk given by @Ana_M_Medina from Gremlin.

What is Chaos Engineering?

Thoughtful, planned experiments designed to reveal weaknesses in our systems.

The idea is to inject something harmful, in a controlled way, in order to build immunity.

Why do it?

Prerequisites

  1. Dashboard and metrics (New Relic, Kibana)
  2. Alerting
  3. A known cost per hour of outage

Principles of Chaos Engineering

These experiments follow four steps:

  1. Start by defining ‘steady state’ as some measurable output of a system that indicates normal behavior.
  2. Hypothesize that this steady state will continue in both the control group and the experimental group.
  3. Introduce variables that reflect real world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.
  4. Try to disprove the hypothesis by looking for a difference in steady state between the control group and the experimental group.

Don’t approach it with a random strategy. Approach it like a scientific experiment: thoughtful and planned.
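The four steps above can be sketched in code. This is only a minimal illustration, not something shown in the talk: the `run_experiment` helper, the `tolerance` threshold, and the two simulated error-rate functions are all hypothetical stand-ins for real measurements taken from your dashboards.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    control_mean: float
    experiment_mean: float
    hypothesis_holds: bool


def run_experiment(
    measure_control: Callable[[], float],
    measure_experiment: Callable[[], float],
    samples: int = 5,
    tolerance: float = 0.01,
) -> ExperimentResult:
    """Compare a steady-state metric between a control and an experimental group.

    The hypothesis is that the metric stays the same (within `tolerance`)
    even though a fault has been introduced in the experimental group.
    """
    control = [measure_control() for _ in range(samples)]
    experiment = [measure_experiment() for _ in range(samples)]
    control_mean = sum(control) / len(control)
    experiment_mean = sum(experiment) / len(experiment)
    # Step 4: try to disprove the hypothesis by looking for a difference.
    holds = abs(experiment_mean - control_mean) <= tolerance
    return ExperimentResult(control_mean, experiment_mean, holds)


# Step 1: steady state, defined here (as an assumption) as the request error rate.
def healthy_error_rate() -> float:
    return 0.002  # simulated value, not real data


# Step 3: simulated fault. Assume one of four replicas has crashed and
# there are no retries, so roughly a quarter of requests fail.
def error_rate_with_crashed_replica() -> float:
    return 0.25  # simulated value, not real data


result = run_experiment(healthy_error_rate, error_rate_with_crashed_replica)
if result.hypothesis_holds:
    print("Steady state held: the system tolerated the fault.")
else:
    print("Steady state broke: we found a weakness to fix.")
```

In a real experiment the two measurement functions would query your metrics system, and the tolerance would come from what you consider normal variation.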

What experiments can you run?

Whichever experiments we run, we should restrict their blast radius.

After each experiment, rerun it to check whether we are now resilient to that failure.
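One simple way to restrict the blast radius is to cap how many hosts an experiment is allowed to touch. The sketch below is an assumption of how that might look, not a feature of any particular tool: `pick_targets`, `max_percent`, and the host names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def pick_targets(
    hosts: Sequence[str],
    max_percent: int = 10,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick a random subset of hosts, capped at `max_percent` of the fleet.

    At least one host is always chosen when the fleet is non-empty,
    so that a small fleet can still be tested.
    """
    if not 0 < max_percent <= 100:
        raise ValueError("max_percent must be between 1 and 100")
    if not hosts:
        return []
    count = max(1, len(hosts) * max_percent // 100)
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(list(hosts), count)


fleet = [f"web-{i}" for i in range(20)]  # hypothetical host names
targets = pick_targets(fleet, max_percent=10, seed=1)
print(f"Injecting fault into {len(targets)} of {len(fleet)} hosts: {targets}")
```

Starting with a small percentage and raising it only after the system has proven resilient keeps a failed experiment from turning into a real outage.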

Tools

Tools that could be used:

Other

An interesting link shared during the talk was github.com/danluu/post-mortems, a repo collecting post-mortems of outages at large companies.
