Chaos testing: In today’s world, the focus is rapidly shifting from MTTF (mean time to failure) to MTTR (mean time to recovery). As a simple example, which would you prefer: a system that goes down several times a day but recovers so quickly that the user experience stays seamless, or one that goes down only once every few months but stays completely down for a few hours?
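To put rough numbers on that comparison, here is a minimal sketch using the standard steady-state availability formula, MTTF / (MTTF + MTTR). The outage counts and durations are assumed purely for illustration and are not measurements from any real system.

```python
def availability(mttf_seconds: float, mttr_seconds: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_seconds / (mttf_seconds + mttr_seconds)


DAY = 24 * 60 * 60  # seconds in a day

# Assumed scenario A: 4 outages a day, each recovered in 2 seconds.
cycle_a = DAY / 4
frequent = availability(mttf_seconds=cycle_a - 2, mttr_seconds=2)

# Assumed scenario B: 1 outage every 90 days, lasting 3 hours.
cycle_b = 90 * DAY
rare = availability(mttf_seconds=cycle_b - 3 * 3600, mttr_seconds=3 * 3600)

print(f"frequent but fast recovery: {frequent:.5%}")
print(f"rare but slow recovery:     {rare:.5%}")
```

Under these assumed numbers, the first system is down for about 12 minutes over 90 days, while the second is down for 3 hours, which is why fast recovery can matter more than rare failure.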
What is chaos testing?
Chaos testing involves deliberately crashing the system by simulating failures in a controlled environment. This helps DevOps engineers identify single points of failure and other weak spots in the system and resolve them proactively, often by means of automated recovery mechanisms. The simulation is typically done using chaos monkeys.
What are chaos monkeys?
Chaos monkeys are tools that disrupt the system in multiple ways. Unleashed in the middle of a business day, they simulate incidents such as network failures and node failures. All of this is done in a controlled environment with a well-defined perimeter, so that customer impact is minimized.
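As a rough illustration of the idea, the sketch below models a chaos monkey that terminates one random running instance, but only inside an explicitly allowed perimeter. The `Instance` class, the group names and the `unleash_chaos_monkey` function are hypothetical and exist only for this example; a real tool would call your infrastructure's own API instead of flipping a flag.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Instance:
    name: str
    group: str            # hypothetical service group the instance belongs to
    running: bool = True


def unleash_chaos_monkey(
    instances: List[Instance],
    perimeter: Set[str],
    rng: Optional[random.Random] = None,
) -> Optional[Instance]:
    """Stop one random running instance whose group is inside the perimeter."""
    rng = rng or random.Random()
    candidates = [i for i in instances if i.running and i.group in perimeter]
    if not candidates:
        return None  # nothing eligible, so nothing is disrupted
    victim = rng.choice(candidates)
    victim.running = False  # stand-in for a real "terminate" call
    return victim


fleet = [
    Instance("web-1", "checkout"),
    Instance("web-2", "checkout"),
    Instance("db-1", "billing"),
]
victim = unleash_chaos_monkey(fleet, perimeter={"checkout"}, rng=random.Random(42))
print(f"terminated: {victim.name}")
```

Restricting the candidates to the perimeter is what keeps the experiment controlled: instances outside it (here, `db-1`) can never be selected.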
How are chaos monkeys classified?
Chaos monkeys are classified by their capabilities as follows:
- Latency monkey: Latency monkeys simulate delays in node-server communication, slowing the network down and triggering response measures. With larger delays, they simulate server-down conditions without actually bringing the server down.
- Conformity monkey: Conformity monkeys shut down every program that does not adhere to best practices. This allows the DevOps team to identify the program and restructure it.
- Janitor monkey: Janitor monkeys optimize storage by clearing out unused information, thus freeing up space.
- Security monkey: Security monkeys work analogously to conformity monkeys, terminating all instances that cause violations or vulnerabilities. They also make sure that security certificates are valid and renewed.
- 10-18 monkey: 10-18 monkeys simulate run-time issues in systems that serve multiple geographies.
- Chaos gorilla: Chaos gorillas are similar to chaos monkeys, but the scale of the outage they cause is far larger.
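To make one of these categories concrete, here is a minimal sketch of latency injection in the spirit of a latency monkey: a decorator that delays a call before letting it proceed. The decorator name, its parameters and the `fetch_profile` function are assumptions made for this example, not the interface of any real chaos tool.

```python
import functools
import random
import time


def latency_monkey(delay_seconds, probability=1.0, rng=None, sleep=time.sleep):
    """Delay the wrapped call by delay_seconds with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0, 1), so probability=1.0 always delays.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Record the delays instead of really sleeping, so the demo runs instantly.
injected_delays = []


@latency_monkey(delay_seconds=0.5, sleep=injected_delays.append)
def fetch_profile(user_id):
    """Hypothetical remote call that the monkey slows down."""
    return {"id": user_id}


profile = fetch_profile(7)
print(profile, injected_delays)
```

Passing the `sleep` function in as a parameter is a design choice for testability; in a real experiment you would leave the default `time.sleep` and raise `delay_seconds` to approximate a server-down condition.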
How are chaos tests conducted?
To conduct a chaos test:
- Monitor the system in its resting state with several tools, and use their outputs to define that steady state.
- Form the hypothesis that the system will hold up under disruption.
- Fix the perimeter so that customer impact is minimal.
- Initiate the disruptions through chaos monkeys. The monkeys simulate real-life incidents such as server outages, network outages and hardware failures.
- Analyse and resolve the issues found, and record the process for future reference.
- Monitor the system diligently before, during and after the outage, and run the test regularly to detect weak spots and resolve incidents proactively.
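The steps above can be sketched as a small experiment loop: measure the steady state, inject a failure, measure again, recover, and record whether the hypothesis held. Everything here is an assumption for illustration, including the `run_chaos_experiment` function, the `max_drop` tolerance and the toy three-replica system; a real setup would read its metric from your monitoring tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentReport:
    baseline: float        # metric in the resting (steady) state
    during: float          # metric while the failure is active
    after: float           # metric once the failure is cleared
    hypothesis_held: bool  # did the system hold during the disruption?
    recovered: bool        # did it return to the steady state afterwards?


def run_chaos_experiment(
    measure: Callable[[], float],
    inject: Callable[[], None],
    recover: Callable[[], None],
    max_drop: float,
) -> ExperimentReport:
    """Compare a metric before, during and after an injected failure."""
    baseline = measure()          # define the steady state
    inject()                      # start the disruption
    try:
        during = measure()        # observe the system under failure
    finally:
        recover()                 # always clear the failure, even on error
    after = measure()
    return ExperimentReport(
        baseline=baseline,
        during=during,
        after=after,
        hypothesis_held=during >= baseline - max_drop,
        recovered=after >= baseline - max_drop,
    )


# Toy system: requests succeed as long as at least one replica is up.
replicas = {"a": True, "b": True, "c": True}


def success_rate() -> float:
    return 1.0 if any(replicas.values()) else 0.0


def kill_replica_a() -> None:
    replicas["a"] = False


def restore_replica_a() -> None:
    replicas["a"] = True


report = run_chaos_experiment(success_rate, kill_replica_a, restore_replica_a, max_drop=0.01)
print(report)
```

Keeping the returned report for each run gives you the record for future reference, and rerunning the same experiment on a schedule covers the regular testing described in the last step.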
These tests give the most realistic results in a production environment, provided the perimeter is kept tight enough to minimize customer impact.
What are the benefits of chaos testing?
- DevOps teams can quickly identify and resolve issues that slipped through testing.
- Unanticipated outages are minimized.
- Resilience is enhanced.
- It is ideal for large and complex systems.
However, chaos testing tends to be ineffective for smaller systems.
Use AlertOps to never miss an alert and to monitor your systems 24/7, during and outside testing!