Chaos Engineering is a discipline that seeks to build resilient systems that can withstand the chaotic conditions in which they operate. As large-scale, distributed software continues to grow more complex, software engineering has adapted by increasing development speed and flexibility.
Chaos Engineering helps developers build confidence in their complex systems by testing them under a variety of turbulent conditions.
As in any discipline, failing to follow best practices hinders efficacy and wastes resources, so let's review the standards and ideals of Chaos Engineering at its best.
The best practices of chaos engineering have truly been shaped by the large companies that spearheaded its conception, mostly out of necessity.
Take Netflix, for example, which evolved its internal chaos testing over time to manage latency, enforce conformity, and monitor a number of metrics across its incredibly complex software.
Other organizations have done the same, and so we've compiled and refined these principles based on these large-scale, highly effective applications of chaos engineering.
In order for chaos engineering principles to work, you have to define your control. In other words, you have to determine what the "normal" state is for whatever service or application you're working with.
This first step involves integrating the appropriate monitoring services so that you can determine what is normal behavior.
From there, you must define the thresholds so that you know when the system crosses outside of the norm.
Any professional involved in chaos engineering should be able to clearly define when a system is behaving normally and immediately recognize when it is behaving abnormally.
You can accomplish that by carefully choosing metrics and monitoring them closely.
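To make "normal" concrete, here is a minimal Python sketch of a steady-state definition. It assumes you already collect a metric such as p95 latency, and it uses a band of the baseline mean plus or minus three standard deviations as the threshold. That rule, the metric name, and the sample values are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class SteadyState:
    """Acceptable range for one metric, derived from baseline samples."""
    metric: str
    lower: float
    upper: float

    def is_normal(self, value: float) -> bool:
        # A reading counts as normal when it falls inside the band.
        return self.lower <= value <= self.upper


def baseline_from_samples(metric: str, samples: list[float],
                          tolerance: float = 3.0) -> SteadyState:
    """Build a band of mean +/- tolerance * standard deviation (assumed rule)."""
    if not samples:
        raise ValueError("need at least one baseline sample")
    center = mean(samples)
    spread = pstdev(samples)
    return SteadyState(metric, center - tolerance * spread,
                       center + tolerance * spread)


# Hypothetical p95 latency readings (ms) collected during normal operation.
latency = baseline_from_samples("p95_latency_ms",
                                [120.0, 130.0, 125.0, 135.0, 115.0])
print(latency.is_normal(128.0))  # a reading close to the baseline
print(latency.is_normal(400.0))  # a reading far outside the baseline
```

In practice the baseline would come from your monitoring service rather than a hard-coded list, and you would likely track several metrics at once.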
Once you know what normal behavior is for your system, you can begin to put chaos engineering to work by disrupting that state of normal in a controlled manner.
What's essential here is that you create realistic disruption by thinking about the most likely scenarios that could cause your system to fail.
Think recurring traffic spikes, servers being killed, or other chaos that's likely to occur over time.
The chaos tests that you execute should reflect real-world scenarios.
Each scenario should be a matter of "when" rather than "if." Otherwise, you aren't going to get much value out of chaos engineering.
The goal with testing is merely to understand how certain failures impact your system so that you can then change your system in a way that makes it more resilient.
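One simple way to introduce a controlled disruption at the application level is to wrap a dependency call so that a fraction of calls fail or slow down. The sketch below is an assumption-laden illustration: the `chaos` decorator, the `InjectedFault` exception, and `fetch_profile` are hypothetical names, and dedicated chaos tools usually inject faults at the infrastructure level instead.

```python
import random
import time
from functools import wraps
from typing import Optional


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def chaos(failure_rate: float, max_delay_s: float = 0.0,
          rng: Optional[random.Random] = None):
    """Wrap a function so that some calls are delayed or fail."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if max_delay_s > 0:
                # Simulate added latency before the call.
                time.sleep(rng.uniform(0, max_delay_s))
            if rng.random() < failure_rate:
                # Simulate the dependency being unavailable.
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# A seeded generator keeps the experiment repeatable between runs.
@chaos(failure_rate=0.3, rng=random.Random(42))
def fetch_profile(user_id: int) -> dict:
    return {"id": user_id, "name": "example"}


def count_failures(calls: int) -> int:
    """Call the wrapped function repeatedly and count injected failures."""
    failures = 0
    for i in range(calls):
        try:
            fetch_profile(i)
        except InjectedFault:
            failures += 1
    return failures


failures = count_failures(200)
print(f"{failures} of 200 calls failed")
```

Watching how callers cope with those failures (retries, fallbacks, timeouts) is what tells you where the system needs to become more resilient.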
By nature, chaos engineering requires you to test scenarios that will produce unknown results.
This means there's always a chance of system downtime or another negative impact on your users.
Responsible chaos engineering requires your team to prepare ahead for any foreseeable incidents so that you can monitor for customer problems and quickly get things up and running again should any collateral damage occur.
While you should try to minimize any outages that result from chaos testing, remember that every failure is a learning experience.
If something you do causes an outage, it will produce plenty of insights that help you build more robust solutions that won't fail under those conditions again.
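Preparing for collateral damage can be partly automated. The following sketch, under the assumption that you can read a health metric such as an error rate, aborts an experiment once the metric crosses a limit and always rolls the fault back. The function and parameter names are hypothetical, and the readings are simulated.

```python
from typing import Callable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_error_rate: Callable[[], float],
    abort_above: float,
    checks: int,
) -> str:
    """Inject a fault, poll a health metric, and always roll back."""
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > abort_above:
                return "aborted"
        return "completed"
    finally:
        # Runs on abort, on completion, and on unexpected exceptions.
        rollback()


# Simulated error-rate readings; the third one crosses the 5% limit.
events = []
readings = iter([0.01, 0.02, 0.09, 0.01])
result = run_experiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_error_rate=lambda: next(readings),
    abort_above=0.05,
    checks=4,
)
print(result, events)
```

A real runner would wait between checks and read the metric from your monitoring system. The important design choice is the `finally` block, which guarantees the rollback happens however the experiment ends.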
Practice makes perfect, so your first few chaos tests should always be held in a staging environment.
However, you ultimately want to reach a point where you're running every chaos test in your actual live production environment.
It might sound crazy, but that's the only way to see the real impact of failures and outages.
If you're worried that your chaos testing will negatively impact users, go back to the previous principle.
Carefully planning your test and preparing in advance for foreseeable outages will help your team confidently proceed with rigorous chaos tests and collect meaningful data along the way.
Returning to Netflix, which uses chaos engineering to support continuous delivery: the company has taken an approach of "continuous chaos," in which its chaos testing tools run constantly, trying to break things.
With this approach, the Netflix team is able to quickly identify issues and fix them, possibly before they ever impact a real-world customer.
Continuous chaos will help your team of developers gain deeper knowledge of your systems and get on the path to continuous improvement, both for the systems you have now and for any systems you create in the future.
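As a rough illustration of what one round of continuous chaos might look like, the sketch below picks at most one instance per service to disrupt, and only where enough replicas would remain. This is not a description of Netflix's tooling. The `pick_targets` function, the `min_healthy` rule, and the fleet data are assumptions made for the example.

```python
import random


def pick_targets(instances: dict[str, list[str]], rng: random.Random,
                 min_healthy: int = 2) -> list[str]:
    """Choose at most one instance per service, leaving min_healthy replicas."""
    chosen = []
    for service, replicas in sorted(instances.items()):
        # Skip services that could not spare an instance.
        if len(replicas) > min_healthy:
            chosen.append(rng.choice(replicas))
    return chosen


# Hypothetical fleet: "search" has too few replicas to be a safe target.
fleet = {
    "checkout": ["c-1", "c-2", "c-3"],
    "search": ["s-1", "s-2"],
}
targets = pick_targets(fleet, random.Random(7))
print(targets)
```

Running a selection like this on a schedule, with a safeguard such as `min_healthy`, is one way to keep the tests frequent while limiting how much damage a single round can do.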
It can be very scary to release your first chaos test in your live production environment, and even scarier when you realize that a chaos test has impacted real-world users.
By nature, chaos testing is filled with unknowns, which is why it's so important to build confidence in your testing methods.
Planning in advance, keeping the right people around in case of incidents, and simply committing to each test will help you become more effective.
Make sure you take detailed notes and carefully track the right metrics so that you can characterize outcomes and learn from incidents.
At Adservio, our mission is to build more resilient digital experiences, and chaos testing is just one of many methods we apply to help our clients protect their applications and systems from outages.
If you need help implementing chaos testing, or if you're interested in learning more about our approach to high availability at scale, contact our team.