Chaos Engineering: Benefits, Best Practices, and Challenges

Enterprise software systems have become more sophisticated, relying heavily on distributed components like cloud services and microservices. These systems are susceptible to disruptive events at any time, leading to system outages and unsatisfied customers.
Chaos engineering plays a vital role today in creating resilient systems.
This article walks you through the concept of chaos engineering, its importance, and its core principles. Additionally, we’ll explain the tools widely used today for chaos engineering and delve into the benefits and challenges associated with chaos engineering practices.
What is chaos engineering, and why is it important?
Chaos engineering assesses the resilience of production systems by testing their ability to withstand chaotic conditions or unpredictable and random behavior. This is accomplished by:
- Running a series of experiments against systems.
- Intentionally introducing failures.
- Observing their behavior.
Chaos engineering borrows its name from “chaos theory,” which focuses on the impact of unpredictable and random behavior in systems.
Chaos engineering aims to discover potential failure points and vulnerabilities in the underlying systems so they can be corrected before they cause real incidents. This helps prevent outages that would reduce availability for end users. Chaos engineering can significantly increase confidence in the resilience of production systems, particularly in unforeseen conditions.
Principles of chaos engineering
Unlike other types of testing, which verify behavior that is already known and expected, chaos testing probes how a system responds to conditions nobody has anticipated, and so creates new insights. A chaos test first hypothesizes how the system should behave when a particular failure scenario occurs. Experiments are then designed and run to check the system's actual behavior against that hypothesis.
Chaos engineering defines general principles to follow when designing and conducting experiments.
Define the system’s steady-state
How does the system behave when it is steady? These definitions set the baseline for the experiments. The definition of steady state includes measurable outcomes defined using key performance indicators (KPIs). Some examples of such KPIs are:
- “The system latency is below 300ms.”
- “The error rate is below 3%.”
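The two example KPIs above can be turned into an automated steady-state check. The following is a minimal sketch, not a definitive implementation: the names `SteadyState` and `is_steady` are hypothetical, and it assumes "latency" means the 95th percentile of a sample window, which the examples do not specify.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SteadyState:
    """Baseline thresholds taken from the two example KPIs."""
    max_latency_ms: float = 300.0  # "The system latency is below 300ms."
    max_error_rate: float = 0.03   # "The error rate is below 3%."


def is_steady(latencies_ms, errors, requests, baseline=SteadyState()):
    """Return True when the observed metrics sit inside the baseline."""
    if requests <= 0 or len(latencies_ms) < 2:
        raise ValueError("need at least one request and two latency samples")
    # Assumption: "latency" is read as the 95th percentile of the window.
    # quantiles(..., n=20) returns 19 cut points; the last one is p95.
    p95 = quantiles(latencies_ms, n=20)[18]
    error_rate = errors / requests
    return p95 < baseline.max_latency_ms and error_rate < baseline.max_error_rate
```

A check like this gives every later experiment the same yes-or-no definition of "steady."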
Create the hypothesis
A chaos experiment needs a hypothesis on how the system will behave if a chaotic situation arises in a production environment. It should be based on the established baselines and knowledge of the behavior and weaknesses of the system. When creating a hypothesis, ask ‘what if’ questions or create statements on how the system should behave.
Examples include:
- “If we double the load, the system can handle it without issue.”
- “An increase in request latency will not impact the user experience.”
- “If the primary database is down, the system will automatically failover to the secondary database with minimum downtime.”
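A hypothesis is easier to test when it is written as a statement plus a measurable pass condition. The sketch below illustrates this for the database failover example. The `Hypothesis` class, the metric name `failover_downtime_s`, and the 30-second budget for "minimum downtime" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class Hypothesis:
    """A 'what if' statement paired with a check over observed metrics."""
    statement: str
    holds: Callable[[Mapping[str, float]], bool]


# Assumption: "minimum downtime" is interpreted as at most 30 seconds.
failover = Hypothesis(
    statement=(
        "If the primary database is down, the system fails over to the "
        "secondary database with minimum downtime."
    ),
    holds=lambda metrics: metrics["failover_downtime_s"] <= 30.0,
)

if __name__ == "__main__":
    observed = {"failover_downtime_s": 12.0}  # made-up measurement
    print(failover.statement, "->", failover.holds(observed))
```

Writing the pass condition down before the experiment runs keeps the result from being reinterpreted afterward.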
Experiment by changing real-world conditions
Consider real-world scenarios or events that can deviate from the steady state. For example:
- Events resulting in hardware and software failures.
- High network latency and error rates.
- Network traffic spikes.
Introducing these conditions deliberately helps identify vulnerabilities and verifies that the system can handle different scenarios.
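Conditions such as high latency and elevated error rates can be simulated at the application level by wrapping a call to a dependency. The following is a simplified sketch under stated assumptions: `with_faults` and `InjectedFault` are hypothetical names, and dedicated chaos tools usually inject faults at the network or infrastructure layer instead.

```python
import random
import time


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency failure."""


def with_faults(call, *, latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `call` so it is delayed and fails with the given probability."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate high network latency
        if rng.random() < error_rate:
            raise InjectedFault("injected failure")  # simulate an error
        return call(*args, **kwargs)

    return wrapped


if __name__ == "__main__":
    # Hypothetical dependency call with 200 ms extra latency and 10% errors.
    lookup = with_faults(lambda key: f"value-for-{key}", latency_s=0.2, error_rate=0.1)
    try:
        print(lookup("user-42"))
    except InjectedFault as exc:
        print("dependency failed:", exc)
```

Passing `rng` and `sleep` as parameters makes the wrapper repeatable in tests, since a seeded generator and a stub sleep can replace the real ones.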
Run automated experiments in production environments
Pre-production environments such as development and staging do not fully replicate the traffic, data, and scale of actual production systems. That’s why chaos engineering experiments run in actual production systems under controlled conditions.
Minimize the blast radius
Since chaos engineering experiments are conducted within real production environments, it is crucial to minimize any potential performance degradation or disruptions that customers may experience during their execution. The blast radius should be determined using metrics such as:
- The number of affected users
- Impacted locations
- Workload quantities
It is also advisable to schedule these experiments during non-peak times and to make sure backup systems are available for restoration.
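One way to cap the number of affected users is to expose only a fixed percentage of them to an experiment. The sketch below assigns each user to a stable bucket by hashing the user ID. The function name `in_blast_radius` and the `salt` value are hypothetical, and this is only one of several possible ways to limit exposure.

```python
import hashlib


def in_blast_radius(user_id: str, percent: int, salt: str = "experiment-1") -> bool:
    """Return True if this user falls inside the experiment's user share."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Hash the salted ID so the same user always lands in the same bucket.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    affected = [u for u in users if in_blast_radius(u, percent=5)]
    print(f"{len(affected)} of {len(users)} users are in the experiment")
```

Because the bucket is stable, raising the percentage only adds users to the experiment. Nobody already inside it is moved out.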
Best practices for chaos engineering
Following a few best practices helps chaos experiments run smoothly and produce useful insights into system behavior under chaotic conditions.
Gradually scale up your experiments
First, start with a smaller component of your system and introduce a minor disruption with limited impact. As you gain confidence, gradually scale up the experiments, increasing the complexity and intensity of the disruptions.
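Gradual scaling can be expressed as a list of stages that stops at the first stage the system fails to tolerate. This is a minimal sketch: `ramp_up` is a hypothetical helper, and `run_stage` stands in for whatever runs one experiment at a given intensity and reports whether the steady state held.

```python
def ramp_up(stages, run_stage):
    """Run stages in increasing intensity; stop at the first failure.

    Returns the highest intensity that passed, or None if none did.
    """
    last_ok = None
    for intensity in stages:
        if not run_stage(intensity):
            break  # do not escalate past a stage the system failed
        last_ok = intensity
    return last_ok


if __name__ == "__main__":
    # Hypothetical stages: percentage of instances receiving the fault.
    # The stand-in system below tolerates faults on up to 10% of instances.
    print(ramp_up([1, 5, 10, 25, 50], run_stage=lambda pct: pct <= 10))
```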
Focus on critical parts
During the hypothesis creation phase, it is crucial to prioritize critical components of the system and create specific, realistic hypotheses.
Accept failures
When an experiment fails, do not be discouraged. Treat the failure as a result in its own right: it has revealed something about the system that was not known before. Be open to failure and use what you learn to improve the next experiment.
Measure and monitor everything
Chaos experiments should result in metrics that provide insights into the impact of those experiments. Those measurements help you discover how systems behave under abnormal conditions and provide valuable insights into areas that need improvement.
Automate the experiments
Chaos experiments should be automated as much as possible, enabling rapid and continuous execution of repeated experiments while minimizing the need for manual, labor-intensive processes.
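An automated experiment typically follows the same loop each time: confirm the steady state, inject the failure, observe, and always roll back. The sketch below shows that loop under stated assumptions. `run_experiment` and `ExperimentResult` are hypothetical names, and the `inject`, `rollback`, and `steady_state_ok` callables are placeholders for real tooling.

```python
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    name: str
    hypothesis_held: bool
    aborted: bool  # True when the experiment never started


def run_experiment(name, inject, rollback, steady_state_ok, checks=3):
    """Inject a failure, observe the steady state, and always roll back."""
    if not steady_state_ok():
        # Never start an experiment on a system that is already unhealthy.
        return ExperimentResult(name, hypothesis_held=False, aborted=True)
    try:
        inject()
        for _ in range(checks):
            if not steady_state_ok():
                # Steady state violated: stop observing immediately.
                return ExperimentResult(name, hypothesis_held=False, aborted=False)
        return ExperimentResult(name, hypothesis_held=True, aborted=False)
    finally:
        rollback()  # runs on success, on failure, and on exceptions


if __name__ == "__main__":
    result = run_experiment(
        "stop-cache-node",
        inject=lambda: print("injecting failure"),
        rollback=lambda: print("rolling back"),
        steady_state_ok=lambda: True,
    )
    print(result)
```

Putting the rollback in a `finally` block means the fault is removed even if the observation step raises an error, which matters when experiments run unattended.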
Incorporate what you have learned
Chaos engineering experiments result in important discoveries regarding system behaviors that have not been identified before. For instance, these experiments can:
- Demonstrate the necessary changes in system architecture.
- Provide insights into the resilient strategies that should be integrated into the system.
Incorporating this valuable knowledge into decision-making processes can contribute to the development of more resilient systems.
Involve all parties concerned
Chaos engineering is a collaborative effort, so it is essential to involve all concerned parties, including product managers, developers, and operations engineers, throughout the process. Involving everyone builds a shared understanding of the experiments and helps align expectations.
Benefits of chaos engineering
Companies that adopt chaos engineering practices gain several benefits.
- Improved system resilience and availability - Chaos engineering aims to discover potential issues when the system faces unexpected circumstances. This proactive approach helps organizations improve their existing resiliency strategies and gain confidence in the reliability of their systems.
- Prevent revenue losses - Depending on how critical and heavily used a system is, an unexpected outage can result in substantial revenue losses. Chaos tests help prevent such losses and reduce maintenance costs.
- Develop an in-depth understanding of the system - Chaos engineering generates new knowledge about systems. It helps organizations better understand system behaviors, dependencies, and other interactions with different components. This in-depth understanding helps create better architectures in the future.
- Improve failure recovery - Since chaos tests provide a good understanding of system behaviors in different outage conditions, organizations can speed up recovery in the event of similar outages.
- Increase customer satisfaction - Chaos engineering enhances failure recovery and reduces downtime, strengthening your reputation as a dependable provider and fostering customer satisfaction.
Challenges of chaos engineering
- Risk of outages. Since chaos engineering tests run on production systems, there is a risk of data loss or service outages, so it's critical to carry out careful test planning and execution.
- Resource limitations. Chaos engineering requires tools and people to plan and execute tests, which can be a limiting factor for some organizations.
- Requirement of robust monitoring systems. Chaos tests depend on robust monitoring of system health and other metrics, so organizations need to invest in and carefully select a reliable monitoring tool before they can run effective experiments.
Tools used in chaos engineering
As the significance of chaos engineering continues to grow, numerous software tools have emerged to streamline and facilitate the process. The following are some of the well-known and widely used tools.
Gremlin
Helps perform chaos engineering experiments in public cloud environments such as AWS, Azure, and GCP. It provides pre-built reliability tests to get started and identify issues faster. This tool can simulate various types of attacks and failure scenarios. Currently, Gremlin supports Linux, Windows, containerized environments like Kubernetes, and bare metal.
Chaos Monkey
One of the pioneering chaos engineering tools, introduced by Netflix, which later built on it to create a broader suite of failure injection tools called the “Simian Army.” Chaos Monkey simulates only one failure type: it randomly terminates instances during a specific time frame. Importantly, the tool is designed to keep the impact on customers in production systems to a minimum.
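The behavior described above, randomly terminating an instance only inside a set time frame, can be illustrated with a short simulation. This is not Chaos Monkey's actual code or configuration. The function `pick_victim`, the weekday restriction, and the 09:00 to 15:00 window are assumptions made for the sketch, and nothing is actually terminated.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence


def pick_victim(
    instances: Sequence[str],
    now: datetime,
    rng: random.Random,
    window: tuple = (time(9, 0), time(15, 0)),  # assumed time frame
    weekdays_only: bool = True,                 # assumed restriction
) -> Optional[str]:
    """Return one random instance ID, or None outside the time frame."""
    if weekdays_only and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return None
    if not (window[0] <= now.time() < window[1]):
        return None
    if not instances:
        return None
    return rng.choice(list(instances))


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # hypothetical instance IDs
    victim = pick_victim(fleet, datetime.now(), random.Random())
    print("would terminate:", victim)
```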
LitmusChaos
An open-source chaos engineering platform that takes a cloud-native approach to controlling and managing chaos practices. This user-friendly tool enables the proactive creation of chaos experiments, issue discovery, and efficient remediation processes. It can also be used to create and analyze chaos within Kubernetes environments.
Chaos Mesh
Another open-source, cloud-native tool that can simulate failures such as network latency and resource utilization issues. It runs chaos experiments in Kubernetes environments. Additionally, Chaos Mesh can be integrated into DevOps workflows to discover abnormal behaviors during various stages of the product development life cycle.
AWS Fault Injection Simulator (FIS) and AWS Resilience Hub
AWS-managed services for running chaos tests against workloads hosted on AWS. FIS requires users to create an experiment template defining the actions, targets, and stop conditions of the experiments. The AWS Resilience Hub enables centralized management of resilience tests within the AWS environment.
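To make the three parts of an experiment template concrete, the sketch below builds one as a Python dictionary and prints it as JSON. The field names reflect our understanding of the FIS template format and should be checked against the current AWS documentation. The tag, alarm ARN, and role ARN are hypothetical placeholders, and `undefined_targets` is a helper written only for this example. Nothing is sent to AWS.

```python
import json

experiment_template = {
    "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
    # Targets: which resources the experiment may touch.
    "targets": {
        "oneInstance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
            "selectionMode": "COUNT(1)",
        }
    },
    # Actions: what failure to inject, and into which target.
    "actions": {
        "stopInstance": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "oneInstance"},
        }
    },
    # Stop conditions: when to halt the experiment automatically.
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            # Placeholder ARN; replace with a real alarm.
            "value": "arn:aws:cloudwatch:REGION:ACCOUNT_ID:alarm:EXAMPLE_ALARM",
        }
    ],
    # Placeholder ARN for the role FIS assumes to run the actions.
    "roleArn": "arn:aws:iam::ACCOUNT_ID:role/EXAMPLE_FIS_ROLE",
}


def undefined_targets(template):
    """List target names used by actions but missing from 'targets'."""
    defined = set(template["targets"])
    return [
        name
        for action in template["actions"].values()
        for name in action.get("targets", {}).values()
        if name not in defined
    ]


if __name__ == "__main__":
    assert not undefined_targets(experiment_template)
    print(json.dumps(experiment_template, indent=2))
```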
Steadybit
A tool that integrates resilience tests into continuous integration and deployment workflows. You can add open-source extension kits or create your own for flexible resilience test creation and execution. Extensions provided by the tool support a wide range of programming languages, allowing you to work with your preferred language.
Summing up the chaos
Chaos engineering is a valuable practice for modern enterprise software systems because they depend on distributed components. This approach deliberately introduces failures according to chaos engineering principles and observes the system's behavior.
Some chaos engineering principles include:
- Creating your hypotheses.
- Experimenting with real-world conditions in production systems.
- Minimizing the blast radius.
Several software tools are currently available for chaos testing. Companies can gain many benefits from chaos engineering, such as enhanced system resilience and availability, improved customer satisfaction, and fewer revenue losses from outages. Nonetheless, there are also some challenges, such as the risk of outages, resource limitations, and the need for robust monitoring systems.