Why is chaos engineering important?

Chaos engineering helps organizations identify weaknesses in their systems before they cause outages or customer impact, enabling teams to build more resilient systems.

How does chaos engineering work?

Chaos engineering works by intentionally introducing failures or disruptions into a system to observe how it responds, with the goal of uncovering vulnerabilities and improving system reliability.

What are some common chaos engineering experiments?

Common chaos engineering experiments include shutting down servers, introducing network latency, simulating service outages, and testing failover mechanisms.

Who uses chaos engineering?

Chaos engineering is used by organizations that rely on complex distributed systems, such as technology companies, financial institutions, and e-commerce platforms.

July 17, 2023

Chaos Engineering: Benefits, Best Practices, and Challenges

Q: What is chaos engineering?

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

By Shanika Wickramasinghe

Enterprise software systems have become more sophisticated, relying heavily on distributed components like cloud services and microservices. These systems are susceptible to disruptive events at any time, leading to system outages and unsatisfied customers.

Chaos engineering plays a vital role today in creating resilient systems.

This article walks you through the concept of chaos engineering, its importance, and its core principles. Additionally, we’ll explain the tools widely used today for chaos engineering and delve into the benefits and challenges associated with chaos engineering practices.

What is chaos engineering, and why is it important?

Chaos engineering assesses the resilience of production systems by testing their ability to withstand chaotic conditions or unpredictable and random behavior. This is accomplished by:

Running a series of experiments against systems.
Intentionally introducing failures.
Observing their behavior.

Chaos engineering originates from the popular concept called “chaos theory,” which focuses on the impact of unpredictable and random behavior in systems.

Chaos engineering aims to discover potential failure points and vulnerabilities in the underlying systems so that they can be corrected before they cause incidents in production. This prevents system outages that could affect availability for end users. Chaos engineering can significantly increase confidence in the resilience of production systems, particularly under unforeseen conditions.

Principles of chaos engineering

Unlike other types of testing, which verify behavior that is already known or assumed, chaos testing generates new insights about a system. It first hypothesizes how the system should behave when a particular failure scenario occurs. Experiments are then designed and run to check the system's actual behavior against that hypothesis, offering key insights into how it responds.

Chaos engineering defines general principles to follow when designing and conducting experiments.

Define the system’s steady-state

How does the system behave when it is steady? These definitions set the baseline for the experiments. The definition of steady state includes measurable outcomes defined using key performance indicators (KPIs). Some examples of such KPIs are:

“The system latency is below 300ms.”
“The error rate is below 3%.”
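The two example KPIs can be turned into an automated check. The following is a minimal sketch, not a definitive implementation: the names `SteadyState` and `within_steady_state` are hypothetical, and reading "latency" as the 95th-percentile latency is an assumption, since the KPI does not say which statistic it means.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class SteadyState:
    """Baseline thresholds; the defaults mirror the two example KPIs."""
    max_p95_latency_ms: float = 300.0  # assumption: "latency" means p95
    max_error_rate: float = 0.03       # 3%


def p95(samples_ms: list[float]) -> float:
    """Return the 95th-percentile latency of the observed samples."""
    # quantiles(..., n=20) yields 19 cut points; the last one is the 95th percentile.
    return quantiles(samples_ms, n=20)[-1]


def within_steady_state(
    latencies_ms: list[float],
    errors: int,
    requests: int,
    baseline: SteadyState = SteadyState(),
) -> bool:
    """True when both KPIs are inside the baseline thresholds."""
    if requests <= 0 or len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples and one request")
    error_rate = errors / requests
    return (
        p95(latencies_ms) < baseline.max_p95_latency_ms
        and error_rate < baseline.max_error_rate
    )
```

Running the same check before, during, and after an experiment gives a consistent yardstick for deciding whether the system left its steady state.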

Create the hypothesis

A chaos experiment needs a hypothesis on how the system will behave if a chaotic situation arises in a production environment. It should be based on the established baselines and knowledge of the behavior and weaknesses of the system. When creating a hypothesis, ask ‘what if’ questions or create statements on how the system should behave.

Examples include:

“If we increase the load by 1x, the system can handle it without issue.”
“An increase in request latency will not impact the user experience.”
“If the primary database is down, the system will automatically failover to the secondary database with minimum downtime.”
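A hypothesis is most useful when it is written so that observed metrics can confirm or refute it. The sketch below shows one way to do that. The `Hypothesis` type, the metric names, and the 30-second failover budget are all illustrative assumptions; the source only says "minimum downtime".

```python
from dataclasses import dataclass
from typing import Callable, Mapping

Metrics = Mapping[str, float]


@dataclass(frozen=True)
class Hypothesis:
    """A plain-language statement paired with a measurable check."""
    statement: str
    holds: Callable[[Metrics], bool]


# Hypothetical thresholds: 30 s failover budget and a 3% error-rate ceiling.
failover_hypothesis = Hypothesis(
    statement=(
        "If the primary database is down, the system fails over to the "
        "secondary database with minimum downtime."
    ),
    holds=lambda m: m["failover_seconds"] <= 30.0 and m["error_rate"] < 0.03,
)


def evaluate(hypothesis: Hypothesis, observed: Metrics) -> str:
    """Return 'confirmed' or 'refuted' for the observed metrics."""
    return "confirmed" if hypothesis.holds(observed) else "refuted"
```

For example, `evaluate(failover_hypothesis, {"failover_seconds": 12.0, "error_rate": 0.01})` reports that the hypothesis held for that run.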

Experiment by changing real-world conditions

Consider real-world scenarios or events that can deviate from the steady state. For example:

Events resulting in hardware and software failures.
High network latency and error rates.
Network traffic spikes.

Injecting such events helps identify vulnerabilities and verifies that the system can handle different scenarios.
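To make this concrete, here is a small sketch of application-level fault injection: a wrapper that adds latency and random errors around any call. The names `with_faults` and `InjectedFault` are hypothetical. Dedicated tools usually inject faults at the network or infrastructure layer instead, so treat this only as an illustration of the idea.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency failure."""


def with_faults(
    call: Callable[..., T],
    *,
    latency_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap `call` so each invocation may be delayed or fail."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs) -> T:
        if latency_s > 0:
            sleep(latency_s)  # simulate high network latency
        if rng.random() < error_rate:
            raise InjectedFault("injected failure")  # simulate an error response
        return call(*args, **kwargs)

    return wrapped
```

Passing `sleep` and `rng` as parameters keeps the wrapper testable: a test can record the delays instead of waiting for them and can seed the random generator for repeatable runs.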

Run automated experiments in production environments

Earlier environments such as development, staging, and pre-production do not fully replicate actual production systems. That’s why chaos engineering experiments run in actual production systems under controlled conditions.

Minimize the blast radius

Since chaos engineering experiments are conducted within real production environments, it is crucial to minimize any potential performance degradation or disruptions that customers may experience during their execution. The blast radius should be determined using metrics such as:

The number of affected users
Impacted locations
Workload quantities

Therefore, it is advisable to schedule these experiments during non-peak times and ensure the availability of backup systems for restorations.
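One common way to cap the number of affected users is to expose only a fixed percentage of them to an experiment. The sketch below assumes users are identified by a string ID. The function name `in_blast_radius` and the hashing scheme are illustrative choices, not something the article prescribes.

```python
import hashlib


def in_blast_radius(user_id: str, experiment: str, percent: float) -> bool:
    """Deterministically decide whether a user is exposed to an experiment.

    `percent` is the share of users to include, from 0 to 100.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # Hash the experiment name together with the user ID so that each
    # experiment selects a different, but stable, subset of users.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Because the decision depends only on the inputs, the same user stays inside or outside the experiment for its whole duration, which makes the impact easier to measure and to roll back.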

Best practices for chaos engineering

Chaos engineering requires careful integration of some best practices to ensure the seamless execution of experiments and gain insights into system behavior under chaotic conditions.

Gradually scale up your experiments

First, start with a smaller component of your system and introduce a minor disruption with limited impact. As you gain confidence, gradually scale up the experiments, increasing the complexity and intensity of the disruptions.
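The ramp-up described here can be sketched as a loop that raises the intensity step by step and halts as soon as the system leaves its steady state. The function `ramp_experiment` and its `run_at` callback are hypothetical names. What a "level" means (share of traffic, number of hosts, injected delay) is left to the caller.

```python
from typing import Callable, Iterable


def ramp_experiment(
    levels: Iterable[float],
    run_at: Callable[[float], bool],
) -> list[float]:
    """Run an experiment at increasing intensity levels.

    `run_at(level)` must return True if the system stayed within its
    steady state at that level. The ramp stops at the first failure and
    the levels that passed are returned.
    """
    passed: list[float] = []
    for level in sorted(levels):  # always start with the smallest disruption
        if not run_at(level):
            break  # do not escalate past a level the system could not handle
        passed.append(level)
    return passed
```

The last level in the returned list is the largest disruption the system is known to tolerate, which is a useful input for the next round of experiments.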

Focus on critical parts

During the hypothesis creation phase, it is crucial to prioritize critical components of the system and create specific, realistic hypotheses.

Accept failures

When an experiment fails, avoid discouragement and treat the failure as a result in its own right: you have learned something about the system. Be open to failure and use what you learned to improve the next experiment.

Measure and monitor everything

Chaos experiments should result in metrics that provide insights into the impact of those experiments. Those measurements help you discover how systems behave under abnormal conditions and provide valuable insights into areas that need improvement.
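A simple way to turn those measurements into findings is to compare each metric captured during the experiment with its baseline value. The following sketch assumes every metric is one where higher is worse (latency, error rate). The function name `regressions` and the 10% tolerance are illustrative assumptions.

```python
def regressions(
    baseline: dict[str, float],
    during: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return the relative change of every metric that worsened beyond `tolerance`.

    Assumes higher values are worse. Metrics missing from `during`, or with
    a zero baseline, are skipped because no relative change can be computed.
    """
    worsened: dict[str, float] = {}
    for name, before in baseline.items():
        if name not in during or before == 0:
            continue
        change = (during[name] - before) / before
        if change > tolerance:
            worsened[name] = change
    return worsened
```

An empty result means the system stayed close to its baseline. Any entry points to an area that needs improvement.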

Automate the experiments

Chaos experiments should be automated as much as possible, enabling rapid and continuous execution of repeated experiments while minimizing the need for manual, labor-intensive processes.
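An automated experiment usually follows the same sequence every time: confirm the steady state, inject the failure, check the steady state again, and always roll back. Below is a minimal sketch of such a runner. `Experiment` and `run` are hypothetical names, and real tooling adds scheduling, reporting, and abort conditions on top of this.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    name: str
    steady_state: Callable[[], bool]  # True while the system is healthy
    inject: Callable[[], None]        # introduce the failure
    rollback: Callable[[], None]      # undo the failure


def run(experiment: Experiment) -> str:
    """Run one experiment and return 'skipped', 'passed', or 'failed'."""
    if not experiment.steady_state():
        return "skipped"  # never add chaos to a system that is already unhealthy
    try:
        experiment.inject()
        return "passed" if experiment.steady_state() else "failed"
    finally:
        experiment.rollback()  # runs even if inject or the check raises
```

Because the runner has no manual steps, it can be called from a scheduler or a deployment pipeline to repeat the same experiment continuously.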

Incorporate what you have learned

Chaos engineering experiments result in important discoveries regarding system behaviors that have not been identified before. For instance, these experiments can:

Demonstrate the necessary changes in system architecture.
Provide insights into the resilient strategies that should be integrated into the system.

Incorporating this valuable knowledge into decision-making processes can contribute to the development of more resilient systems.

Involve all parties concerned

Chaos engineering is a collaborative effort, so it is essential to involve all concerned parties, including product managers, developers, and operations engineers, throughout the process. Doing so builds a shared understanding and helps meet everyone's expectations.

Benefits of chaos engineering

Companies that leverage chaos engineering practices reap numerous benefits.

Improved system resilience and availability - Chaos engineering aims to discover potential issues when the system faces unexpected circumstances. This proactive approach helps organizations improve their existing resiliency strategies and gain confidence in the reliability of their systems.
Prevent revenue losses - Depending on the criticality and system usage, an unexpected system outage can potentially result in the loss of billions of dollars in revenue. Chaos tests help prevent such revenue loss and reduce maintenance costs.
Develop an in-depth understanding of the system - Chaos engineering generates new knowledge about systems. It helps organizations better understand system behaviors, dependencies, and other interactions with different components. This in-depth understanding helps create better architectures in the future.
Improve failure recovery - Since chaos tests provide a good understanding of system behaviors in different outage conditions, organizations can speed up recovery in the event of similar outages.
Increase customer satisfaction - Chaos engineering enhances failure recovery and reduces downtime, bolstering the system's reputation for dependability and fostering customer satisfaction.

Challenges of chaos engineering

Risk of outages. Since chaos engineering tests run on production systems, there is a risk of data loss or service outages, so it's critical to carry out careful test planning and execution.
Resource limitations. Chaos engineering requires tools and human resources to plan and execute tests, which can be a limiting factor for some organizations.
Requirement of robust monitoring systems. Chaos tests require robust monitoring of system health and other metrics, so organizations need to invest up front in a carefully selected, reliable monitoring tool for chaos engineering to be effective.

Tools used in chaos engineering

As the significance of chaos engineering continues to grow, numerous software tools have emerged to streamline and facilitate the process. The following are some of the well-known and widely used tools.

Gremlin

Helps perform chaos engineering experiments in public cloud environments such as AWS, Azure, and GCP. It provides pre-built reliability tests to get started and identify issues faster. This tool can simulate various types of attacks and failure scenarios. Currently, Gremlin supports Linux, Windows, bare metal, and containerized environments like Kubernetes.

Chaos Monkey

One of the pioneering chaos engineering tools, introduced by Netflix, which went on to build a broader suite of failure injection tools called the “Simian Army.” Chaos Monkey simulates only one failure type: randomly terminating instances during a specific time frame. Importantly, this tool is designed to avoid any impact on customers in production systems.
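The behavior described above, terminating a random instance only inside a specific time frame, can be sketched in a few lines. This is not Chaos Monkey's code or API. The function `pick_victim`, the weekday rule, and the 09:00 to 15:00 window are assumptions made for illustration.

```python
import random
from datetime import datetime, time
from typing import Optional, Sequence


def pick_victim(
    instances: Sequence[str],
    now: datetime,
    rng: random.Random,
    window_start: time = time(9, 0),   # assumed start of the allowed window
    window_end: time = time(15, 0),    # assumed end of the allowed window
) -> Optional[str]:
    """Return one instance ID to terminate, or None if no termination is allowed."""
    is_weekday = now.weekday() < 5  # Monday is 0, Saturday is 5
    in_window = window_start <= now.time() < window_end
    if not (instances and is_weekday and in_window):
        return None
    return rng.choice(instances)
```

Restricting terminations to hours when engineers are available is one way to keep the customer impact of a surprise failure small.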

(Related reading: intro to Chaos Monkey.)

LitmusChaos

An open-source chaos engineering platform that takes a cloud-native approach to controlling and managing chaos practices. This user-friendly tool enables the proactive creation of chaos experiments, issue discovery, and efficient remediation processes. It can also be used to create and analyze chaos within Kubernetes environments.

Chaos Mesh

Another open-source, cloud-native tool that can simulate failures like network latency and resource utilization issues. This tool leverages Kubernetes environments to conduct chaos experiments. Additionally, Chaos Mesh can be integrated into DevOps workflows to discover abnormal behaviors during various stages of the product development life cycle.

(Learn how DevOps automation can improve your security testing and monitoring.)

AWS Fault Injection Simulator (FIS) and AWS Resilience Hub

AWS-managed services where you can perform chaos testing on AWS services. AWS FIS requires users to create an experiment template defining the actions, targets, and stop conditions of the experiments. The AWS Resilience Hub enables centralized management of resilience tests within the AWS environment.
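To show how actions, targets, and stop conditions fit together, the sketch below builds an experiment template as a plain Python dictionary. The helper name `stop_instance_template` is hypothetical and the ARNs in the usage note are placeholders. The field names reflect the FIS template format as understood at the time of writing, so check them against the current AWS documentation before use. Nothing here calls AWS.

```python
import json


def stop_instance_template(
    tag_key: str,
    tag_value: str,
    alarm_arn: str,
    role_arn: str,
) -> dict:
    """Build a template that stops one tagged EC2 instance for five minutes."""
    return {
        "description": "Stop one tagged instance and restart it after 5 minutes",
        # Targets: which resources the experiment may touch.
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # limit the blast radius to one instance
            }
        },
        # Actions: what to do to the selected targets.
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Stop conditions: halt the experiment if this alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }
```

Calling `json.dumps(stop_instance_template("Env", "chaos-test", "<alarm-arn>", "<role-arn>"), indent=2)` produces a JSON document that can be reviewed before it is submitted through the AWS console, CLI, or SDK.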

Steadybit

A tool that integrates resilience tests into continuous integration and deployment workflows. You can add open-source extension kits or create your own for flexible resilience test creation and execution. Extensions provided by the tool support a wide range of programming languages, allowing you to work with your preferred language.

Summing up the chaos

Chaos engineering is a must-have practice for modern enterprise software systems because they depend on distributed components. This approach deliberately introduces failures according to chaos engineering principles and observes the system's behavior.

Some chaos engineering principles include:

Creating your hypotheses.
Experimenting with real-world conditions in production systems.
Minimizing the blast radius.

Currently, there are several software tools for chaos testing. Companies can gain many benefits from chaos engineering, such as enhanced system resilience and availability, improved customer satisfaction, and reduced revenue losses. Nonetheless, there are also some challenges, such as the risk of outages, resource limitations, and the need for robust monitoring systems.

This posting does not necessarily represent Splunk's position, strategies or opinion.

Shanika Wickramasinghe

Shanika Wickramasinghe is a software engineer by profession and a graduate in Information Technology. Her specialties are Web and Mobile Development. Shanika considers writing the best medium to learn and share her knowledge. She is passionate about everything she does, loves to travel and enjoys nature whenever she takes a break from her busy work schedule. She also writes for her Medium blog sometimes. You can connect with her on LinkedIn.
