Chaos Monkey is an open-source tool whose primary use is to check system reliability against random instance failures.
Chaos Monkey follows the testing concept of chaos engineering, which prepares networked systems to remain resilient under random and unpredictable conditions.
Let’s take a deeper look.
Chaos Monkey was developed and released by Netflix. The project's GitHub repository describes this open-source tool as follows:
Chaos Monkey is responsible for randomly terminating instances in production to ensure that engineers implement their services to be resilient to instance failures.
The tool is based on the concepts of chaos engineering, which encourages experimentation: intentionally causing incidents in order to test and improve system reliability.
As such, it’s often part of software testing and the quality assurance (QA) part of a software development pipeline or practice.
Other dev-related practices that touch on chaos engineering include site reliability engineering (SRE), performance engineering, and even platform engineering.
In the traditional software engineering and QA approach, the functional specifications of the software design also define its behavioral attributes.
To evaluate the behavior of an isolated software system, we can compare the outputs for all input conditions and functional parameters against a reference measurement. Various testing configurations and types can, in theory, collectively guarantee full test coverage.
But what happens in a large-scale, complex and distributed network environment?
In the complex distributed systems that most organizations run, the functional specifications are not exhaustive: creating a specification that accurately maps every input combination to its output for every system component, node, and server is virtually impossible.
This means that the behavior of a system component is not fully known. That's due to two primary factors: the functional specification cannot exhaustively describe the system's behavior, and the environment itself is subject to random and unpredictable incidents.
So how do you characterize the behavior of these systems in an environment where IT incidents can occur randomly and unpredictably?
Netflix famously pioneered the discipline of Chaos Engineering with the following principles:
The first principle is to identify a reference state, a steady state, that characterizes the optimal working behavior of all system components. This definition can be vague: how do you decide that a system's behavior is optimal?
Availability metrics and dependability metrics are commonly chosen in the context of reliability engineering.
(Related reading: IT failure metrics.)
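To make this concrete, availability is commonly derived from two dependability metrics: mean time between failures (MTBF) and mean time to repair (MTTR). Here is a minimal sketch of that calculation; the figures are hypothetical.

```python
# A minimal sketch of an availability metric, using the classic formula
# availability = MTBF / (MTBF + MTTR). The figures are hypothetical.

mtbf_hours = 720.0  # mean time between failures (30 days)
mttr_hours = 0.5    # mean time to repair (30 minutes)

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"Availability: {availability:.5f} ({availability * 100:.3f}%)")
# -> Availability: 0.99931 (99.931%)
```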
The second principle is to vary real-world events. A series of computing operations leads known inputs to known outputs; this is the execution path of a software operation. The traditional approach to software QA evaluates a variety of execution paths as part of a full test-coverage strategy.
Chaos engineering employs a different approach: it injects randomness into the execution path of a software system.
How does it achieve this? The Chaos Monkey tooling injects random disruptions by terminating virtual machines (VMs) and server instances in microservices-based cloud environments.
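For illustration only, the core mechanic can be sketched in a few lines of Python against the AWS SDK (boto3): pick one random instance from an opted-in group and terminate it. The tag filter, region, and opt-in scheme are assumptions, not Chaos Monkey's actual implementation or configuration.

```python
# A minimal sketch of Chaos-Monkey-style failure injection: select one
# random EC2 instance from an opted-in group and terminate it.
# Illustrative only; the tag name and region are assumptions.
import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Only instances explicitly opted in to chaos experiments are eligible.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-opt-in", "Values": ["true"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```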
The third principle is to run experiments in production. Testing in the real world means replicating the production environment. The challenge here is that an internet-scale production environment cannot be replicated on a small set of testing servers.
Even if a testing environment exists that can fully reproduce the real-world production environment, the core concept of chaos engineering is to evaluate system resilience against real-world and unpredictable scenarios.
That's why this principle exists: no matter how closely your test environment resembles your production environment, chaos engineering still wants you to perform experiments on production.
The fourth principle is to automate experiments to run continuously, against both control groups and experimental groups. The differences from the hypothesized steady state are measured.
This continuous process is automated using tools such as Chaos Monkey, which injects system failures while ensuring that overall system operation remains feasible.
(Related reading: chaos testing & autonomous testing.)
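As a sketch of what the automated comparison behind such an experiment might look like, the snippet below measures how far an experimental group drifts from a control group on a steady-state metric. The metric, samples, and tolerance are all hypothetical; a real experiment would pull these values from a monitoring system.

```python
# A minimal sketch of a control-vs-experiment steady-state comparison.
# All figures are hypothetical.
from statistics import mean

def steady_state_deviation(control, experiment):
    """Relative deviation of the experiment group from the control group."""
    baseline = mean(control)
    return abs(mean(experiment) - baseline) / baseline

# Request success rates sampled per minute for each group.
control_group = [0.999, 0.998, 0.999, 0.999]
experiment_group = [0.997, 0.995, 0.996, 0.998]  # group with instances terminated

TOLERANCE = 0.01  # hypothesized steady state: deviation stays under 1%

deviation = steady_state_deviation(control_group, experiment_group)
print(f"Deviation: {deviation:.4f}")
assert deviation < TOLERANCE, "Steady-state hypothesis violated: halt the experiment"
```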
The goal for Chaos Monkey: Intentional failures in production environments
The idea of introducing failures in a production environment is daunting for DevOps and QA teams — after all, they’re striving to maintain maximum availability and mitigate the risk of downtime.
Limiting the risks associated with testing in the production environment is, in fact, part of Chaos Monkey's design philosophy and principles.
And what does this mean in practice for users of Chaos Monkey?
When defining failure scenarios as part of a failure model, it is important to bridge the gap between the generated distribution of failure incidents and the distribution observed in the real world.
The tool itself is simple: it does not employ complex probabilistic models to mimic real-world incident trends and data distributions. You can easily simulate random instance terminations at a configurable frequency and within a configured time window.
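A minimal sketch of such a simple schedule follows: one termination per application group, at a uniformly random time inside a working-hours window. The group names and window are assumptions for illustration, not Chaos Monkey's actual configuration.

```python
# A minimal sketch of a simple failure schedule with no probabilistic
# modeling of real-world incident distributions. The group names and
# time window are assumptions.
import random
from datetime import date, datetime, time, timedelta

GROUPS = ["checkout-service", "search-service", "recommendations"]
WINDOW_START = time(9, 0)   # 09:00 local time
WINDOW_END = time(17, 0)    # 17:00 local time

def random_termination_time(day: date) -> datetime:
    start = datetime.combine(day, WINDOW_START)
    end = datetime.combine(day, WINDOW_END)
    offset = random.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset)

today = date.today()
for group in GROUPS:
    print(f"{group}: terminate one instance at {random_termination_time(today):%H:%M}")
```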
These test scenarios should be based on the known performance of your system against your dependability metrics. This means that any discussion of the effective use of tools such as Chaos Monkey, and of reliability engineering in general, is incomplete without a discussion of monitoring and observability.
In the context of failure injection, you should continuously monitor the internal and external states of your network. In essence, observe how each injected failure propagates and whether the system deviates from its steady state.
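A minimal sketch of that monitoring loop follows, assuming hypothetical fetch_error_rate() and abort_experiment() stand-ins for your observability platform and chaos tooling.

```python
# A minimal sketch of steady-state monitoring during failure injection:
# poll a health metric and abort the experiment when it breaches the
# agreed threshold. fetch_error_rate() and abort_experiment() are
# hypothetical placeholders.
import random
import time

ERROR_RATE_THRESHOLD = 0.02  # steady-state hypothesis: errors stay under 2%
POLL_INTERVAL_SECONDS = 10

def fetch_error_rate():
    # Placeholder: in practice, query your observability platform here.
    return random.uniform(0.0, 0.03)

def abort_experiment():
    # Placeholder: in practice, tell your chaos tooling to stop injecting.
    print("Aborting experiment and restoring normal operation.")

for _ in range(6):
    error_rate = fetch_error_rate()
    print(f"error_rate={error_rate:.4f}")
    if error_rate > ERROR_RATE_THRESHOLD:
        abort_experiment()
        break
    time.sleep(POLL_INTERVAL_SECONDS)
```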
Finally, the use of tools such as Chaos Monkey can also prepare your organization for a cultural change: a culture that accepts failure, built by rehearsing test scenarios of random and unpredictable IT incidents.
(Related reading: IT change management & organizational change management.)