Embracing the Chaos: How Real Downtime Drills Build Resilient Organizations

Downtime happens, so it's best to have the proper procedures and tools in place. To strengthen your plan, consider adopting “chaos engineering,” which involves real drills that build more resilient systems. 

In an ideal world, hardware always works, software stacks are bug-free, networks are reliable, and users never submit invalid data. But in the real world, none of that is true. Something will always break or behave differently than we expect. There are bugs, misconfigurations, and outright attacks. A surprise awaits at every turn. 


An organization’s long-term success is defined by how it prepares for and navigates unexpected scenarios. Although there’s no one-size-fits-all approach, certain practices are essential to maintaining good corporate hygiene: instrumenting every application, following a strict runbook for outages, and identifying the engineers who own each system. However, to truly strengthen an organization’s overall resilience, I recommend adopting the art of “chaos engineering.”


Chaos engineering is a way to break your systems in a controlled manner, replicating as closely as possible the real situations in which an application will operate. It embraces the attitude that systems will not always work correctly. It lets you break them in a controlled environment so that when they fail in the real world, you can manage the situation effectively. Simply put, it’s a way to make your architecture more resilient by purposefully introducing unpredictability into your environment.


Here, I’ll explore some principles and theories behind chaos engineering, the benefits of exposing SecOps, ITOps, and engineering teams to controlled chaos, and steps your organization can take to become more robust and resilient. 

 

Controlling chaos

Although the term “chaos engineering” might suggest something unplanned or haphazard, that’s the wrong way to approach it. You need to have a plan. Think of it as an experiment and create a hypothesis. Ask yourself:


What are you trying to test?

What's the blast radius going to be? 

What will you do if your test's consequences don’t go as expected? 

Do you have a plan? 
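
One way to keep these questions honest is to write the answers down before any experiment runs. The sketch below is only an illustration: the ChaosExperiment class, its field names, and the sample answers are hypothetical, not part of any chaos engineering tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ChaosExperiment:
    """Hypothetical record of the answers to the planning questions."""
    hypothesis: str       # what you are trying to test
    blast_radius: str     # which systems or users may be affected
    abort_condition: str  # the signal that the test is not going as expected
    rollback_plan: str    # what you will do when that signal fires

    def missing_answers(self) -> list[str]:
        """Return the names of any questions left unanswered."""
        return [f.name for f in fields(self)
                if not getattr(self, f.name).strip()]


# Illustrative draft: two questions are still unanswered.
draft = ChaosExperiment(
    hypothesis="Checkout stays available if one pod is killed",
    blast_radius="One pod in the staging checkout service",
    abort_condition="",
    rollback_plan="",
)
missing = draft.missing_answers()
print("Not ready to run; unanswered:", missing)
```

A team could refuse to schedule any experiment whose list of missing answers is not empty.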


A chaos engineering practice in ITOps and engineering environments involves inducing failures in a way you can control. For example, you could randomly kill running Kubernetes pods, disable network connectivity, or induce latency. You could even drop every third packet arriving at a network interface or set your load balancer to send all the traffic to one instance. The possibilities are endless when emulating potential failures in your environment. By inducing these failures, you train your team to respond to them, and building alerts and automation around them creates more resilient applications.
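
To make the packet-dropping and latency ideas concrete, here is a minimal in-process sketch. It assumes a hypothetical FaultInjector wrapper around any send function. Real experiments would typically inject faults at the network or orchestration layer; this only simulates the behavior so that calling code can be tested against it.

```python
import time
from typing import Callable


class FaultInjector:
    """Hypothetical wrapper that drops every Nth packet and can add latency."""

    def __init__(self, send: Callable[[bytes], None], drop_every: int = 3,
                 latency_s: float = 0.0) -> None:
        if drop_every < 1:
            raise ValueError("drop_every must be >= 1")
        self._send = send
        self._drop_every = drop_every
        self._latency_s = latency_s
        self._count = 0
        self.dropped = 0

    def send(self, packet: bytes) -> bool:
        """Forward the packet unless it is the Nth one; True if delivered."""
        self._count += 1
        if self._count % self._drop_every == 0:
            self.dropped += 1  # simulate a lost packet
            return False
        if self._latency_s > 0:
            time.sleep(self._latency_s)  # simulate a slow link
        self._send(packet)
        return True


# Illustrative run: nine packets through an injector that drops every third.
delivered: list[bytes] = []
injector = FaultInjector(delivered.append, drop_every=3)
results = [injector.send(f"pkt-{i}".encode()) for i in range(1, 10)]
print(f"delivered={len(delivered)} dropped={injector.dropped}")
```

Because the drop pattern is deterministic, a failing retry or timeout path can be reproduced exactly, which makes the experiment easier to reason about than a purely random one.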


In a mature chaos engineering practice, experiments arrive unpredictably and seemingly at the worst possible times, such as during an existing (non-business-critical) outage or a release cycle. The moments when your systems are weakest are exactly when chaos engineering is most valuable. Probing for the most dangerous or problematic times for your business makes your organization and architecture more resilient. However, if you are experiencing a customer-affecting outage, hold off on the chaos. Make sure you build an “off switch” into your chaos infrastructure so you can break glass in case of emergency.
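
An off switch can be as simple as a shared flag that every experiment checks before it runs. The sketch below is a minimal illustration; the ChaosSwitch class and its method names are hypothetical, and a production version would likely sit in shared configuration rather than in one process.

```python
import threading
from typing import Callable


class ChaosSwitch:
    """Hypothetical break-glass control checked before each experiment."""

    def __init__(self) -> None:
        self._halted = threading.Event()  # thread-safe on/off flag

    def halt(self) -> None:
        """Break glass: stop all further experiments."""
        self._halted.set()

    def resume(self) -> None:
        """Allow experiments to run again."""
        self._halted.clear()

    def run(self, experiment: Callable[[], None]) -> bool:
        """Run the experiment only if chaos is not halted; True if it ran."""
        if self._halted.is_set():
            return False
        experiment()
        return True


# Illustrative run: the second experiment is skipped after the switch is hit.
switch = ChaosSwitch()
log: list[str] = []
ran_before = switch.run(lambda: log.append("latency injected"))
switch.halt()  # for example, a customer-affecting outage has just begun
ran_after = switch.run(lambda: log.append("pod killed"))
print(ran_before, ran_after, log)
```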

 

The benefits of chaos 

When it comes to ITOps and engineering, chaos engineering creates more robust and reliable applications. Instead of tossing and turning at night wondering about the next fire drill, teams can sleep knowing they will spend their valuable time innovating and working on new application updates. After all, they’ve already experienced (and mitigated) tons of failure scenarios.


SecOps, on the other hand, can be resistant to chaos engineering, and with good reason: Since you’re deliberately introducing problems, you may also intentionally exercise code paths that haven't been thoroughly tested or may create security problems down the road. 


However, security teams already employ a version of chaos engineering called “fuzzing.” Despite the cute-sounding name, fuzzing feeds random or invalid data, potentially sent to unexpected endpoints, into an application to find vulnerabilities. Unlike chaos engineering, the intent is not to break the software or cause a problem. Instead, it’s to ensure that applications are secure against any possible input and that flaws such as buffer overflows are mitigated by design.
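
As a rough illustration of the idea, the sketch below fuzzes a small parser with edge cases and random bytes. Both parse_record and fuzz are hypothetical names written for this example, and the parser contains a deliberate bug so the fuzzer has something to find. Dedicated fuzzing tools are far more sophisticated than this loop.

```python
import random
from typing import Callable


def parse_record(raw: bytes) -> bytes:
    """Hypothetical parser under test: first byte is the payload length."""
    length = raw[0]  # deliberate bug: IndexError when raw is empty
    payload = raw[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated record")  # clean, expected rejection
    return payload


def fuzz(target: Callable[[bytes], object], runs: int = 200,
         seed: int = 0) -> list[tuple[bytes, Exception]]:
    """Feed edge cases plus random bytes to target; collect unexpected errors."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    inputs = [b"", b"\x00", b"\xff"]  # small corpus of known edge cases
    inputs += [bytes(rng.randrange(256) for _ in range(rng.randrange(8)))
               for _ in range(runs)]
    crashes = []
    for data in inputs:
        try:
            target(data)
        except ValueError:
            pass  # the parser rejected bad input as designed
        except Exception as exc:  # anything else is a finding
            crashes.append((data, exc))
    return crashes


crashes = fuzz(parse_record)
print(f"{len(crashes)} unexpected failure(s) found")
```

Here the only unexpected failures come from empty input, which the parser indexes without checking. Rejecting empty input with a ValueError would be the kind of by-design mitigation the fuzzing run is meant to prompt.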


But when you perform chaos engineering in a security context, you take it one step further, thinking like an attacker and doing things only an attacker would do. The difference is that you control what gets injected into your SQL queries and which attacks are run against your exposed infrastructure. Performing chaos engineering ultimately makes applications more secure, resilient, and robust.


Failure makes more reliable systems

Downtime is inevitable. However, if you implement chaos engineering correctly, you’ll take steps to control it and its effects on customers, absorbing much of that business risk. When an actual downtime incident occurs, with real associated costs, you’ve already thought about, planned, and executed an experiment to figure out how your infrastructure handles it. By taking ownership of downtime, you become more resilient as an organization. In summary, chaos engineering:

  1. Bullet-proofs your infrastructure with plans
  2. Makes addressing problems head-on a part of your company culture
  3. Gets you out of the mindset that your application will be up 100% of the time (every 9 has a price, after all: the more reliable the system, the faster the cost of each additional improvement climbs)
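
The arithmetic behind “every 9 has a price” is easy to sketch: each additional nine of availability cuts the yearly downtime budget by a factor of ten. The function name below is hypothetical, and the example covers only the downtime budget, since the cost of reaching each level depends on the organization.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (2 means 99%)."""
    if nines < 1:
        raise ValueError("nines must be >= 1")
    return MINUTES_PER_YEAR / 10 ** nines


# Budgets for 99%, 99.9%, 99.99%, and 99.999% availability.
budgets = {n: allowed_downtime_minutes(n) for n in range(2, 6)}
for n, minutes in budgets.items():
    print(f"{n} nines: about {minutes:.1f} minutes of downtime per year")
```

Going from two nines to five shrinks the budget from roughly 3.65 days a year to a little over five minutes, which is why each extra nine demands disproportionately more engineering effort.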


Even the most reliable services experience downtime, so it’s best to be prepared. Chaos engineering is a preemptive practice that front-loads planning, problem-solving, and troubleshooting before downtime affects your customers. And that’s worth wreaking havoc over.

 

Read Splunk’s The Hidden Costs of Downtime report for more recommendations on championing a resilient business.
