When it comes to downtime, the focus has mainly been on incidents caused by traditional IT issues, overlooking ones brought on by cybersecurity failures. However, downtime can come from anywhere: According to Splunk’s The Hidden Costs of Downtime report, 56% of incidents are cybersecurity-related, while 44% stem from application or infrastructure issues.
Consequently, the most resilient companies employ mitigation strategies that account for both application or infrastructure issues and cybersecurity failures. Here, we explore downtime’s most common culprits and the best practices your organization can adopt to mitigate it, regardless of origin.
Humans are (mostly) to blame
Despite an abundance of downtime causes, human error is number one. This is true across Security, ITOps, and Engineering. Half of all technology executives surveyed admit that human error, such as misconfiguring software or infrastructure, is "often" or "very often" to blame. And it's clear why: even simple mistakes can lead to performance errors that drag systems down or put a company's security at risk.
Not only is human error the most common cause of downtime, but it's also the hardest to detect and fix: its mean time to detect (MTTD) is 17-18 hours, and its mean time to recover (MTTR) is 67-76 hours. That's 2-3 days of panic and finger-pointing.
On the security side, respondents say malware and phishing attacks are also frequent causes, while some of the rarest incidents they encounter take the longest to find and fix. For example, detection and recovery times for "zero-day" exploits are likely high because the root cause is difficult to identify and organizations often lack the processes to address it.
Software failures create considerable downtime for ITOps and Engineering teams as organizations adopt modern application development and deployment practices that are more complex and introduce more points of failure. Meanwhile, 34% of ITOps and Engineering professionals also blame hardware failure.
What other factors could cause downtime? Some 43% of tech respondents admit their dev teams often go outside the approved tech stack to deploy new technologies, which can contribute to more downtime and serious security incidents. Meanwhile, 78% say their organization is willing to accept downtime risk to adopt new technologies. Complexity in an application's infrastructure and architecture, along with heavy pressure to push innovation out quickly, leads to even more instances of human error.
Cutting down on downtime
Whether an incident stems from a security breach, a network outage, or a software or hardware failure, the best practices below can help your organization mitigate downtime.
- Always root out the root cause
54% of technology executives surveyed admit they sometimes intentionally do not fix the root cause of a downtime incident. This could be for several reasons: for example, they may already plan to decommission the older application responsible for the outage, or fixing it could have larger impacts or create outages in other areas of the business. Splunk nevertheless recommends finding and fixing an incident's root cause as a best practice, because it stops repeat issues by singling out the underlying problem and pointing to a fix.
Pro tip: Investing in observability solutions and integrating and instrumenting your data across your environment (including security and data teams) will make finding and fixing root causes much easier. Getting rid of data silos enables thorough postmortems that help prevent repeat issues; the sketch below shows what that instrumentation can look like in practice.
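To make that concrete, here is a minimal sketch of instrumenting a service with OpenTelemetry in Python so traces land in whatever observability backend you use. The package names are real OpenTelemetry libraries, but the service name, collector endpoint, and handle_request function are illustrative assumptions, not details from the report.

```python
# Minimal OpenTelemetry tracing setup (requires opentelemetry-sdk and
# opentelemetry-exporter-otlp-proto-grpc). Names and endpoint are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Identify the service so every team sees the same name in traces
resource = Resource.create({"service.name": "checkout-service"})  # hypothetical service
provider = TracerProvider(resource=resource)
# Ship spans to a collector; swap the endpoint for your environment
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_request(order_id: str) -> None:
    # Each span carries shared context that SecOps, ITOps, and Engineering
    # can all query during a postmortem, instead of digging through silos.
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...
```

Once spans like these flow into a shared backend, a postmortem can trace a failed request across services rather than reconciling each team's separate logs.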
- Connect your teams and tools
Since downtime can come from almost anywhere, complete visibility across SecOps, ITOps, and Engineering teams is essential. Sharing tools, data, and context enables easier collaboration and problem-solving across teams, helping your organization identify and fix the root cause faster and get back up and running sooner.
- Be proactive
Resilient organizations take the lead in preventing issues. By investing in AI- and ML-driven solutions for pattern recognition, you’re equipping your SecOps, ITOps, and Engineering teams with a proactive and collaborative downtime prevention program. Predictive analytics powered by AI act as a force multiplier, helping to avert issues before they occur. Over half of technology executives surveyed report using generative AI features embedded into existing solutions to address downtime, with 64% claiming significant benefits. The most resilient organizations are more mature in their adoption of generative AI, expanding their use of these features at 4x the rate of the majority of respondents.
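As one illustration of the pattern-recognition idea, here is a minimal sketch of flagging anomalous operational metrics with scikit-learn's IsolationForest. The metric names, thresholds, and synthetic data are assumptions for demonstration, not drawn from the report or any particular product.

```python
# Minimal anomaly-detection sketch: train on historical metrics, then flag
# incoming observations that deviate from learned patterns.
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for historical per-minute metrics:
# columns are [cpu_percent, error_rate, latency_ms]
rng = np.random.default_rng(42)
baseline = rng.normal(loc=[40.0, 0.5, 120.0], scale=[5.0, 0.2, 15.0], size=(1000, 3))

# Assume roughly 1% of historical points are anomalous
model = IsolationForest(contamination=0.01, random_state=42).fit(baseline)

# Score new observations: 1 = looks routine, -1 = likely anomaly worth alerting on
incoming = np.array([
    [42.0, 0.6, 125.0],   # within normal ranges
    [95.0, 7.5, 900.0],   # CPU, error rate, and latency spike together
])
print(model.predict(incoming))  # expected output: [ 1 -1]
```

In practice, a model like this watches live telemetry and pages a team when patterns drift, surfacing a brewing incident before it becomes an outage.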
- Adopt a no-tolerance approach to downtime
Our research underscores that the most resilient organizations experience downtime less frequently, recover faster, and incur lower overall costs. Why? They grasp the financial impact of downtime more keenly than others. They see the substantial costs and view downtime as unacceptable, investing deliberately in practices and solutions to prevent it.
Resilience restores balance
If there’s one lesson to take away from The Hidden Costs of Downtime, it’s that digital resilience is a business imperative. The majority of technology executives surveyed admit that the negative impacts they experience from downtime are unacceptable. There’s just too much at stake, both for companies and customers. By understanding that downtime can come from application, infrastructure, and security issues, and by putting plans in place that address its diverse causes, you can champion a more resilient business.
Read The Hidden Costs of Downtime report for more on how the most resilient organizations set themselves apart from the rest and Splunk’s recommendations for deterring downtime.