Rethinking Availability: Why You Need Collaborative Incident Response

By Splunk

In order to stay competitive in today’s market, businesses are expected to innovate—quickly. Many engineering teams feel pressure to build, deploy and operate services with increasing speed. High-performing teams innovate faster and maintain their sanity because they’re able to quickly recover from incidents.

As engineering teams collectively move from agile development to rapid deployment and a culture of DevOps, teams need to think beyond a reactive operations center. In order to do so, progressive teams are turning to improved, collaborative incident management workflows and tooling.

Managing incident response is more than just a “nice to have,” and it certainly requires more than a ticketing or alerting system. Collaborative incident response is a DevOps essential and, perhaps more importantly, a cornerstone of engaging high-performing engineering teams who champion uptime and own on-call instead of fearing it.

High availability is essential to business success, and it is made harder by the increasing deployment demands of a highly competitive market. Accordingly, investing in processes that ensure near-zero downtime alongside rapid deployment is mission critical for your entire engineering and IT department.

Ultimately, rethinking and retooling the IT approach to incident response is imperative to delivering the world-class customer experiences that keep businesses relevant.

Here, we break down the cost of downtime, the competitive advantage of uptime and how a collaborative, DevOps-driven approach to incident management is key to maintaining a culture of availability without slowing innovation.

The Negative Economic Impact of Downtime

For the Fortune 1000, the average total cost of unplanned application downtime falls somewhere between $1.25 billion and $2.5 billion annually. The average cost of an infrastructure failure is $100,000 per hour. The average cost of a critical application failure is $500,000 to $1 million per hour (DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified, IDC). As deployment demands grow alongside customer expectations of uptime, this cost is only likely to increase. This is an expensive problem we need to address.
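To make the hourly figures above concrete, here is a minimal sketch of the arithmetic in Python. The hourly rates are the IDC averages quoted above; the function name and the 45-minute outage are hypothetical and chosen only for illustration.

```python
# Hourly cost figures are the IDC averages quoted above.
INFRA_FAILURE_COST_PER_HOUR = 100_000
CRITICAL_APP_COST_PER_HOUR_LOW = 500_000
CRITICAL_APP_COST_PER_HOUR_HIGH = 1_000_000


def downtime_cost(outage_minutes, cost_per_hour):
    """Estimate the cost of a single outage, prorated by its duration."""
    return cost_per_hour * outage_minutes / 60


if __name__ == "__main__":
    outage_minutes = 45  # hypothetical outage length
    low = downtime_cost(outage_minutes, CRITICAL_APP_COST_PER_HOUR_LOW)
    high = downtime_cost(outage_minutes, CRITICAL_APP_COST_PER_HOUR_HIGH)
    print(f"Estimated cost of a {outage_minutes}-minute critical outage: "
          f"${low:,.0f} to ${high:,.0f}")
```

Real costs rarely scale this linearly, so treat the result as a rough order-of-magnitude estimate rather than a forecast.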

It’s also important to note that these costs aren’t limited to the enterprise. Outages (and their costs) affect companies large and small. Their impact extends beyond direct cost to brand reputation and overall customer trust. For example, in 2017, GitLab lost a large amount of customer data after an error (and subsequent failures of multiple redundant backup protocols). Customer projects, comments and other data were all gone. While source code repositories were safeguarded, the loss was problematic for a company whose business involves data stewardship (Postmortem of Database Outage on January 31, GitLab).

In the VictorOps 2017 State of On-Call Report, 56% of respondents named revenue impact as the biggest negative result of downtime in their business. But the survey also reinforced the previous point: downtime affects more than just revenue, and the repercussions of a major outage are felt throughout the entire business.

Competitive Advantage of Minimal Downtime

More advanced companies use historical incident data to proactively prepare teams to resolve incidents faster and prevent those major outages in the first place. This proactive approach to incident resolution, in turn, becomes a competitive advantage as highly functional “on-call” teams help prevent revenue loss, maintain brand reputation and drive customer satisfaction.

High-performing teams tend to fare far better than competitors when it comes to both throughput and stability. Recent research demonstrates these high performers are deploying 46x more frequently, with 440x faster lead time from commit to deploy, all while maintaining a mean time to recover (MTTR) that’s 96x faster. And change failure rate? It’s 5x lower, so changes are ⅕ as likely to fail (2017 State of DevOps Report, Puppet).
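As a minimal sketch of how a team might track MTTR from its own historical incident data, the Python below averages the time between detection and resolution. The incident records, timestamp format and function name are hypothetical; in practice the data would come from whatever incident management tool the team uses.

```python
from datetime import datetime

# Hypothetical incident records: (detected_at, resolved_at).
INCIDENTS = [
    ("2017-03-01 02:10", "2017-03-01 03:40"),  # 90 minutes
    ("2017-03-09 14:00", "2017-03-09 14:30"),  # 30 minutes
    ("2017-03-20 22:15", "2017-03-20 23:15"),  # 60 minutes
]

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def mean_time_to_recover(incidents):
    """Return the mean time to recover, in minutes, across all incidents."""
    durations = []
    for detected_at, resolved_at in incidents:
        start = datetime.strptime(detected_at, TIMESTAMP_FORMAT)
        end = datetime.strptime(resolved_at, TIMESTAMP_FORMAT)
        durations.append((end - start).total_seconds() / 60)
    return sum(durations) / len(durations)


if __name__ == "__main__":
    print(f"MTTR: {mean_time_to_recover(INCIDENTS):.0f} minutes")
```

Tracking this number over time, rather than as a one-off, is what lets a team see whether changes to its incident response process are actually helping.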

Why Collaboration Is the Solution

Okay, so downtime sucks and uptime is awesome. Great! Now how do I make that a reality?

There are a variety of ways teams can approach a goal of near-zero downtime (chaos engineering, SRE, etc.). However, outages will happen, and when it comes to resolving critical incidents in production, collaboration is the answer.

A collaborative approach to incident response allows teams to:

Enable real-time problem solving
Leverage critical subject matter expertise
Share situation data
Streamline communications and avoid bottlenecks
Reduce on-call burnout
Share ownership of both code and uptime

Teams looking to positively impact everything from employee retention to customer satisfaction should invest in a collaborative approach to incident management—from tooling to processes. When developers take on-call responsibilities, they learn how the system functions in production, helping them think about reliability as they write and deploy new code.

A DevOps culture of collaboration and transparency reduces cross-functional blind spots, exposes more people to systems in production and provides the team with a holistic understanding of your infrastructure, leading to more reliable systems. Then, when the inevitable happens—an outage does occur—the team has the historical knowledge and the tools to collaborate and quickly remediate the incident.

Ready to DevOps-ify your approach to incident management? Get started with a free, 14-day trial of VictorOps (no credit card required). Not only will VictorOps help you and your teams find and resolve incidents faster, we promise it will make on-call suck a whole lot less.

----------------------------------------------------
Thanks!
Todd Vernon

Splunk

The world’s leading organizations trust Splunk to help keep their digital systems secure and reliable. Our software solutions and services help to prevent major issues, absorb shocks and accelerate transformation. Learn what Splunk does and why customers choose Splunk.

