Application quality and incident response pack a punch when it comes to customer experience. A software team should always be working to improve on both. But — where to start?
Collecting and analyzing the right metrics is one way that your organization can improve the efficiency of incident management and overall application quality. Still, the questions remain:
Read on to learn about five key metrics essential to incident response. Discover how these metrics and incident management KPIs can provide insights that add value for your customers, both in the quality of your application and in the efficiency of your incident response strategy.
The best part? We’ve included real-world examples that show just how these metrics help organizations get better.
DevOps and IT organizations have two options when it comes to providing a better customer experience with their product:
Now, let’s look at five metrics that can help an organization take these steps more thoughtfully:
The first metric to analyze, issue classification, may also be the most impactful.
Here, we want to look at the most frequently reported issues with a given application. Track commonly reported errors and performance issues, and pass them to the development team for root cause analysis via post-incident reviews. Repeated failures of the same functionality will likely trace back to a single root cause; resolving it could fix the problem for good.
Similarly, application slowness may be the result of poorly constructed queries. Simply optimizing these queries could lead to better performance and happier customers.
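As a rough illustration of issue classification, the sketch below tallies reported incidents by category so the most frequent ones can be handed to the development team. The record layout, the `category` field and the `top_issues` helper are assumptions made for this example; they do not come from any particular incident management tool.

```python
from collections import Counter

# Hypothetical incident reports; the "category" labels are illustrative only.
reports = [
    {"id": 1, "category": "checkout-timeout"},
    {"id": 2, "category": "login-error"},
    {"id": 3, "category": "checkout-timeout"},
    {"id": 4, "category": "slow-search"},
    {"id": 5, "category": "checkout-timeout"},
    {"id": 6, "category": "login-error"},
]


def top_issues(reports, n=3):
    """Return the n most frequently reported categories with their counts."""
    counts = Counter(report["category"] for report in reports)
    return counts.most_common(n)


if __name__ == "__main__":
    for category, count in top_issues(reports):
        print(f"{category}: {count} reports")
```

The categories at the top of this list are reasonable first candidates for a post-incident review.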
The time it takes for an incident response team to acknowledge a reported incident can reveal a lot about the effectiveness of your overall incident management practice. While the acknowledgment time for any single incident may not indicate a trend, calculating the mean time to acknowledge (MTTA) across incidents can help determine whether your incident management strategy needs improvement.
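A minimal sketch of the MTTA calculation, assuming each incident record carries a `reported_at` and an `acknowledged_at` timestamp. Both field names and the sample data are hypothetical. Incidents that have not been acknowledged yet are left out of the average.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are assumptions for this sketch.
incidents = [
    {"reported_at": datetime(2024, 1, 1, 9, 0), "acknowledged_at": datetime(2024, 1, 1, 9, 5)},
    {"reported_at": datetime(2024, 1, 1, 10, 0), "acknowledged_at": datetime(2024, 1, 1, 10, 15)},
    {"reported_at": datetime(2024, 1, 1, 11, 0), "acknowledged_at": None},  # not yet acknowledged
]


def mean_time_to_acknowledge(incidents):
    """Return MTTA in minutes, or None if no incident has been acknowledged."""
    minutes = [
        (i["acknowledged_at"] - i["reported_at"]).total_seconds() / 60
        for i in incidents
        if i.get("acknowledged_at") is not None
    ]
    return mean(minutes) if minutes else None


if __name__ == "__main__":
    print(f"MTTA: {mean_time_to_acknowledge(incidents)} minutes")
```

Tracking this value per week or per month, rather than per incident, is what makes a trend visible.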
A better incident management strategy can facilitate faster response times and let customers know they’re not forgotten — going a long way towards customer satisfaction. These alterations could include:
Similarly, another important incident response metric to track is the time to resolution for reported incidents. The goal, of course, is to resolve incidents as quickly and efficiently as possible. Calculating the mean time to resolve (MTTR) and the average time to resolve for particular issues can provide insights that suggest where to focus on improving your incident response strategy. (MTTR isn't limited to incident response: it's also an important failure metric for IT systems.)
Sometimes, improving documentation, communication and knowledge sharing alone can reduce MTTR. But you might need to dig deeper or make bigger changes to significantly improve efficiency in this area.
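The sketch below shows one way to compute MTTR overall and per issue type, assuming each record has `reported_at`, `resolved_at` and `category` fields. These names and the sample data are hypothetical. Here "time to resolve" is measured from the report to the resolution; some teams measure from detection or acknowledgment instead.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical resolved incidents; field names are assumptions for this sketch.
incidents = [
    {"category": "database", "reported_at": datetime(2024, 1, 1, 8, 0), "resolved_at": datetime(2024, 1, 1, 10, 0)},
    {"category": "database", "reported_at": datetime(2024, 1, 2, 8, 0), "resolved_at": datetime(2024, 1, 2, 14, 0)},
    {"category": "ui", "reported_at": datetime(2024, 1, 3, 8, 0), "resolved_at": datetime(2024, 1, 3, 9, 0)},
]


def hours_to_resolve(incident):
    """Elapsed hours between the report and the resolution of one incident."""
    delta = incident["resolved_at"] - incident["reported_at"]
    return delta.total_seconds() / 3600


def mttr(incidents):
    """Mean time to resolve in hours over all resolved incidents."""
    hours = [hours_to_resolve(i) for i in incidents if i.get("resolved_at")]
    return mean(hours) if hours else None


def mttr_by_category(incidents):
    """Mean time to resolve in hours, grouped by issue category."""
    grouped = defaultdict(list)
    for i in incidents:
        if i.get("resolved_at"):
            grouped[i["category"]].append(hours_to_resolve(i))
    return {category: mean(hours) for category, hours in grouped.items()}


if __name__ == "__main__":
    print("Overall MTTR (hours):", mttr(incidents))
    print("MTTR by category:", mttr_by_category(incidents))
```

A category whose average sits well above the overall MTTR is a natural place to look first for documentation or process gaps.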
Here’s a real-world example: When Carrefour, the eighth-largest global retailer, wanted to improve customer experience across its online channels, it focused on improving MTTR by using actionable insights into system performance. This MTTR improvement means Carrefour is now…
Tracking exactly when each incident occurs can also highlight important trends, even if the incidents are seemingly unrelated. For example:
Knowing this type of information gives the development team insight that helps them track down problems more quickly and easily.
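One simple way to look for timing trends is to bucket incident timestamps by hour of day and by day of week. The sketch below assumes a plain list of incident start times; the sample timestamps and helper names are invented for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical incident start times.
timestamps = [
    datetime(2024, 1, 1, 2, 10),   # Monday, 02:10
    datetime(2024, 1, 8, 2, 45),   # Monday, 02:45
    datetime(2024, 1, 3, 14, 0),   # Wednesday, 14:00
]


def incidents_by_hour(timestamps):
    """Count incidents per hour of day (0-23)."""
    return Counter(ts.hour for ts in timestamps)


def incidents_by_weekday(timestamps):
    """Count incidents per day of week, using English day names."""
    return Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)


if __name__ == "__main__":
    print("By hour:", incidents_by_hour(timestamps).most_common())
    print("By weekday:", incidents_by_weekday(timestamps).most_common())
```

A cluster in one bucket, such as the early hours of Monday in this sample, suggests checking what else runs or changes at that time.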
Are incidents frequently being escalated or rerouted to different units within the organization? If so, the incident response strategy likely needs some changes. These changes can range widely, for example:
(Use incident severity levels to your advantage. And beware that severity is not the same as priority.)
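To judge whether escalations and reroutes are frequent, it can help to put a number on them. The sketch below assumes each incident record stores an `escalations` count and the `first_team` it was routed to. Both fields, and the sample data, are hypothetical.

```python
from collections import Counter

# Hypothetical incident records; field names are assumptions for this sketch.
incidents = [
    {"id": 1, "first_team": "frontend", "escalations": 0},
    {"id": 2, "first_team": "frontend", "escalations": 2},
    {"id": 3, "first_team": "platform", "escalations": 0},
    {"id": 4, "first_team": "frontend", "escalations": 1},
]


def escalation_rate(incidents):
    """Fraction of incidents escalated or rerouted at least once."""
    if not incidents:
        return None
    escalated = sum(1 for i in incidents if i["escalations"] > 0)
    return escalated / len(incidents)


def escalations_by_first_team(incidents):
    """Count escalated incidents by the team that first received them."""
    return Counter(i["first_team"] for i in incidents if i["escalations"] > 0)


if __name__ == "__main__":
    print(f"Escalation rate: {escalation_rate(incidents):.0%}")
    print("Escalated, by first team:", dict(escalations_by_first_team(incidents)))
```

If most escalated incidents start with the same team, the initial routing rules for that team are worth reviewing.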
It’s easy to see how collecting and analyzing the right incident response metrics can improve the incident management process and enhance application quality. But why is this so important?
Look no further than online retailers, financial institutions and social media companies. Slow incident response times and frequent application issues can quickly sully a company’s reputation, leaving it to fight an uphill battle against its competitors.
But a positive customer experience can mean the difference between being the go-to organization or being completely irrelevant. Just ask Papa Johns, the world’s third-largest pizza delivery company. To keep all its operations running smoothly, it needed visibility into its complex hybrid environment. Today, the team can find and fix issues fast:
“It used to take us days to find out about issues with a new release. Now with our custom dashboard built with Splunk Dashboard Studio, we can pinpoint and fix a problem on the same day so that customers can place orders seamlessly,” says Willie James, director of resiliency services at Papa Johns.
Reliability and prompt issue resolution can help cultivate trust between an organization and its customer base, leading to recurring customers and a positive reputation that draws in new customers.