Two things drive customer satisfaction more than pretty much anything else:
Software developers and product support teams should always be working to improve on both. But where to start? Collecting and analyzing the right metrics is one way your organization can improve the efficiency of incident management and overall application quality. Still, these questions remain:
Read on to learn about eight key metrics essential to incident response. Discover how these metrics and incident management KPIs can provide insights that add value to your customers — both in the quality of your application and the efficiency of your incident response strategy.
Generally speaking, there are two options that DevOps and IT organizations have to improve customer experience with a given product:
Incident response informs both areas of development. These incident response metrics help organizations take these steps more thoughtfully:
The first metric to analyze, issue classification, may also be the most impactful. Here, we want to:
Repeated failures of the same functionality will likely trace back to the same root cause. Resolving that root cause could fix the problem for good.
By extension, application slowness may be the result of improper query construction. Simply optimizing these queries could lead to better performance and happier customers.
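As a minimal sketch of issue classification, the snippet below counts incidents by category and affected component so that repeat offenders stand out. The record fields (`category`, `component`) and the sample data are assumptions for illustration, not a schema from any particular tool.

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": 1, "category": "performance", "component": "search"},
    {"id": 2, "category": "functional failure", "component": "checkout"},
    {"id": 3, "category": "functional failure", "component": "checkout"},
    {"id": 4, "category": "performance", "component": "search"},
    {"id": 5, "category": "functional failure", "component": "checkout"},
    {"id": 6, "category": "usability", "component": "login"},
]


def top_issue_groups(records, n=3):
    """Count incidents per (category, component) pair and return the n most common."""
    counts = Counter((r["category"], r["component"]) for r in records)
    return counts.most_common(n)


for (category, component), count in top_issue_groups(incidents):
    print(f"{category} / {component}: {count}")
```

In this sample, the checkout failures surface first, which is the kind of repeated failure worth tracing to a single root cause.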
MTTD measures the average time it takes for the security team to detect a security incident from the moment it occurs. If your MTTD is 5 hours, it means that it takes an average of 5 hours to identify a security incident after it has occurred.
Evaluating the MTTD is very important because the sooner you detect an incident, the quicker you can respond, and this will minimize potential damage.
In 2017, Equifax suffered a massive data breach that exposed the private information of 147 million people. The breach went undetected for more than 70 days, highlighting the importance of reducing MTTD. A lower MTTD could have significantly reduced the impact of the breach.
As an organization, you can implement real-time monitoring tools and automate anomaly detection to quickly identify potential weaknesses.
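To make the MTTD definition concrete, here is a small sketch that averages the gap between when each incident occurred and when it was detected. The field names `occurred_at` and `detected_at` and the timestamps are assumptions chosen so the result matches the 5-hour example above.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average of (detected_at - occurred_at) across incidents, as a timedelta."""
    gaps = [i["detected_at"] - i["occurred_at"] for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical incidents with detection delays of 3, 8, and 4 hours.
incidents = [
    {"occurred_at": datetime(2024, 3, 1, 8, 0), "detected_at": datetime(2024, 3, 1, 11, 0)},
    {"occurred_at": datetime(2024, 3, 4, 22, 0), "detected_at": datetime(2024, 3, 5, 6, 0)},
    {"occurred_at": datetime(2024, 3, 9, 13, 0), "detected_at": datetime(2024, 3, 9, 17, 0)},
]

mttd = mean_time_to_detect(incidents)
print(f"MTTD: {mttd.total_seconds() / 3600:.1f} hours")  # prints "MTTD: 5.0 hours"
```

In practice the occurrence time is often only known after investigation, so the figure is usually computed retrospectively.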
MTTA is the average time it takes for an incident response team to acknowledge a reported incident, and it can reveal a lot about the effectiveness of your overall incident management practice. While the acknowledgment time for any particular incident may not indicate a trend, calculating the mean time to acknowledge can help determine whether your incident management strategy needs improvement.
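Because a single acknowledgement time says little on its own, one way to look for a trend is to average acknowledgement delays per week. The sketch below does that; the field names `reported_at` and `acknowledged_at` and the sample data are assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def weekly_mtta_minutes(incidents):
    """Average acknowledgement delay in minutes, keyed by ISO (year, week) of the report."""
    by_week = defaultdict(list)
    for inc in incidents:
        delay = (inc["acknowledged_at"] - inc["reported_at"]).total_seconds() / 60
        iso = inc["reported_at"].isocalendar()
        by_week[(iso[0], iso[1])].append(delay)
    return {week: mean(delays) for week, delays in sorted(by_week.items())}


# Hypothetical incidents across two consecutive weeks.
incidents = [
    {"reported_at": datetime(2024, 1, 2, 9, 0), "acknowledged_at": datetime(2024, 1, 2, 9, 4)},
    {"reported_at": datetime(2024, 1, 4, 14, 0), "acknowledged_at": datetime(2024, 1, 4, 14, 10)},
    {"reported_at": datetime(2024, 1, 9, 10, 0), "acknowledged_at": datetime(2024, 1, 9, 10, 12)},
    {"reported_at": datetime(2024, 1, 11, 16, 0), "acknowledged_at": datetime(2024, 1, 11, 16, 20)},
]

for (year, week), minutes in weekly_mtta_minutes(incidents).items():
    print(f"{year}-W{week:02d}: MTTA {minutes:.1f} min")
```

Here the weekly MTTA rises from 7 to 16 minutes, a change that no single incident would reveal.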
A better incident management strategy can facilitate faster response times and let customers know they’re not forgotten—going a long way towards customer satisfaction. These alterations could include:
Similarly, another important incident response metric to track is the time to resolution for reported incidents. You can certainly average these out: that’s where the mean time to resolve (MTTR) metric comes in. But sometimes, for outlier situations, you want to focus simply on how long it takes to solve the incident you’re currently working on.
The goal, of course, is to resolve incidents as quickly and efficiently as possible. Calculating the mean time to resolve and the average time to resolve for particular issues can provide insights that suggest where to focus on improving your incident response strategy. (MTTR isn't limited to incident response: it's also an important failure metric for IT systems.)
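The difference between an overall MTTR and the average time to resolve particular kinds of issues can be sketched as follows. The categories and hour values are hypothetical, and the grouping key is an assumption about how incidents are labeled.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical resolved incidents: (category, hours from report to resolution).
resolved = [
    ("database", 2.0),
    ("database", 4.0),
    ("network", 1.0),
    ("network", 3.0),
    ("deployment", 20.0),  # an outlier that pulls the overall mean up
]


def mttr_overall(records):
    """Mean time to resolve across all incidents, in hours."""
    return mean(hours for _, hours in records)


def mttr_by_category(records):
    """Mean time to resolve per category, in hours."""
    grouped = defaultdict(list)
    for category, hours in records:
        grouped[category].append(hours)
    return {category: mean(values) for category, values in grouped.items()}


print(f"Overall MTTR: {mttr_overall(resolved):.1f} h")  # prints "Overall MTTR: 6.0 h"
for category, hours in mttr_by_category(resolved).items():
    print(f"{category}: {hours:.1f} h")
```

The overall figure of 6 hours hides the fact that database and network incidents resolve in 2 to 3 hours, while the single deployment incident accounts for most of the total.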
Sometimes, improving documentation, communication, and knowledge sharing alone can reduce MTTR. But you might need to dig deeper or make bigger changes to significantly improve efficiency in this area.
When Carrefour, the eighth-largest global retailer, wanted to improve customer experience across its online channels, it focused on improving MTTR by using actionable insights into system performance. This MTTR improvement means Carrefour is now…
MTBF is the mean time between one system failure and the next. It's commonly used to assess the reliability of hardware, software, or systems.
Here, a higher number is good: the more time between failures for a given system or service, the greater your system reliability — reducing the likelihood of incidents caused by system failures.
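Following the definition above, a simple way to estimate MTBF is to average the gaps between consecutive failure timestamps. This is a sketch under that assumption: it does not subtract repair time, which some MTBF definitions exclude, and the timestamps are made up.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failures; needs at least two timestamps."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("need at least two failures to measure a gap")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Hypothetical failure log for one service: gaps of 10 and 20 days.
failures = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]

print(f"MTBF: {mean_time_between_failures(failures).days} days")  # prints "MTBF: 15 days"
```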
In December 2021, Amazon Web Services (AWS), a major cloud service provider, experienced a significant outage primarily affecting the Northern Virginia (US-EAST-1) region due to a severe failure. The incident highlighted the importance of monitoring MTBF to ensure system reliability and prevent downtime.
To improve on your MTBF:
Mean time to inventory (MTTI) is the average amount of time it takes an organization to identify and log a new device, system, or software into its inventory of IT assets after it connects to the network.
MTTI matters because a low value means that devices and systems on a network are identified and monitored quickly, which reduces the risk of unauthorized access or attacks. A shorter MTTI helps an organization maintain better visibility and control over its IT assets, which is essential for preventing breaches.
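A rough sketch of the MTTI calculation: average the time from when an asset first connects to when it is logged, and separately list assets that have not been logged at all. The field names `connected_at` and `inventoried_at` and the asset names are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical asset records; inventoried_at is None for assets not yet logged.
assets = [
    {"name": "laptop-17", "connected_at": datetime(2024, 5, 1, 9, 0),
     "inventoried_at": datetime(2024, 5, 1, 11, 0)},
    {"name": "printer-3", "connected_at": datetime(2024, 5, 2, 9, 0),
     "inventoried_at": datetime(2024, 5, 2, 15, 0)},
    {"name": "iot-cam-8", "connected_at": datetime(2024, 5, 3, 9, 0),
     "inventoried_at": None},
]


def mean_time_to_inventory(records):
    """Average delay between connection and inventory, over inventoried assets only."""
    gaps = [r["inventoried_at"] - r["connected_at"]
            for r in records if r["inventoried_at"] is not None]
    return sum(gaps, timedelta()) / len(gaps)


def not_yet_inventoried(records):
    """Names of assets that connected but have no inventory entry yet."""
    return [r["name"] for r in records if r["inventoried_at"] is None]


print(mean_time_to_inventory(assets))  # prints "4:00:00"
print(not_yet_inventoried(assets))     # prints "['iot-cam-8']"
```

Reporting the untracked assets alongside the mean is a deliberate choice here: an average over logged assets alone would look healthy even when some devices are never inventoried.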
In September 2019, threat actors gained unauthorized access to SolarWinds’ network. The intrusion went undetected for more than a year, until it was publicly disclosed in December 2020. The SolarWinds cyberattack highlighted the importance of rapid asset inventory. Attackers exploited systems that weren't properly tracked or monitored.
Organizations with low MTTI were better positioned to identify compromised assets quickly and respond effectively, while those with high MTTI struggled to even know what was on their networks. This shows why MTTI is a critical metric for cybersecurity resilience.
To improve your MTTI, consider:
Tracking exactly when each incident occurs can also highlight important trends — even if the incidents are seemingly unrelated.
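One simple way to surface timing trends is to bucket incident timestamps by weekday and hour and look for clusters. The sketch below assumes you have a list of incident start times; the sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def incidents_by_weekday_and_hour(timestamps):
    """Count incidents per (weekday name, hour of day) bucket."""
    return Counter((WEEKDAYS[ts.weekday()], ts.hour) for ts in timestamps)


# Hypothetical incident start times.
timestamps = [
    datetime(2024, 6, 3, 2, 15),
    datetime(2024, 6, 10, 2, 40),
    datetime(2024, 6, 17, 2, 5),
    datetime(2024, 6, 5, 14, 30),
]

for (weekday, hour), count in incidents_by_weekday_and_hour(timestamps).most_common():
    print(f"{weekday} {hour:02d}:00 -> {count}")
```

In this sample, three otherwise unrelated incidents all fall on Monday around 02:00, which might point to a scheduled job or maintenance window worth investigating.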
How often are incidents being escalated or rerouted to different units within the organization? If the answer is "very often" or "too often," the incident response strategy likely needs some changes.
The goal, of course, is to have your primary incident response team handle as much of the resolution as possible, without needing to escalate. Still, escalations are inevitable, as certain issues may require specific expertise on another team, for instance.
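As an illustrative sketch, the escalation rate can be computed as the share of incidents handed off to another team, together with a count of where they went. The `escalated_to` field and the team names are assumptions.

```python
from collections import Counter

# Hypothetical incidents; escalated_to is None when the primary team resolved it.
incidents = [
    {"id": 101, "escalated_to": None},
    {"id": 102, "escalated_to": "database team"},
    {"id": 103, "escalated_to": None},
    {"id": 104, "escalated_to": None},
]


def escalation_rate(records):
    """Fraction of incidents that were escalated to another team."""
    if not records:
        return 0.0
    escalated = sum(1 for r in records if r["escalated_to"] is not None)
    return escalated / len(records)


def escalation_targets(records):
    """Count of escalations per receiving team."""
    return Counter(r["escalated_to"] for r in records if r["escalated_to"] is not None)


print(f"Escalation rate: {escalation_rate(incidents):.0%}")  # prints "Escalation rate: 25%"
print(escalation_targets(incidents))
```

Knowing which team receives most escalations can suggest where the primary responders would benefit from more training or documentation.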
So what changes can you make to lower the number of incident escalations?
Use incident severity levels to your advantage. And be aware that severity is not the same as priority.
It’s easy to see how collecting and analyzing the right incident response metrics can improve the incident management process and enhance application quality. But why is this so important?
Look no further than online retailers, financial institutions, and social media companies. Slow incident response times and frequent application issues can quickly sully a company’s reputation, leaving you to fight an uphill battle against your competitors.
But a positive customer experience can mean the difference between being the go-to organization or being completely irrelevant. Just ask Papa Johns, the world’s third-largest pizza delivery company. To keep all its operations running smoothly, it needed visibility into its complex hybrid environment. Today, the team can find and fix issues fast:
“It used to take us days to find out about issues with a new release. Now with our custom dashboard built with Splunk Dashboard Studio, we can pinpoint and fix a problem on the same day so that customers can place orders seamlessly,” says Willie James, director of resiliency services at Papa Johns.
Reliability and prompt issue resolution can help cultivate trust between an organization and its customer base, leading to recurring customers and a positive reputation that draws in new customers.
SLA compliance measures how well an organization adheres to the terms of its service level agreements (SLAs). SLAs are formal agreements between service providers and customers that define the expected level of service, including:
Both SLAs and SLOs (service level objectives) play a critical role in incident management by defining the expected response and resolution times for incidents, ensuring that teams prioritize and manage incidents effectively to meet customer expectations. By meeting SLA and SLO targets, teams can minimize downtime, reduce the impact of incidents, and maintain high service quality.
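SLA compliance can be expressed as the percentage of incidents that met both their response and resolution targets. The sketch below assumes a 15-minute response target and a 4-hour resolution target; both values, and the sample incidents, are placeholders rather than recommended thresholds.

```python
from datetime import timedelta

# Assumed targets for illustration; real values come from your SLA.
RESPONSE_TARGET = timedelta(minutes=15)
RESOLUTION_TARGET = timedelta(hours=4)

# Hypothetical incidents with measured response and resolution times.
incidents = [
    {"response": timedelta(minutes=5), "resolution": timedelta(hours=1)},
    {"response": timedelta(minutes=20), "resolution": timedelta(hours=2)},  # late response
    {"response": timedelta(minutes=10), "resolution": timedelta(hours=6)},  # late resolution
    {"response": timedelta(minutes=12), "resolution": timedelta(hours=3)},
]


def sla_compliance(records, response_target, resolution_target):
    """Percentage of incidents that met both the response and resolution targets."""
    met = sum(
        1 for r in records
        if r["response"] <= response_target and r["resolution"] <= resolution_target
    )
    return 100 * met / len(records)


rate = sla_compliance(incidents, RESPONSE_TARGET, RESOLUTION_TARGET)
print(f"SLA compliance: {rate:.0f}%")  # prints "SLA compliance: 50%"
```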
Here at Splunk, we use our own monitoring, observability, and cybersecurity solutions to power our 24/7 SOC. See how we achieve a 7-minute mean time to detect phishing attacks.
Already use Splunk? Learn how to customize your environment to achieve the lowest MTTD in this hands-on Tech Talk.