The availability and reliability of any IT service ultimately govern end-user experience and service performance, both of which have significant business impact.
These two concepts, availability and reliability, are particularly relevant in the era of cloud computing, where software drives business operations but is often managed and delivered as a service by third-party vendors. Availability and reliability are arguably among the most important metrics in providing IT services.
But how do you measure availability and reliability?
One of the key metrics used for measuring these service dependability characteristics is MTTR. Here’s everything you need to know about the MTTR metric including how it’s calculated, how to achieve a low MTTR and the challenges you might face along the way.
One of several "failure metrics", Mean Time To Recover (MTTR) refers to the average amount of time it takes to repair or recover from an issue or failure in a system, equipment or process. MTTR can also stand for mean time to repair, mean time to resolve and mean time to resolution, which are often used interchangeably. You might even see “average time” used instead of “mean time”; the two mean the same thing.
It is one of the most visible and useful metrics to determine how well an organization’s IT infrastructure, systems and equipment are performing, and how efficient and effective the IT team is when responding to critical incidents.
In application, the lower the MTTR value, the faster the organization can respond to and recover from incidents impacting service or production availability. MTTR varies based on both internal capabilities and external factors that influence the time taken to restore operations from a failed state to the acceptable operational state before the failure occurred.
(MTTR is just one of many metrics we can consider for incident response.)
When discussing MTTR, it's often assumed to be a single metric with a uniform interpretation. In reality, it can refer to as many as four distinct measurements. The "R" can signify repair, recovery, response or resolution, and while these metrics share some common ground, each carries its own significance and subtleties.
MTTR is a useful metric in a variety of settings. It’s most commonly associated with service management, where it helps assure that a service is being delivered to its end users as contractually promised. It’s also useful in software development, as in the DevOps practice of continuous development: as your software development matures, your MTTR will likely shrink.
When calculating MTTR, the clock starts ticking as soon as a failure is detected. MTTR includes the time it takes to diagnose the problem, repair it, test the fix and complete any other procedures required before the service is up and running and normal operations resume. A low MTTR is therefore preferable to a high one.
Most service-level agreements (SLAs) between a customer and a service provider or vendor include MTTR in some manner as a guarantee of performance, and a high MTTR can lead to high penalties. It’s important to remember that MTTR represents a typical repair time, not a guaranteed one. A vendor claiming an MTTR of 24 hours is saying that’s how long it usually takes to complete a repair, but individual incidents could take more or less time to resolve.
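To make the "typical, not guaranteed" point concrete, here is a minimal Python sketch. The repair times are made-up illustrative values, not data from any real vendor; they were chosen so that the average works out to 24 hours.

```python
from statistics import mean

# Hypothetical repair times in hours for five incidents (illustrative only).
repair_hours = [6, 10, 14, 18, 72]

mttr_hours = mean(repair_hours)    # the average repair time
longest_hours = max(repair_hours)  # the slowest individual repair

print(f"MTTR: {mttr_hours} h, longest single repair: {longest_hours} h")
```

With these assumed values the average is 24 hours, yet one incident took three times as long. That is why an MTTR figure alone says little about the worst case.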
The purpose of MTTR is to track the time that business-critical systems are unavailable for use, which makes it a valuable metric when analyzing the overall severity and impact of an IT incident. Mathematically, the Mean Time to Recover metric is defined as follows:
MTTR = Time elapsed as downtime / number of incidents
or
MTTR = Time elapsed as maintenance / number of repairs
For any impacted component, the MTTR includes the time that passes from the moment of the incident to the moment you’ve recovered to a state of operation and availability.
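The downtime formula above can be sketched in a few lines of Python. The function name `mean_time_to_recover` and the incident timestamps are hypothetical, and the sketch assumes each incident is recorded as a (start, recovered) pair; real incident tooling will store this differently.

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """Return total downtime divided by the number of incidents.

    `incidents` is a list of (start, recovered) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total_downtime = sum(
        (recovered - start for start, recovered in incidents),
        timedelta(),  # start the sum from a zero duration
    )
    return total_downtime / len(incidents)

# Hypothetical incident log (illustrative values only).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 10, 30)),    # 90 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 14, 45)), # 45 minutes
    (datetime(2024, 3, 20, 22, 15), datetime(2024, 3, 21, 0, 0)),  # 105 minutes
]

mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr}")
```

Here the three outages add up to 240 minutes of downtime, so the MTTR is 80 minutes.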
In this context, availability refers to the proportion of time during which a service remains operational under normal conditions. It is calculated as:
Availability = (Total Elapsed Time - Total Downtime) / Total Elapsed Time
This means that, for a given number of incidents, availability falls as MTTR rises: total downtime is the number of incidents multiplied by MTTR, so the longer each recovery takes, the less time the service spends in a fully operational and functional state.
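As a rough illustration of how MTTR feeds into availability, the sketch below applies the availability formula to two scenarios with the same number of incidents but different recovery times. The function name and the figures (a 30-day month, three incidents, MTTRs of one and four hours) are assumptions chosen for illustration.

```python
def availability(total_elapsed_hours, total_downtime_hours):
    """Availability = (total elapsed time - total downtime) / total elapsed time."""
    if total_elapsed_hours <= 0:
        raise ValueError("total elapsed time must be positive")
    if not 0 <= total_downtime_hours <= total_elapsed_hours:
        raise ValueError("downtime must be between 0 and the elapsed time")
    return (total_elapsed_hours - total_downtime_hours) / total_elapsed_hours

hours_in_month = 30 * 24  # 720 hours in an assumed 30-day month
incident_count = 3        # hypothetical number of incidents in that month

# Total downtime is the number of incidents multiplied by MTTR.
fast_recovery = availability(hours_in_month, incident_count * 1.0)  # MTTR of 1 hour
slow_recovery = availability(hours_in_month, incident_count * 4.0)  # MTTR of 4 hours

print(f"MTTR 1 h: {fast_recovery:.4%}  MTTR 4 h: {slow_recovery:.4%}")
```

With the incident count held constant, the scenario with the longer MTTR always yields the lower availability.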
Reliability, in turn, refers to the probability that the service maintains expected performance standards during its operational state.
Reliability can be used as an attribute of availability, describing how well the service performs during transient states of availability and outages as measured against predefined performance metrics. As a general rule, the reliability of a service decays over its lifecycle and the MTTR metric value increases.
MTTR can fluctuate substantially from component to component, as there are multiple factors influencing availability and reliability. The failure rate may be a constant term defined for an individual hardware or software component, but recovery to the original state of availability may depend on a variety of factors, including internal system dependencies and external factors such as availability of replacement products, tools and services.
When assessing this metric, keep in mind that not all MTTR is the same: the opportunity cost of downtime for ecommerce companies during a peak holiday season is significantly higher than an outage during off-peak seasons. In this context, various modular redundancies can reduce the MTTR to a minimum, creating imperceptible failure incidents.
(See how site reliability engineers improve system reliability.)
MTTR has a strong correlation with business performance. Here are just a few ways MTTR influences business operations and outcomes.
Unplanned outages have a significant impact on end-user experience. MTTR is particularly relevant for cloud-driven enterprises, as the opportunity cost of downtime is entirely dependent on how frequently outages occur and how long it takes to recover from an IT outage incident.
This means that user experience has an inverse correlation with MTTR: the more time it takes for your service to recover from an outage, the more negative the impact it will have on your end-user experience.
The longer it takes to repair or recover from an issue, the more downtime a business experiences. Downtime can lead to lost revenue, lost productivity, a degraded end-user experience and contractual penalties.
Faster MTTR reduces the duration of downtime and minimizes its negative financial impact.
Directly relating to downtime costs, a strong MTTR indicates that a business has efficient repair and recovery processes in place. This efficiency not only reduces downtime but also allows resources to be used more effectively, leading to improved overall operational efficiency.
In IT-intensive businesses, MTTR is just as critical for internal systems and services. Disrupted service in key tools can stop employees from being able to perform tasks efficiently, or sometimes entirely — resulting in loss of productivity, employee frustration and loss of revenue.
It’s not uncommon for businesses to have service level agreements (SLAs) with customers that specify MTTR targets, typically as a maximum acceptable recovery time. Businesses that fail to meet the agreed-upon MTTR may face penalties or claims of breach of contract.
Fighting for a great MTTR metric is never a “one-and-done” endeavor. Like most things in IT, it's a constant process requiring continuous iteration and attention.
Here are a few ways organizations tackle the ongoing process of maintaining strong MTTR.
If you want to fix an issue, you have to know what it is and where and when it occurred. An advanced IT monitoring solution will give you real-time, uninterrupted data to help you fully understand your system’s performance and give you all the data related to any fault or failure.
Because MTTR measures the capability of an organization to respond to an issue, alerting needs to be highly accurate and effective, as teams will need to be made aware of major issues as quickly as possible to minimize the business impact of an incident.
(Measure monitoring and alerting success with the MTTA metric.)
The first step to improving MTTR is to understand the incidents that cause it. Thorough root cause analysis of major incidents is key in minimizing MTTR. By understanding what caused a system or component failure, you can implement the appropriate safeguards, replacements, or fixes to prevent the same thing from happening again and again.
Organizations with a carefully planned incident response protocol are much more likely to respond quickly and effectively to issues and therefore have a lower MTTR. For many organizations, this likely includes an IT service management (ITSM) approach. Companies that have successfully undergone full digital transformation may take a more flexible approach, employing cross-functional collaboration tools and constructing specific responses — even explicit checklists — for each incident.
A great solution for many organizations, an automated incident management system can handle the process of sending alerts in multiple channels (phone calls, SMS texts, email, etc.) to all incident responders, reducing the time frame to notify people. The key to any plan, regardless, is to have a clear understanding of who to notify of an incident, how it should be documented and what steps should be taken to rectify it.
Past incidents aren’t just dips on your availability graph; they’re opportunities to learn and prepare for the future. By logging and documenting these incidents clearly, organizations can develop a quick reference guide in case similar issues arise in the future, ultimately resulting in better MTTR.
(Learn how to hold an incident review or postmortem.)
Just as you might introduce resilience into cloud-based systems to meet agreed SLA terms of service reliability and availability, redundancy is introduced to remove the potential impact of MTTR from a single network node.
Singular node components can be unreliable, but modular redundancy may be inexpensive at the individual component level.
When debating implementing modular redundancy, you should consider both MTTR and MTTF (Mean Time to Failure).
At the end of the day, we can define a highly dependable system as one that is optimized to maximize MTTF while reducing MTTR to a minimum, so that failures are rare and recovery is fast when they do occur.
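One common way to relate the two metrics is the steady-state approximation availability = MTTF / (MTTF + MTTR), under which a higher MTTF and a lower MTTR both raise availability. The Python sketch below uses that approximation, plus the standard parallel-redundancy estimate, to show what a second node might buy you. The function names and the MTTF and MTTR figures are hypothetical, and the redundancy estimate assumes that replicas fail independently and that any single working replica keeps the service up, which real systems only approximate.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Approximate long-run availability as MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)

def redundant_availability(single_availability, replicas):
    """Availability of N replicas in parallel, assuming independent failures.

    The service is down only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("at least one replica is required")
    return 1 - (1 - single_availability) ** replicas

# Hypothetical node: fails every 990 hours on average, takes 10 hours to recover.
single_node = steady_state_availability(mttf_hours=990, mttr_hours=10)
two_nodes = redundant_availability(single_node, replicas=2)

print(f"single node: {single_node:.4f}, two redundant nodes: {two_nodes:.6f}")
```

Under these assumptions a single node is available about 99% of the time, while two redundant nodes reach roughly 99.99% without any change to the MTTR of the individual node.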
Reducing MTTR is not only a constant process; it can also become increasingly difficult. As new threats emerge and systems become more complex, the cybersecurity landscape stays in constant flux and IT teams face a growing number of potential failure points in a system.
Unfortunately, that’s just the tip of the iceberg. Here are some of the key challenges to keep in mind when assessing MTTR.
One of the challenges surrounding cloud environments is the lack of visibility and control of the infrastructure operations. Without sufficient real-time monitoring data, it may not be possible to determine the true underlying root cause of IT outages — MTTR then becomes a function of complexity and dependencies within your IT environment.
To address these concerns, AI-enabled automation tools can extract relevant monitoring information at the process level while evaluating system performance and accounting for dependencies across the end-to-end multi-cloud environment.
You purchased a third-party tool for a reason. Whether that reason is increased functionality, scalability or a lack of internal personnel or resources, failures involving third-party tools can greatly impact MTTR. Because you’ll likely have to rely on external support teams to at least some degree and because you have significantly less visibility over the system or component, your MTTR will inevitably take a hit when a third-party component stops functioning.
(Learn about third-party risk management.)
Approaching detection and triage manually can inflate MTTR significantly. It’s important to incorporate automatic detection and response tools into a system to ensure incident resolution can happen as quickly as possible.
Every second spent resolving system failure is a second that your customers are impacted by the outage. Depending on the service or tool being provided, this can cause a lot of distress for those customers. Lapses in communication during an outage can result in frustration or dissatisfaction. Communication on outages and resolution time frames becomes increasingly difficult when root cause analysis takes longer than expected.
Because not all MTTR is the same, a failure in user communication on a high-impact outage may have rippling effects across the business.
The availability and reliability of IT services significantly influence end-user experience and overall business performance. When measured through MTTR, we can gather valuable insights into the dependability of services and the efficiency of incident resolution processes.
MTTR is a critical indicator for any organization offering a service, be it internally or externally — and overcoming the challenges discussed here can enhance operational efficiency, increase revenue and foster a satisfied customer and user base.