In IT and incident resolution, Mean Time to Detect (MTTD) is the average time it takes your teams and systems to detect a fault. One measure of system reliability, MTTD describes the capacity of a system environment or organization to detect fault incidents.
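As a quick illustration of how the metric is computed, here is a minimal sketch in Python that averages the gap between when each fault began and when it was detected. The incident records are hypothetical:

```python
from datetime import datetime

# Hypothetical incident records: when each fault began vs. when it was detected
incidents = [
    {"started": datetime(2024, 3, 1, 9, 0),   "detected": datetime(2024, 3, 1, 9, 25)},
    {"started": datetime(2024, 3, 4, 14, 10), "detected": datetime(2024, 3, 4, 14, 40)},
    {"started": datetime(2024, 3, 9, 22, 5),  "detected": datetime(2024, 3, 9, 23, 5)},
]

# MTTD = average of (detection time - fault start time) across incidents
minutes_to_detect = [(i["detected"] - i["started"]).total_seconds() / 60 for i in incidents]
mttd = sum(minutes_to_detect) / len(minutes_to_detect)
print(f"MTTD: {mttd:.1f} minutes")  # (25 + 30 + 60) / 3 = ~38.3 minutes
```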
A low MTTD means that failures are discovered as quickly as possible, which is good news! However, achieving low MTTD isn’t easy: it requires exhaustive visibility into system performance and network operations.
That’s not easy to achieve in today’s world, where IT software and apps, manufacturing equipment, and all sorts of systems are distributed and complex.
So, how do you do it? We’ll cover all that and more in this in-depth article.
Observability and monitoring tools continuously analyze performance metrics to identify component failures that might otherwise go under the radar. And these failures can hurt: downtime, lost customers, loss of critical functionality.
This is especially true for complex enterprise IT environments designed for high availability: undiscovered IT assets and application workloads directly impact the health of the overall IT network.
Here’s a very common example: take any IT asset that is not observed and monitored in real time. If that asset fails, even partially, the failure is very likely to be overlooked. When a fault does occur, the underlying root cause may remain undiscovered for days, weeks, or longer, until an extensive audit is conducted.
(Related reading: root cause analysis explained & what are five-9s?)
Mean Time to Detect has important applications in reliability engineering for a variety of technology functions, especially in:
The metric alone is certainly useful, yet it is more powerful when you look at it in aggregate, across an entire function or even the whole organization. That’s because MTTD closely describes the capacity of an organization and its monitoring tools to identify a fault. In essence, detection depends on these external factors, not on the quality of the product itself.
Therefore, we can say: MTTD is not an attribute of the system itself, but an attribute of its implementation, operating environment, users, and engineering teams responsible for monitoring and maintenance.
Although MTTD refers to the average time it takes to detect a fault incident, it does not guarantee that a given fault will be detected at, or within, the MTTD duration. And given the complex nature of modern technology, the time to detect the same failure on the same component can vary significantly over time, due to external factors such as the behavior of dependent systems within the IT environment.
For example, network traffic trends are often unpredictable. During a peak holiday season, you may be expecting high traffic to your ecommerce store. At the same time, a DDoS cyberattack incident may be directed toward your servers, introducing fault incidents. Anticipating high traffic due to the holiday shopping season, your teams may program the network load balancer to scale compute resources in your private cloud data centers from a different region. Even with that preparation, it may take time before you can:
This is an example of a unique circumstance that can prevent an organization from detecting a fault. The underlying cause of the entire incident is also external, unpredictable, and uncontrollable.
These characteristics make MTTD interesting in the sense that IT infrastructure and operations teams always have more to do: observability, monitoring, cybersecurity, network administration, and many other IT functions have a role to play in reliability engineering for their IT networks.
So how can you reduce your Mean Time to Detect? Let’s look at a few angles and strategies that can help reduce MTTD — and therefore minimize the overall time it takes to repair a fault in the system:
Fault detection in complex enterprise IT networks is a data-driven problem. Data must be captured continuously and in real time from all network nodes. By collecting more information in real time, you can better understand the correlations between the parameters of dependent technology components.
(Related reading: IT and systems monitoring, explained.)
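As a rough sketch of what that correlation analysis can look like, the snippet below lines up metrics collected from a few dependent components and measures how strongly they move together. The metric names and values are hypothetical; in practice, the series would come from your monitoring or observability pipeline.

```python
import pandas as pd

# Hypothetical per-minute metrics collected from dependent components.
# In practice, these series would be streamed from your monitoring pipeline.
metrics = pd.DataFrame({
    "db_query_latency_ms":  [12, 14, 13, 45, 60, 58, 15, 13],
    "api_response_time_ms": [80, 85, 82, 210, 260, 240, 90, 84],
    "cache_hit_ratio":      [0.96, 0.95, 0.96, 0.61, 0.55, 0.58, 0.94, 0.95],
})

# A pairwise correlation matrix highlights which components tend to degrade
# together, which helps narrow down where a fault is likely to surface first.
print(metrics.corr().round(2))
```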
Discover IT assets that operate in an ephemeral state. Understand how load balancers dynamically allocate IT workloads to servers in different locations. The performance of your system is dependent on:
Changes in these parameters can directly impact how your systems behave. Therefore, high visibility into system behavior is required to understand whether the underlying cause is an internal system fault or an external factor that affects network behavior.
(Related reading: what is observability?)
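One lightweight way to make that internal-versus-external distinction is to compare an internal health signal against an external driver such as inbound traffic. Below is a minimal sketch using hypothetical per-minute series and a simple heuristic: if errors and traffic spike together, external load is the more likely culprit; if errors spike while traffic stays flat, suspect an internal fault.

```python
import statistics

# Hypothetical per-minute series from your monitoring stack
request_rate = [980, 1010, 995, 1005, 990, 1000, 1020, 1800]  # requests per minute
error_rate   = [0.2, 0.3, 0.2, 0.3, 0.2, 0.3, 0.2, 4.5]       # % of requests failing

def spiked(series, factor=2.0):
    """Flag the latest point if it exceeds the historical mean by `factor`."""
    history = series[:-1]
    return series[-1] > factor * statistics.mean(history)

if spiked(error_rate):
    if spiked(request_rate, factor=1.5):
        print("Errors rising with traffic: likely external load, check capacity and scaling")
    else:
        print("Errors rising while traffic is flat: likely an internal fault, page the owning team")
else:
    print("No error anomaly detected")
```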
Splunk is proud to be recognized as a Leader in Observability and Application Performance Monitoring by Gartner®. View the Gartner® Magic Quadrant™ to find out why. Get the report →
Learn more about Splunk's Observability products & solutions:
Infrastructure operations teams are often overwhelmed by the volume of log data generated in large and complex IT networks.
Instead of relying on fixed metric thresholds, which can flood teams with false positives or miss faults entirely, look for patterns in log data metrics. Identify anomalies in these patterns and correlate the data trends with system behavior at the component level.
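Here is a minimal sketch of that idea, assuming you have already turned log data into a per-interval metric (in this case, a hypothetical count of error-level log lines): rather than a single fixed threshold, flag points that drift far from a recent rolling baseline.

```python
import pandas as pd

# Hypothetical per-minute count of error-level log lines for one component
errors = pd.Series([3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 18, 3, 4, 2, 3])

# Rolling baseline instead of a fixed threshold: flag points that sit far
# outside the recent mean, measured in standard deviations (z-score).
window = 5
baseline = errors.rolling(window).mean().shift(1)
spread = errors.rolling(window).std().shift(1)
z_score = (errors - baseline) / spread

anomalies = errors[z_score > 3]
print(anomalies)  # flags the spike at index 10 without any hand-tuned threshold
```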
An exhaustive incident management plan is crucial to reduce detection times. Don’t miss blind spots — an important part of the strategy is to develop a monitoring plan for both:
Finally, know that system resilience requires visibility into compute processes and network operations. You may not have access to all the relevant metrics, especially in third-party SaaS services, but external indicators can act as useful starting points.
For example, monitor how user experience and network traffic flows change in response to system anomalies. You may not have access to the metrics of a failed subsystem in a public cloud network, but you can program your load balancer and network routing solutions to direct traffic to alternate servers.
This preventive measure may not suffice to identify the underlying root cause of the incident, but it keeps the impact from reaching end users. In this case, your services continue operating normally despite the fault.
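To make the pattern concrete, below is a minimal, hypothetical sketch: a periodic health check against a primary endpoint, with traffic redirected to a standby when the check fails. The endpoints and switchover logic are placeholders; real deployments would normally rely on the health-check and failover features built into the load balancer or DNS layer itself.

```python
import time
import urllib.request

# Hypothetical endpoints; in practice these map to server pools in your load balancer
PRIMARY = "https://primary.example.com/healthz"
STANDBY = "https://standby.example.com/healthz"

def is_healthy(url, timeout=2):
    """Return True if the endpoint answers its health check in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

active = PRIMARY
while True:
    if active == PRIMARY and not is_healthy(PRIMARY):
        # Fail over so users keep being served while the primary's
        # root cause is still under investigation.
        active = STANDBY
        print("Primary unhealthy: routing traffic to standby")
    elif active == STANDBY and is_healthy(PRIMARY):
        active = PRIMARY
        print("Primary recovered: routing traffic back")
    time.sleep(30)
```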
MTTD is one measure of system reliability. Other areas to consider:
Here at Splunk, we use our own monitoring, observability, and cybersecurity solutions to power our 24/7 SOC. See how we achieve a 7-minute mean time to detect phishing attacks.
Already use Splunk? Learn how to customize your environment to achieve the lowest MTTD in this hands-on Tech Talk.