As IT environments evolve in complexity as a result of never-ending initiatives triggered by changing customer and stakeholder needs, ensuring that IT systems always work reliably is a herculean undertaking.
Indeed, a whopping 82% of respondents from large enterprises believe that IT complexity is actually hindering the success promised by digital transformation. (Those promises include productivity, innovation, collaboration, security, and customer satisfaction.)
To ensure that IT delivers reliable systems that the business can depend on in all seasons and scenarios, we must know the most appropriate metrics that effectively track the expected availability and performance.
So, in this article, we will look at the main metrics that organizations should focus on to support reliability requirements for uptime and performance.
Reliability is defined by NIST as:
The ability of a system or component to function without failure under stated conditions for a specified period of time.
Reliability should be designed and baked into an IT system from the start. This ensures that users' high expectations for uptime, and their low tolerance for disruption, are met.
These metrics will help any organization deliver the uptime and performance that is required. For the flip side of reliability metrics, you can explore common failure metrics for IT systems.
An IT service or infrastructure is deemed reliable when there is a low frequency of outages. Mean Time Between Failures (MTBF) is an availability metric that tracks how often something fails.
Consider the example of a ride-hailing mobile app: what's the average amount of time that passes from one issue (such as being unable to request a ride) to the next (being unable to generate a bill)? The outages can be similar or very different from a root-cause perspective, but what matters is the level of stability experienced over time.
MTBF can be combined with other measures, such as MTRS (Mean Time to Restore Service), to give a better picture of service reliability. A high MTBF coupled with a low MTRS is an essential ingredient in designing a highly available service.
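As a minimal sketch, the relationship between MTBF, MTRS, and steady-state availability can be computed directly. The incident counts and hours below are purely illustrative:

```python
# Sketch: estimating MTBF, MTRS, and availability from hypothetical
# incident records. All numbers are illustrative, not real data.
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time divided by failure count."""
    return total_operating_hours / failure_count

def mtrs_hours(total_downtime_hours: float, failure_count: int) -> float:
    """Mean Time to Restore Service: total downtime divided by failure count."""
    return total_downtime_hours / failure_count

def availability(mtbf: float, mtrs: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTRS)."""
    return mtbf / (mtbf + mtrs)

# Example: 3 outages over a 720-hour month, with 6 hours of downtime total.
up = mtbf_hours(714.0, 3)    # 238.0 hours of operation between failures
down = mtrs_hours(6.0, 3)    # 2.0 hours to restore service, on average
print(f"Availability: {availability(up, down):.3%}")  # -> Availability: 99.167%
```

The same high availability can be reached by lengthening MTBF (fewer failures) or shortening MTRS (faster recovery), which is why the two metrics are best read together.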
Over the life of an IT component, failures are bound to happen; no system is perfect. As components age, repairs are implemented, bugs are fixed, spares are replaced, and other recovery activities are carried out, these actions may inadvertently affect how frequently future failures occur.
The Rate Of Occurrence Of Failures (ROCOF) is a reliability metric that measures the frequency of failures for repairable systems. Due to multiple differing factors that influence outages and repair effects, the ROCOF may be unique to an individual system.
ROCOF can be computed as the number of observed failures divided by the total operating time over which they occurred:

ROCOF = (number of failures) / (total operating time)
This metric can reveal a trend in how frequently failures are likely to happen, especially after warranty periods elapse, major repairs are carried out, or a system has undergone a significant number of maintenance actions. Organizations can use ROCOF data to plan maintenance, allocate spares, and anticipate when reliability is likely to degrade.
(Related reading: predictive maintenance.)
Once sufficient data on the component performance and past failures has been collected and analyzed, it is possible to forecast the chances of a failure when an IT system is put under load.
The metric probability of failure on demand (PFD/PFOD/POFOD) is defined as the probability that a system will fail to perform a specified function on demand, i.e., when challenged or needed.
This metric is mainly applied to single use systems — such as vehicle airbags or missiles — but may also be relevant for IT systems that have fixed capacity or are non-repairable.
Peak periods are a critical indicator of whether an IT system is reliable, since that is when demand is most likely to push the system toward saturation.
By measuring PFD, IT functions are better positioned to predict whether IT systems can handle demand effectively and avoid saturation.
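A simple sketch of estimating PFD from demand records; the counts below are made up for illustration:

```python
# Sketch: probability of failure on demand, estimated as the fraction of
# demands on which the system failed to perform. Numbers are hypothetical.
def pfd(failed_demands: int, total_demands: int) -> float:
    """PFD = failures observed when the system was called upon / total demands."""
    return failed_demands / total_demands

# Hypothetical: the system failed on 4 of 10,000 peak-period demands.
print(pfd(4, 10_000))  # -> 0.0004
```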
Another reliability metric is error rate, defined as the rate of requests that are failing. This service level indicator is one of the four golden signals of Site Reliability Engineering (SRE): latency, traffic, errors, and saturation.
Errors are a critical indicator of IT health, as they can point to underlying issues such as software bugs or hardware failure. Examples of errors include failed HTTP requests (such as 500-series responses), timeouts, and incorrect or corrupted responses.
By measuring the occurrence of errors, IT teams can get a grasp on underlying issues and address them before they snowball into a major outage.
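As a sketch, an error rate can be derived from a sample of response codes. The status list below is hypothetical, and only 5xx responses are counted as failures:

```python
# Sketch: deriving an error rate from hypothetical HTTP status codes.
# Only server-side (5xx) responses are treated as errors here; a real
# definition of "failed request" depends on the service's SLI.
statuses = [200, 200, 500, 200, 503, 200, 200, 404, 200, 200]

errors = sum(1 for s in statuses if s >= 500)  # count 5xx responses
rate = errors / len(statuses)                  # fraction of failing requests
print(f"Error rate: {rate:.1%}")  # -> Error rate: 20.0%
```

Note the design choice embedded in the filter: a 404 may be a client mistake rather than a service failure, so it is excluded from this particular error rate.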
In SRE, the error budget is the metric used to track error rate and forms a control mechanism for diverting attention from innovation to stability when required. This can be thought of as a pain tolerance for users applied to any service dimension.
An error budget is computed as 1 minus the SLO (service level objective, such as availability) of the service. For example, a service with a 99.9% SLO has a 0.1% error budget, which equates to 1,000 errors allowed per 1 million requests over a specified time period.
(Related reading: SLOs vs. SLIs: what’s the difference?)
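The error-budget arithmetic can be sketched as follows; the 99.9% SLO and 1 million requests mirror the example above:

```python
# Sketch: error budget = 1 - SLO, applied to a request volume.
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail over the SLO window."""
    return 1.0 - slo

def allowed_errors(slo: float, total_requests: int) -> int:
    """How many failed requests the budget permits over the period."""
    # round() guards against floating-point dust in 1.0 - slo
    return round(error_budget(slo) * total_requests)

# A 99.9% SLO leaves a 0.1% budget: 1,000 errors per 1 million requests.
print(allowed_errors(0.999, 1_000_000))  # -> 1000
```

When the measured error count approaches this budget, SRE practice is to shift engineering effort from shipping features to restoring stability.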
Measuring reliability for complex IT systems is a challenging task. IT organizations need to invest in the right tools that can gather and digest copious amounts of data to generate insights on IT system stability and potential for failure.
But throwing money at this issue without a plan carries significant risk. The enterprise should focus on measuring what matters most, and organize itself to respond and act effectively on the reliability metrics those tools produce.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.