As IT environments evolve in complexity as a result of never-ending initiatives triggered by changing customer and stakeholder needs, ensuring that IT systems always work reliably is a herculean undertaking.
Indeed, a whopping 82% of respondents from large enterprises believe that IT complexity is actually hindering the success promised by digital transformation. (Those promises include productivity, innovation, collaboration, security, and customer satisfaction.)
To ensure that IT delivers reliable systems that the business can depend on in all seasons and scenarios, we must know the most appropriate metrics that effectively track the expected availability and performance.
So, in this article, we will look at the main metrics that organizations should focus on to support reliability requirements for uptime and performance.
Reliability is defined by NIST as:
The ability of a system or component to function without failure under stated conditions for a specified period of time.
Reliability should be designed and baked into an IT system from the start. This ensures that users' high expectations for uptime, and their low tolerance for disruption, are met.
These metrics will help any organization deliver the uptime and performance that is required. For the flip side of reliability metrics, you can explore common failure metrics for IT systems.
An IT service or infrastructure is deemed reliable when there is a low frequency of outages. Mean Time Between Failures (MTBF) is an availability metric that tracks how often something fails.
Consider the example of a ride-hailing mobile app: what's the average amount of time that passes from one issue (such as being unable to request a ride) to the next (being unable to generate a bill)? The outages can be similar or very different from a root-cause perspective, but what matters is the level of stability experienced over time.
MTBF can be combined with other measures, such as MTRS (Mean Time to Restore Service), to give a better picture of service reliability. A high MTBF coupled with a low MTRS is an essential ingredient in designing a highly available service.
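As a minimal sketch, the relationship between MTBF, MTRS, and steady-state availability can be computed directly. The incident counts and hours below are purely illustrative:

```python
# Sketch: estimating MTBF, MTRS, and availability from hypothetical
# incident records. All numbers are illustrative, not real data.
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time divided by failure count."""
    return total_operating_hours / failure_count

def mtrs_hours(total_downtime_hours: float, failure_count: int) -> float:
    """Mean Time to Restore Service: total downtime divided by failure count."""
    return total_downtime_hours / failure_count

def availability(mtbf: float, mtrs: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTRS)."""
    return mtbf / (mtbf + mtrs)

# Example: 3 outages over a 720-hour month, with 6 hours of downtime total.
up = mtbf_hours(714.0, 3)    # 238.0 hours of operation between failures
down = mtrs_hours(6.0, 3)    # 2.0 hours to restore service, on average
print(f"Availability: {availability(up, down):.3%}")  # -> Availability: 99.167%
```

The same high availability can be reached by lengthening MTBF (fewer failures) or shortening MTRS (faster recovery), which is why the two metrics are best read together.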
Over the life of an IT component, failures are bound to happen; no system is perfect. As components age, repairs are implemented, bugs are fixed, spares are replaced, and other recovery activities are carried out, these actions may inadvertently affect how frequently future failures occur.
The Rate Of Occurrence Of Failures (ROCOF) is a reliability metric that measures the frequency of failures for repairable systems. Due to multiple differing factors that influence outages and repair effects, the ROCOF may be unique to an individual system.
ROCOF can be computed as the number of observed failures divided by the total operating time over which they occurred:

ROCOF = (number of failures) / (total operating time)
This metric can reveal a trend in how frequently failures are likely to happen, especially after warranty periods elapse, major repairs are carried out, or a system has undergone a significant number of maintenance actions. Organizations can use ROCOF data to plan maintenance, allocate spares, and anticipate when reliability is likely to degrade.
(Related reading: predictive maintenance.)
Once sufficient data on the component performance and past failures has been collected and analyzed, it is possible to forecast the chances of a failure when an IT system is put under load.
The metric probability of failure on demand (PFD/PFOD/POFOD) is defined as the probability that a system will fail to perform a specified function on demand, i.e., when challenged or needed.
This metric is mainly applied to single use systems — such as vehicle airbags or missiles — but may also be relevant for IT systems that have fixed capacity or are non-repairable.
Peak periods are a critical indicator of whether an IT system is reliable, since that is when demand is most likely to push the system toward saturation.
By measuring PFD, IT functions are better positioned to predict whether IT systems can handle demand effectively and avoid saturation.
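A simple sketch of estimating PFD from demand records; the counts below are made up for illustration:

```python
# Sketch: probability of failure on demand, estimated as the fraction of
# demands on which the system failed to perform. Numbers are hypothetical.
def pfd(failed_demands: int, total_demands: int) -> float:
    """PFD = failures observed when the system was called upon / total demands."""
    return failed_demands / total_demands

# Hypothetical: the system failed on 4 of 10,000 peak-period demands.
print(pfd(4, 10_000))  # -> 0.0004
```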
Another reliability metric is error rate, defined as the rate of requests that are failing. This service level indicator is one of the four golden signals of Site Reliability Engineering (SRE): latency, traffic, errors, and saturation.
Errors are a critical indicator of IT health, as they can point to underlying issues such as software bugs or hardware failure. Examples of errors include failed HTTP requests (such as 500-series responses), timeouts, and incorrect or corrupted responses.
By measuring the occurrence of errors, IT teams can get a grasp on underlying issues and address them before they snowball into a major outage.
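As a sketch, an error rate can be derived from a sample of response codes. The status list below is hypothetical, and only 5xx responses are counted as failures:

```python
# Sketch: deriving an error rate from hypothetical HTTP status codes.
# Only server-side (5xx) responses are treated as errors here; a real
# definition of "failed request" depends on the service's SLI.
statuses = [200, 200, 500, 200, 503, 200, 200, 404, 200, 200]

errors = sum(1 for s in statuses if s >= 500)  # count 5xx responses
rate = errors / len(statuses)                  # fraction of failing requests
print(f"Error rate: {rate:.1%}")  # -> Error rate: 20.0%
```

Note the design choice embedded in the filter: a 404 may be a client mistake rather than a service failure, so it is excluded from this particular error rate.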
In SRE, the error budget is the metric used to track error rate and forms a control mechanism for diverting attention from innovation to stability when required. This can be thought of as a pain tolerance for users applied to any service dimension.
An error budget is computed as 1 minus the SLO (service level objective, such as availability) of the service. For example, a service with a 99.9% SLO has a 0.1% error budget, which equates to 1,000 errors allowed per 1 million requests over a specified time period.
(Related reading: SLOs vs. SLIs: what’s the difference?)
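The error-budget arithmetic can be sketched as follows; the 99.9% SLO and 1 million requests mirror the example above:

```python
# Sketch: error budget = 1 - SLO, applied to a request volume.
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail over the SLO window."""
    return 1.0 - slo

def allowed_errors(slo: float, total_requests: int) -> int:
    """How many failed requests the budget permits over the period."""
    # round() guards against floating-point dust in 1.0 - slo
    return round(error_budget(slo) * total_requests)

# A 99.9% SLO leaves a 0.1% budget: 1,000 errors per 1 million requests.
print(allowed_errors(0.999, 1_000_000))  # -> 1000
```

When the measured error count approaches this budget, SRE practice is to shift engineering effort from shipping features to restoring stability.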
Measuring reliability for complex IT systems is a challenging task. IT organizations need to invest in the right tools that can gather and digest copious amounts of data to generate insights on IT system stability and potential for failure.
But throwing money at this issue without a plan carries significant risk. The enterprise should focus on measuring what matters most, and organize itself to respond and act effectively on the reliability metrics those tools produce.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.