What comes to mind when you hear that an IT component has “five 9s availability”? Five 9s availability (99.999% or higher) is the peak metric for IT availability.
Five 9s predicts that a measured component — whether it is a server, communication line, app, service, or any other item — will be available at least 99.999% of the time during a specific period.
In this article, let’s get a deeper understanding of what IT availability metrics represent (including five 9s), how availability is calculated, how to gain confidence in availability statistics, and how to improve availability.
In IT, the term “availability” refers to the amount of time a device, service, or IT component is usable. Availability uses past component performance (Total Service Time and Downtime during the measurement period) to estimate and predict future performance.
Availability metrics are used by system designers, auditors, security personnel, vendors, SLA objectives, and other functions in order to:
Availability is commonly expressed as a percentage (0 to 100%), calculated as:
Availability (%) = ((Total Service Time - Downtime) / Total Service Time) x 100
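As a minimal sketch, the calculation above can be written as a small Python function. The function name and the sample figures (a 30-day period with 43.2 minutes of Downtime) are illustrative assumptions, not values from a real system.

```python
def availability_percent(total_service_time: float, downtime: float) -> float:
    """Return availability as a percentage for one measurement period.

    Both arguments must use the same unit (minutes in this example).
    """
    if total_service_time <= 0:
        raise ValueError("total_service_time must be positive")
    if not 0 <= downtime <= total_service_time:
        raise ValueError("downtime must be between 0 and total_service_time")
    return (total_service_time - downtime) / total_service_time * 100


# A 30-day period has 43,200 minutes; assume 43.2 minutes of Downtime.
MINUTES_PER_30_DAYS = 30 * 24 * 60
print(round(availability_percent(MINUTES_PER_30_DAYS, 43.2), 3))  # 99.9
```

The only requirement is that Total Service Time and Downtime are measured in the same unit over the same period.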
Small variations in availability percentages can lead to large variations in downtime. For example, 99.9% availability allows about 8.76 hours of Downtime a year, while 99.999% allows only about 5.26 minutes.
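To see how quickly the Downtime budget shrinks with each extra 9, the availability formula can be inverted. The sketch below assumes a 7-day week and a 365-day (non-leap) year; the function and constant names are illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (non-leap year)


def allowed_downtime_minutes(availability_pct: float, period_minutes: float) -> float:
    """Maximum Downtime (in minutes) that still meets the availability target."""
    return (1 - availability_pct / 100) * period_minutes


# Print the weekly and yearly Downtime budget for two 9s through five 9s.
for pct in (99.0, 99.9, 99.99, 99.999):
    per_week = allowed_downtime_minutes(pct, MINUTES_PER_WEEK)
    per_year = allowed_downtime_minutes(pct, MINUTES_PER_YEAR)
    print(f"{pct:>7}%  {per_week:8.2f} min/week  {per_year:10.2f} min/year")
```

Under these assumptions, three 9s (99.9%) leaves roughly 525.6 minutes (8.76 hours) of Downtime a year, while five 9s leaves only about 5.26 minutes.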
Five 9s is the gold standard and end goal for IT availability. When a component reaches five 9s availability, organizations can feel confident in its ability to function reliably under most conditions and to recover quickly when it fails. Conversely, components with lower availability metrics are assumed to be less dependable, more prone to failure, and more likely to benefit from upgrades that will enhance their capabilities.
Your availability calculations will only be as good as the data that goes into them. It can be challenging to find accurate data on outages and Downtime. Service Time and Downtime data can be gathered from several diverse sources, including:
Make sure to include all relevant data in your availability metric calculations.
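When the same incident is recorded by more than one source, simply adding up the durations double-counts Downtime. One possible way to combine records is to merge overlapping outage intervals before summing them. The sketch below assumes each source yields (start, end) pairs; the two source names and all timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def merged_downtime(outages: list[tuple[datetime, datetime]]) -> timedelta:
    """Sum outage durations, counting overlapping intervals only once."""
    total = timedelta()
    current_start = current_end = None
    for start, end in sorted(outages):
        if current_end is None or start > current_end:
            # Gap before this outage: close the previous merged interval.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps (or touches) the current interval: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical records; the first entry of each list is the same incident.
monitoring_tool = [(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30))]
help_desk_tickets = [
    (datetime(2024, 1, 5, 9, 10), datetime(2024, 1, 5, 9, 45)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 5)),
]
print(merged_downtime(monitoring_tool + help_desk_tickets))  # 0:50:00
```

A naive sum of the three records would report 70 minutes; merging the overlapping pair gives 50 minutes.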
Availability metrics are a valuable performance and reliability evaluation tool, but they can also lull you into a false sense of security regarding actual component reliability. To increase your confidence, take these items into account when making decisions based on availability metrics.
Choose a reasonable and relevant time period for calculating availability metrics. When pulling metrics, is it relevant to look at data for the last year or the last month? How often should you recalculate availability? Are there some historical events that should not be included in your metrics?
Check your time range to ensure current or one-time data does not inflate or suppress metric values.
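The effect of the time range can be illustrated by computing availability for the same outage data over two windows. This is a sketch under stated assumptions: the single 4-hour outage, the dates, and the function name are all hypothetical.

```python
from datetime import datetime, timedelta


def availability_for_window(outages, window_start, window_end):
    """Availability (%) over the window from window_start to window_end.

    `outages` is a list of non-overlapping (start, end) datetime pairs;
    each outage is clipped to the window before being counted.
    """
    total = (window_end - window_start).total_seconds()
    down = 0.0
    for start, end in outages:
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += (clipped_end - clipped_start).total_seconds()
    return (total - down) / total * 100


# Hypothetical data: one 4-hour outage on 15 June 2024.
outages = [(datetime(2024, 6, 15, 2, 0), datetime(2024, 6, 15, 6, 0))]
now = datetime(2024, 7, 1)

last_30_days = availability_for_window(outages, now - timedelta(days=30), now)
last_365_days = availability_for_window(outages, now - timedelta(days=365), now)
print(f"30 days: {last_30_days:.3f}%   365 days: {last_365_days:.3f}%")
```

With this data, the 30-day figure is about 99.44% while the 365-day figure is about 99.95%. The same one-time event suppresses the short-window metric far more than the long-window one.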
It can also be difficult to determine whether an outage qualifies as Downtime.
Review the methodology used for determining and collecting Downtime data to prevent including false positives in your availability metrics.
Five 9s and other availability metrics are necessarily based on past performance. Future availability performance can be affected by many things that may not be present in historical Service Time and Downtime data, including:
After an unanticipated outage occurs, evaluate whether Downtime data from that event should be considered in future availability metrics. These events may also point to additional system improvements that can be implemented for disaster recovery and high availability processing (DR/HA).
Also be aware of the watermelon effect on component performance. Let’s say a production server has 99.9% availability (10.08 Downtime minutes a week or 8.76 hours a year).
But if those minutes come during peak usage periods, when your websites and infrastructure are being hit repeatedly, those outages will affect your business more than if the same outage happened at 3:00 AM on a Sunday.
Like a watermelon, your systems may look green (all clear) on the outside but turn red (fail) on the inside, particularly when stressed. The watermelon effect can hide capacity issues affecting availability, especially when the system experiences high volumes.
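One way to make the watermelon effect visible is to compare time-based availability with a figure weighted by traffic. The request-weighted metric below is an illustrative assumption, not a calculation defined in this article, and the traffic numbers are hypothetical.

```python
def time_based_availability(hours):
    """Classic availability (%): every hour counts the same."""
    total = len(hours) * 60
    down = sum(h["down_minutes"] for h in hours)
    return (total - down) / total * 100


def request_weighted_availability(hours):
    """Share (%) of requests that arrived while the system was up.

    Assumes requests are spread evenly within each hour.
    """
    total_requests = sum(h["requests"] for h in hours)
    failed = sum(h["requests"] * h["down_minutes"] / 60 for h in hours)
    return (total_requests - failed) / total_requests * 100


# Hypothetical week: 167 quiet hours plus one peak hour with a 10-minute outage.
quiet = [{"requests": 1_000, "down_minutes": 0} for _ in range(167)]
peak = [{"requests": 100_000, "down_minutes": 10}]
week = quiet + peak

print(f"time-based:       {time_based_availability(week):.3f}%")
print(f"request-weighted: {request_weighted_availability(week):.3f}%")
```

In this example the time-based figure stays at about 99.90% (green on the outside), while only about 93.76% of requests arrived while the system was up (red on the inside).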
IT availability metrics are a simple, valuable tool for analyzing and documenting IT component performance. Correctly defined and calculated, they allow you to measure how well infrastructure components are doing against expectations and to determine whether system upgrades have improved component performance.
Enterprises should strive for five 9s availability for all critical IT components, to ensure each component can function reliably under most conditions and recover quickly after a failure.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.