From website reliability to data analytics and cloud infrastructure, modern IT services must perform reliably to meet user expectations. Achieving this requires clear performance targets that balance functionality, cost, and reliability.
Service level objectives (SLOs) play a key role in defining the expected performance of a service through quantifiable metrics.
In this article, we explore the concept of SLOs, their role in ensuring service dependability, and how they fit within larger agreements like service level agreements (SLAs).
Service level objectives (SLOs) are the measurable performance targets of a service, typically outlined in a contract for a third-party service.
The performance metrics are based on the dependability goals of a service. A technology service is considered dependable if users can rely on it to deliver the expected functionality over a given period of time.
SLOs serve as a formalized, measurable framework that helps define and communicate the service's overall:
SLOs are integral to maintaining a balanced relationship between service providers and users, as they set concrete performance targets, often described in contracts or service level agreements (SLAs).
Some of the key metrics governing the dependability of a service include mean time to failure (MTTF) and mean time to repair (MTTR). MTTF represents the average duration of correct operation for a service, while MTTR refers to the average time to recover from a failure incident.
These metrics in turn determine service performance metrics such as availability. For example, six 9s availability means the service is available 99.9999% of the time in a year. In other words, the expected downtime is about 31.56 seconds per year.
These numbers are outlined in a larger service level agreement that describes the legal aspects of service expectations, service deliveries and the commitments therein.
(Related reading: reliability metrics.)
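The arithmetic behind these figures can be checked with a short sketch. It assumes the standard steady-state relationship, availability = MTTF / (MTTF + MTTR), which this article does not spell out, and a 365.25-day year, which is what reproduces the 31.56-second figure. The function names and the sample MTTF and MTTR values are illustrative only.

```python
# Assumption: a 365.25-day year, which yields the 31.56 s figure quoted above.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the share of time the service operates correctly."""
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_seconds(availability_target: float) -> float:
    """Downtime per year permitted by an availability target such as 0.999999."""
    return (1.0 - availability_target) * SECONDS_PER_YEAR


# Compare a looser target with the six 9s target discussed in the text.
for label, target in [("three 9s", 0.999), ("six 9s", 0.999999)]:
    print(f"{label}: {annual_downtime_seconds(target):.2f} s/year")

# Hypothetical service: fails on average every 1,000 hours, recovers in 1 hour.
print(f"availability: {availability(mttf_hours=1000.0, mttr_hours=1.0):.6f}")
```

Under these assumptions the six 9s target allows roughly 31.56 seconds of downtime per year, while the hypothetical service with a one-hour MTTR falls well short of it. This is one way to see why MTTR objectives matter for high availability targets.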
So how is an SLO different from an SLA? Consider the SLO to be an individual clause within the SLA that quantifies a target for a specific metric. For example, an SLA may commit to 99.9999% availability.
To achieve this, the SLA may include objectives related to the MTTR. For example, if a cloud instance fails and traffic must be rerouted dynamically to a different instance, the failover may not take more than a few minutes. During this time, the web app may slow down for only a fraction of the user base.
This slowdown may translate into a total downtime impact of less than 5 seconds for any single incident, or about one-sixth of the annual downtime allowed under the six 9s (99.9999% availability) SLA.
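As a rough illustration of how a failover lasting minutes can count as only seconds of downtime, the sketch below weights the incident duration by the fraction of users affected. The linear weighting, the 3-minute duration, and the 2.5% affected fraction are assumptions chosen for illustration; an actual SLA would define its own method for counting downtime.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: 365.25-day year


def downtime_budget_seconds(availability_target: float) -> float:
    """Annual downtime allowed by the availability target."""
    return (1.0 - availability_target) * SECONDS_PER_YEAR


def weighted_downtime_seconds(incident_seconds: float, affected_fraction: float) -> float:
    """Downtime impact when only part of the user base is affected.

    Assumes impact scales linearly with the share of affected users.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    return incident_seconds * affected_fraction


budget = downtime_budget_seconds(0.999999)

# Hypothetical incident: a 3-minute failover that degrades service for 2.5% of users.
impact = weighted_downtime_seconds(incident_seconds=180.0, affected_fraction=0.025)
consumed = impact / budget

print(f"budget: {budget:.2f} s/year, impact: {impact:.1f} s, consumed: {consumed:.1%}")
```

With these hypothetical numbers the incident counts as under 5 seconds of downtime and uses up roughly one-seventh of the annual budget, in the same range as the example above.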
A service level objective will define these performance expectations in terms of quantifiable metrics. The goal of the SLO is to optimize a tradeoff between:
By clearly outlining these tradeoffs with quantifiable metrics, your DevOps teams and site reliability engineers (SREs) can manage infrastructure operations to meet these guidelines.
But as described in the example above, the key challenge is to interpret and translate the SLO into meaningful metrics. How can you find functional relationships between individual metrics and downtime impact?
A slow MTTR on a cloud instance that overburdens a network may have negligible impact if the ISP and web service provider have strong network routing and web cache services in place.
Conversely, a fast MTTR says little about downtime impact if network resource allocation is not optimized or is highly sensitive to any fault. Add to these challenges the overall network complexity and external factors that are highly interdependent but beyond the control of both the service provider and the service user.
To resolve these challenges, the SLO breaks down the multivariate problem of service level performance into objective and actionable guidelines. This allows DevOps and engineering teams to have some control over the system performance and eliminate uncertainties. This is especially relevant to highly sensitive parameters that are difficult to measure, track and govern.
Measuring these metrics, developing a service expectation based on the measurements, and outlining that expectation as well-defined service level objectives is a useful starting point. This applies even more in the cloud industry, where internal DevOps and IT teams have limited visibility into, and control over, the infrastructure operations of their cloud providers.
When performance metrics are specified as SLOs within the SLA, the responsibility to meet the SLA terms rests on the cloud vendors. From a business perspective, all that service users need to understand is how to align service performance and availability goals with their expectations of:
In this sense, individual SLOs are more relevant to the service provider than the service user. Once the SLA is in place and includes the desired SLOs, as a service user you can focus on the real performance numbers, also called service level indicators (SLIs).
An SLI is the actual measured performance of a metric, evaluated against its SLO. The goal of the service provider is then to close any gap between the SLOs and their corresponding SLIs as measured by the service user.
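The relationship between an SLO and its SLI can be sketched as a simple comparison. The example below assumes a ratio-style SLI (good events divided by total events), which is one common way to measure it but not the only one. The `Slo` class, the function names, and the request counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed


def measure_sli(good_events: int, total_events: int) -> float:
    """SLI as the ratio of good events to all events in the measurement window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def slo_gap(slo: Slo, sli: float) -> float:
    """Positive value: shortfall the provider must close. Zero or negative: SLO met."""
    return slo.target - sli


# Hypothetical measurement window: 1,000,000 requests, 997,500 of them successful.
slo = Slo(name="request success rate", target=0.999)
sli = measure_sli(good_events=997_500, total_events=1_000_000)
gap = slo_gap(slo, sli)

print(f"{slo.name}: SLI={sli:.4%}, target={slo.target:.4%}, met={gap <= 0}")
```

In this hypothetical window the measured SLI falls below the target, so the gap is positive and the objective is not met. A service user could run a check like this against their own measurements to see whether the provider is meeting the agreed SLO.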
(Related reading: SLA vs. SLI vs. SLO: Understanding Service Levels.)
Almost every technology-driven business organization must rely on third-party services. The quality and performance of these services can determine the business value of their technology investments.
The remaining challenge is to identify the most relevant and impactful metrics and performance expectations that align with your business goals and limitations.