How are you tracking the long-term operation and health indicators for your micro and macro services? Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are prized (but sometimes "aspirational") metrics for DevOps teams and ITOps analysts. Today we'll see how we can leverage SignalFlow to put together SLO Error Budget tracking (or easily spin it up with Terraform)!
Depending on your organization, SLOs may take many forms. If you've read the Google SRE handbook (and who hasn't?), you'll already be familiar with SLOs and SLIs. But if you haven't, try this quote on for size:
"SREs’ core responsibilities aren’t merely to automate “all the things” and hold the pager. Their day-to-day tasks and projects are driven by SLOs: ensuring that SLOs are defended in the short term and that they can be maintained in the medium to long term. One could even claim that without SLOs, there is no need for SREs." - Google SRE Handbook Chapter 2
As stated, SLx (SLO/SLI/SLA) is most often concerned with trends over time. Terms such as "Error Minutes", "Monthly Budget", "Availability per quarter", and the like are common in discussions of SLx. So how do we use our charts, which track trends over time, and our alerts, which notify us of events in time, to create an SLO? We can use the features of SignalFlow!
Figure 1-1. Example Error Budget using Alert Minutes
So you want to create an SLO, specifically based on the concept of "Downtime Minutes". What ingredients would you need to cook that up?

1. A measure of service health, such as a success rate, that tells us when the service is "down"
2. A way to count each minute the service spends in that down state
3. A monthly budget of Downtime Minutes to compare that count against

With what we've listed above, it sounds like #1 means some kind of alert on "success rate", #2 means a way to force alerts to happen every minute while in an alerting state, and #3 would take the count of those minute-long alerts and compare it against a constant number (the budgeted number of Downtime Minutes).
Fortunately, SignalFlow gives us the ability to create these sorts of minute-long alerts, and a way to track the number of alerts in a given cyclical period (week/month/quarter/etc.).
SignalFlow allows us to track alerts as a time series: the `alerts()` function counts the number of alerts from a detector during a period of time.
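For example, here is a minimal sketch of counting the alerts raised by a single detector (this assumes you have a detector saved with the name 'Success Ratio Detector'; substitute your own detector's name):

## Turn a detector's alerts into a time series and count how many are active
alerts(detector_name='Success Ratio Detector').count().publish(label='Active Alerts')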
This requires a detector whose alerts we can count, so let's create one first:
filter_ = filter('sf_environment', '*') and filter('sf_service', 'adservice') and filter('sf_kind', 'SERVER', 'CONSUMER') and (not filter('sf_dimensionalized', '*')) and (not filter('sf_serviceMesh', '*'))
A = data('spans.count', filter=filter_ and filter('sf_error', 'false'), rollup='rate').sum().publish(label='Success', enable=False)
B = data('spans.count', filter=filter_, rollup='rate').sum().publish(label='All Traffic', enable=False)
C = combine(100*((A if A is not None else 0)/B)).publish(label='Success Rate %')
constant = const(30)
detect(when(C < 98, duration("40s")), off=when(constant < 100, duration("10s")), mode='split').publish('Success Ratio Detector')
What are we doing in this Alert?

- The filter scopes us to spans from the adservice service.
- A is the rate of successful (non-error) spans and B is the rate of all spans, so C is the success rate as a percentage.
- The detector fires when the success rate stays below 98% for 40 seconds.
- The off condition compares a constant (30) against 100, which is always true, so the alert clears itself 10 seconds after firing and can immediately fire again if the success rate is still below target.

Essentially we have made an alert that will fire once every minute that the metric is breaching the alertable threshold.
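As a minimal sketch of just that "fire every minute" trick, stripped down to placeholders (the metric name, thresholds, and labels below are illustrative, not taken from the detector above):

## A placeholder SLI stream; in practice this would be a success-rate percentage like C above
sli = data('my.success.rate.metric', rollup='rate').sum().publish(label='SLI', enable=False)
## const(1) < 100 is always true, so the off condition clears the alert roughly 10s after it fires
always_clear = const(1)
detect(when(sli < 99, duration('40s')), off=when(always_clear < 100, duration('10s')), mode='split').publish('Recurring SLI breach')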
Next, we need to create a chart that tracks these Alert Minutes. Additionally, we’d like to have that chart reset monthly so we can know if our Error Budget for the month has been used up.
## Chart based on detector firing
AM = alerts(detector_name='THIS IS MY DETECTOR NAME').count().publish(label='AM', enable=False)
alert_stream = (AM).sum().publish(label="alert_stream")
downtime = alert_stream.sum(cycle='month', partial_values=True).fill().publish(label="Downtime Minutes")
## 99% uptime is roughly 438 minutes
budgeted_minutes = const(438)
Total = (budgeted_minutes - downtime).fill().publish(label="Available Budget")
What are we doing here?

- alerts() turns the alerts raised by our detector into a time series, and count() tells us how many are active at each point in time.
- Summing that stream produces our Alert Minutes: roughly one count for every minute the service is in breach.
- sum(cycle='month', partial_values=True) accumulates those Downtime Minutes across the calendar month and resets when a new month begins.
- budgeted_minutes is our monthly Error Budget: 99% uptime over the roughly 43,800 minutes in a month leaves about 438 minutes of allowable downtime.
- Total is what remains of the budget, so the chart counts down toward zero as downtime accumulates.
Figure 1-2. Example charting of Alert Minutes
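If you'd also like to see how much of the budget has been consumed as a percentage, a small sketch built on the same downtime and budgeted_minutes streams could be appended to the chart program above:

## Percentage of the monthly error budget consumed so far in the current cycle
budget_used_pct = ((downtime / budgeted_minutes) * 100).publish(label='Budget Consumed %')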
SignalFlow is an incredibly powerful tool! Some of its advanced features can lead you to really interesting discoveries!
As seen in the examples above, we can use SignalFlow to build alerts and charts that work together, giving us a new way to view our SLO/SLx concerns. For more detailed examples and deeper SignalFlow usage, check out the SignalFlow repo on GitHub.
Splunk Observability provides you with nearly endless possibilities! Think of this article as a jumping-off point for using SignalFlow in more advanced ways in Splunk Observability.
To easily get these types of SLO / Error Budget tracking functions into Splunk Observability using Terraform, check out the Observability-Content-Contrib repo on GitHub! If you're doing something cool with Splunk Observability, please consider contributing your own dashboards or detectors to this repo.
If you haven't checked out Splunk Observability, you can sign up to start a free trial of the Splunk Observability Cloud suite of products today!
This blog post was authored by Jeremy Hicks, Observability Field Solutions Engineer at Splunk, with special thanks to Bill Grant and Joseph Ross at Splunk.