The dual of observability is controllability. Observability is the ability to infer the internal state of a “machine” from its externally exposed signals. Controllability is the ability to direct inputs so that the internal state reaches a desired outcome. We need both in today’s cloud native world.
Quite often we find that observability is presented as the desired end state. Yet, in modern computing environments, this isn’t really true. After all, how many times does an application stop working (or deliver incorrect results) and the response is to shrug and walk away? We need to move from a linear model to a loop model, from “See something, Say something” to “See something. Do something”.
Observability, then, is a loop problem, and we need to stop treating it as the end state of our challenge in delivering performant, quality experiences to our users and customers. So let’s break this apart into components.
In this view, observability is a quality of software, services, platforms, or products that allows us to understand how systems are behaving. It’s a window into the operating state of our applications and systems. And while it may conceptually extend beyond monitoring, it starts there. With observability comes alerting (hopefully). After all, when something goes wrong we need to become aware and respond as quickly as possible. In fact, this alerting capability may be what distinguishes monitoring from observability: detecting and alerting on what we know might go wrong (monitoring) versus detecting when something is wrong that we didn’t foresee (observability). We can think of observability as letting us see those activities (monitoring) and reducing mean time to detection by surfacing both known and unknown activities within our systems.
In some ways, observability should be monitoring at the Chuck Norris level, where action can lead to a response, even if we didn’t expect the original action.
Controllability consists, at a coarse level, of two parts, and one of them is often forgotten or lumped together with the other. The two are mean time to respond and mean time to resolution. Your immediate response is to get things working again, and that can take a different path than resolution, which is to identify the underlying causes and fix them so that development and continuous performance improvement can carry on. So let’s look at a reasonably simple example.
You run a social platform that allows people to upload pictures of their cats napping (CatNapFriends). The app has a number of microservices, and you’ve begun to introduce serverless elements for image processing. The application has been running quite well, meeting scale and staying responsive, leading to happy, purring users.
You’ve found that your function performs a bit slowly at lower volumes of traffic, so you change it to make it faster. Your testing shows that it runs 50% faster, so you are ready to deploy. Being a smart, safe and sane person, you roll out in a canary model. And something goes wrong as it scales out: your serverless functions start hitting durations in the seconds or returning errors.
And your users aren’t so happy anymore. Your Twitter feed blows up, and your Facebook is something you’d rather not see.
So the questions then are:
How fast did you recognize the problem?
How fast did you let someone know?
How fast did you get back to a performant system?
How did you figure out the root causes?
Each of these questions can have multiple answers, but some answers are probably better for your business and, honestly, your sanity.
How do we know something went wrong? Well, if you can’t see it, it never happened. Until, of course, you get a call from the CIO asking why the national news is reporting that your site is down. Learning about problems from users on Twitter might be a career-limiting move.
So we start with monitoring. While monitoring sometimes gets lumped into the category of seeing only the things we already know about, monitoring in observability can also flag things we don’t know about… yet. Sometimes you’ll hear observability described as the ability to find the unknown unknowns, but limited to root cause analysis. Stellar use of observability means also detecting those unknowns long before we are in the resolution stage.
Granularity and fidelity play a major role in your monitoring and detection. So let’s imagine that the serverless scenario above is the problem underway. You grab a data point every 5 seconds. No problem. But wait: in serverless, cold starts run 200-700 ms and warm starts 8-50 ms. In a 5 second window you can miss A LOT of serverless starts. And when we get to distributed tracing it can be even worse. You need to see every point, every bit of data on traces that you can. That fidelity is useful in the monitoring and detection phase, but it becomes crucial in the resolution phase.
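To make the granularity point concrete, here’s a minimal, vendor-neutral sketch in Python. The invocation rate and cold-start fraction are synthetic assumptions (the duration ranges mirror the numbers above); the point is simply that a 5 second polling interval rarely overlaps a sub-second function run, while per-invocation (trace/event) recording sees every start.

```python
import random

# Toy simulation (no particular vendor's agent): one minute of serverless
# invocations, comparing per-invocation recording against a gauge polled
# every 5 seconds. All rates and durations here are illustrative assumptions.
random.seed(42)

invocations = []  # (start_time_s, duration_s)
t = 0.0
while t < 60.0:
    cold = random.random() < 0.1                      # assume ~10% cold starts
    duration_ms = random.uniform(200, 700) if cold else random.uniform(8, 50)
    invocations.append((t, duration_ms / 1000.0))
    t += random.expovariate(5.0)                      # ~5 invocations per second

# Per-invocation (trace/event) view: every start, cold or warm, is visible.
print(f"invocations recorded: {len(invocations)}")
print(f"cold starts recorded: {sum(d >= 0.2 for _, d in invocations)}")

# 5-second polling view: an invocation is only "seen" if it happens to be
# in flight at the instant we sample, so sub-second runs mostly vanish.
sample_times = range(0, 60, 5)
samples_with_activity = sum(
    any(start <= s < start + dur for start, dur in invocations)
    for s in sample_times
)
print(f"5s samples that overlapped any invocation: "
      f"{samples_with_activity} of {len(sample_times)}")
```

Hundreds of invocations happen in that minute, yet only a handful of the 5 second samples ever land on one while it is running.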
Now that we have the data, we need to move to the next phase in our loop: detection.
Detection is, obviously, realizing that something went wrong (or went out of band). In traditional monitoring, detection is most often concerned with static thresholds. But there are a lot of other choices, like Heartbeat Check, Resource Running Out, Outlier Detection, Sudden Change, Historical Anomaly and Custom Threshold.
As you can tell, the list covers things we already know (static thresholds, heartbeats) but also starts verging into the unknowns (outliers, sudden changes). And as we move into more AI/ML-driven categories, we start to see even more observable events feeding detectors for the unknowns (already represented by historical anomaly).
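To make those two ends of the list a little more concrete, here is a minimal sketch (in Python, not tied to any particular monitoring product) of two detector styles: a static threshold for a failure mode we already expect, and a simple rolling outlier check that can fire on values we never set an explicit limit for. The window size, warm-up count and sigma multiplier are illustrative assumptions, not recommendations.

```python
from collections import deque
from statistics import mean, stdev

# Known-known: a static threshold on a metric we already expect to misbehave.
def static_threshold(value_ms: float, limit_ms: float = 500.0) -> bool:
    """Fire when a function duration exceeds a fixed limit."""
    return value_ms > limit_ms

# Verging into the unknowns: flag values that are outliers relative to recent
# history, even if nobody ever set an explicit limit for them.
class OutlierDetector:
    def __init__(self, window: int = 100, sigmas: float = 3.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def check(self, value: float) -> bool:
        fired = False
        if len(self.history) >= 10:  # wait for a little history first
            mu, sd = mean(self.history), stdev(self.history)
            fired = sd > 0 and abs(value - mu) > self.sigmas * sd
        self.history.append(value)
        return fired

detector = OutlierDetector()
for duration_ms in [20, 25, 18, 30, 22, 24, 19, 27, 21, 23, 650]:
    if static_threshold(duration_ms) or detector.check(duration_ms):
        print(f"alert: duration {duration_ms} ms looks wrong")
```

The static check only catches what we told it to catch; the outlier check catches the 650 ms duration because it is wildly out of line with recent behavior, whatever “normal” happens to be that day.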
Detection leads to alerting. And detection/alerting gets us to the first open/closed loop bifurcation. An open loop is one in which a person is the triggering element. A closed loop is one in which an automated element is the trigger. In detection, an open loop is when an operator spots a problem (like a metric out of range) on a monitoring dashboard and alerts the responsible people. A closed loop is one where the dashboard highlights the change/trigger (as in flashing red) and/or launches an automated alert. There are multiple combinations that can occur, but with observability, you want to figure out how best to keep to a closed loop process for as long as possible. You’ll still need open loops in places, but in detection and alerting in particular, a closed loop gives you the best results.
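As a sketch of the closed-loop version, the snippet below has the detector call an alerting webhook directly instead of waiting for someone to notice a dashboard. The webhook URL, payload shape and notify_on_call helper are hypothetical stand-ins for whatever paging or automation hook your tooling actually exposes.

```python
import json
import urllib.request

# Hypothetical paging/automation webhook; substitute your own tooling's URL.
ALERT_WEBHOOK = "https://alerts.example.com/hook"

def notify_on_call(message: str) -> None:
    """Closed loop: the detector itself raises the alert; no human in the trigger path."""
    payload = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        ALERT_WEBHOOK, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(request, timeout=5)
    except OSError as exc:
        # The placeholder URL won't resolve; in real use, escalate a failed
        # notification rather than silently swallowing it.
        print(f"notification failed: {exc}")

def on_detection(metric: str, value: float, threshold: float) -> None:
    # Open loop alternative: chart the value and hope someone is watching.
    # Closed loop: call the automation the moment the detector fires.
    if value > threshold:
        notify_on_call(f"{metric} at {value} exceeded threshold {threshold}")

on_detection("p99_function_duration_ms", 2400.0, 1000.0)
```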
But controllability is where the loop really closes — and we’ll cover that more in part two.
Find out more about Observability and what it means for you with Splunk.
----------------------------------------------------
Thanks!
Dave McAllister
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company — with over 7,500 employees, more than 1,020 patents awarded to Splunkers to date and availability in 21 regions around the world — and offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.