In part one, we introduced the duality of observability and controllability. As a reminder, observability is the ability to infer the internal state of a “machine” from externally exposed signals. Controllability is the ability to steer inputs so that the internal state reaches the desired outcome.
Observability, then, is a loop problem. We need to stop treating it as the end state of our challenge in delivering performant, quality experiences to our users and customers. In short, we need to move from “See something, Say something” to “See something, Do something”.
It’s great to know something has occurred (or is about to occur). However, if you can’t take action, then while the crash may be exciting, it won’t be cheap. Action can mean many things, but it normally falls into two steps: response and resolution.
Response is the first step in control (and recovery). You may need to think of this in triage terms: can it live without immediate attention, will it die no matter what, or will it live only with immediate attention? Fortunately for us, an application “death” is resurrectable.
So think of it this way: “My service is running slower than normal or desired, but work is still happening. Let me figure out what is wrong and then fix it.” Or you can end up with: “My service is dead or inhibiting work. Let me get back to steady state, then find and fix it.”
Response is a natural fit for automation techniques. Depending on the nature of the alert and the surrounding events, an automated response (via a runbook, or a trigger to a separate action script) can often be kicked off.
In our part one example, CatNapFriends, we pushed a new version of the function. Our monitoring noticed a detrimental change and our alerts fired. Now, assume that one of those alerts also connected to a trigger or script that immediately rolled our updates back to the last known good distribution. It would also alert the appropriate person to examine the problem and start looking into the root causes. In every case, an appropriate alert should be sent to tell the right group about the problem and the response taken; in short, keep people informed.
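To make that concrete, here is a minimal sketch of such a trigger in Python. Everything in it is hypothetical: the alert payload, the deploy and chat endpoints and the function names are illustrative stand-ins, not a real Splunk or CatNapFriends API.

```python
# Hypothetical automated-response hook: roll back on a latency regression,
# then tell the on-call group what happened. All endpoints and names are
# illustrative stand-ins, not a real API.
import json
import urllib.request

DEPLOY_API = "https://deploy.example.com/api"          # hypothetical deploy service
ONCALL_HOOK = "https://chat.example.com/hooks/oncall"  # hypothetical chat webhook


def rollback_to_last_known_good(service: str) -> str:
    """Ask the deploy service to redeploy the previous good distribution."""
    req = urllib.request.Request(f"{DEPLOY_API}/services/{service}/rollback",
                                 method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["deployed_version"]


def notify_oncall(service: str, message: str) -> None:
    """Post a summary to the on-call channel so people stay informed."""
    body = json.dumps({"service": service, "text": message}).encode()
    req = urllib.request.Request(ONCALL_HOOK, data=body, method="POST",
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)


def handle_alert(alert: dict) -> None:
    """Called by the alerting system when a latency threshold is crossed."""
    if alert.get("type") == "latency_regression":
        version = rollback_to_last_known_good(alert["service"])
        notify_oncall(alert["service"],
                      f"Auto-rolled back to {version}; root cause still open.")


if __name__ == "__main__":
    handle_alert({"type": "latency_regression", "service": "catnapfriends"})
```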
Which leads us into resolution.
Resolution is almost always an open loop. When we need to fix it moving forward, we’ll dive into code, infrastructure and/or configurations. That takes insight not only into the immediate issue we’re resolving but also into the other pieces that might affect, or be affected by, that resolution. For CatNapFriends and its sudden slow behavior, it could be a coding bug that fails to signal the completion of the image processing. It could be that your serverless configuration was set to the minimum memory and the image is swamping the container. It could be an unforeseen interaction between the new service and other elements. Fortunately, observability accounts for this visibility need and ties each of these signals back to the logs.
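If the memory floor turns out to be the culprit, even that resolution can be scripted. Here is a minimal sketch, assuming the image processor runs on AWS Lambda; the platform, function name and memory value are our assumptions, not something the CatNapFriends example specifies.

```python
# A sketch of one possible resolution, assuming AWS Lambda hosts the
# image-processing function; the function name and memory size are
# illustrative, not from the actual CatNapFriends deployment.
import boto3

client = boto3.client("lambda")
FUNCTION = "catnapfriends-image-processor"  # hypothetical function name

# Check the current setting before changing anything.
config = client.get_function_configuration(FunctionName=FUNCTION)
print(f"Current memory: {config['MemorySize']} MB")

# Raise the ceiling so large images no longer swamp the container.
client.update_function_configuration(FunctionName=FUNCTION, MemorySize=1024)
```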
These are usually called the pillars of observability: monitoring, tracing and logs.
And the pattern carries over to the controllability concepts: respond, resolve, redeploy.
We need observability to close the loop, and we need tools and techniques that let us do that with speed and precision, at scale. It’s a loop, leading through each phase and returning to the start. We work to stay in the monitoring phase most of the time, but we need to know that all of the other steps are readily available.
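As a purely illustrative sketch, the shape of that loop fits in a few lines of Python; the threshold, the metric reader and the respond and resolve actions below are stand-ins for real monitoring, runbooks and deploys.

```python
# A toy closed loop: observe, respond, resolve, redeploy, and return to
# observing. Every name and threshold here is a stand-in for real tooling.
import time


def healthy(metrics: dict) -> bool:
    """Illustrative health check against a single latency signal."""
    return metrics["p99_latency_ms"] < 500


def control_loop(read_metrics, respond, resolve_and_redeploy):
    while True:
        metrics = read_metrics()           # observe: monitoring, traces, logs
        if not healthy(metrics):
            respond(metrics)               # control: automated first response
            resolve_and_redeploy(metrics)  # control: fix forward, redeploy
        time.sleep(30)                     # and return to watching
```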
So, while observability takes its cues from control theory, the practical approach takes them from computer process control and SCADA (Supervisory Control and Data Acquisition) implementations. Visibility into your system is good on its own, but not enough to manage today’s ever-increasing complexity of containers and Kubernetes, microservices and serverless functions. With great observability comes the need for great control, whether the loops are open or closed.
Find out more about observability and what it means for you with Splunk.
----------------------------------------------------
Thanks!
Dave McAllister
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment, so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.