Software monitoring, how does it work?
“We paid for a bunch of tools but we don’t know what we should be looking at. There are tons of charts that don’t seem to mean anything!”
If you talk to people about software monitoring, you’ve inevitably heard something similar to this. With so many possible metrics, it can feel like searching for a needle in a haystack, and even with curated dashboards there is inherent confusion about what is important. A great way to get started is to apply the four “Golden Signals” of Latency, Errors, Traffic, and Saturation (L.E.T.S.). These four concerns provide a fairly generic framework you can use to understand your software and infrastructure.
But they can also be applied to non-software scenarios! Interested? Read on!
Let’s create a hypothetical non-software example to illustrate the power of the Golden Signals! Imagine you run a busy restaurant. The restaurant seems to be doing really well, but you don’t quite know where to look to make improvements or cut costs, so you decide to start measuring. How do you decide what to measure? Applying L.E.T.S., you might be concerned about wait times (latency), mistakes (errors), customer volume (traffic), and capacity (saturation). Monitoring these concerns would allow you to make informed decisions about scaling aspects of your business and the impact of any changes:
- Latency metrics will help you decide whether you need to hire more cooks or servers, or upgrade equipment.
- Errors will help you measure improvements from better training, staffing, and equipment.
- Traffic helps you understand how much staff you need, when you need them most, and when you can schedule fewer people. Measuring customer traffic may even help you decide when it is time to expand!
- Saturation can help uncover scheduling deficiencies, issues preparing certain popular dishes in parallel, and other unknown efficiency gaps.
These are all things you may have been able to guess at as a restaurant owner, but without measuring them, how would you know for sure?
These basic concepts provide a basis for understanding complex systems in general, like our imaginary restaurant. But where they really shine is in monitoring complex software architectures!
In the age of microservices, maintaining specific domain knowledge of every element of a software system may be impractical. Applying the concepts behind L.E.T.S. can provide the foundation for basic troubleshooting of where issues arise in a complex system. An IT analyst who isn’t an expert on a given service can still use Latency, Errors, Traffic, and Saturation to more readily identify issues in connected systems.
This sort of foundational knowledge allows us to quickly check known points of failure before diving down rabbit holes. Not sure if you’re already measuring these sorts of things? Keep reading!
Figure 1-1. Splunk APM highlighting the L.E.T.S. metrics produced from Checkout to Payment in Hipster Shop. That 15% error rate is something we should look into!
Now that you have a conceptual framework built on a minimum set of four metrics, where do you get them? Distributed tracing, at its very core, is about the latency, errors, and traffic of requests traversing a system. When you feed your tracing data (sometimes called APM data) into a solution like Splunk APM, you start to get those metrics right away! Easy peasy.
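As a rough sketch of what feeding trace data looks like from the application side, here is a hypothetical checkout handler instrumented with the OpenTelemetry Python API. The handler and the `charge_payment` stub are made up for illustration, and the snippet assumes the OpenTelemetry SDK and a trace exporter are configured elsewhere; the point is simply that each span carries the duration, status, and count that a backend such as Splunk APM can turn into latency, error, and traffic charts.

```python
# A minimal sketch, assuming the OpenTelemetry SDK and a trace exporter are
# configured elsewhere in the application. The handler and payment call are
# hypothetical stand-ins.
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")

def charge_payment(order):
    # Stand-in for the real call to the payment service.
    ...

def handle_checkout(order):
    # One span per request: its duration is latency, its count is traffic,
    # and its status feeds the error rate.
    with tracer.start_as_current_span("POST /checkout") as span:
        span.set_attribute("order.item_count", len(order["items"]))
        try:
            charge_payment(order)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```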
But that still leaves saturation, which depends a bit more on your software and design decisions. Consider a couple of the questions you might ask about where a service could saturate: can a burst of requests exhaust a thread or connection pool? Can a queue or disk fill up faster than it drains? The answers to some of those questions are likely “no” for any given application in your environment. But taking the time to think them through and map out where resource constraints and saturation may cause failures will help reduce chart clutter and increase troubleshooting speed. Knowing the “known knowns” will help you focus on the issues at hand and reduce side-tracking.
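Because saturation is specific to your design, it usually has to be emitted explicitly rather than derived from traces. Below is a hedged sketch, again using the OpenTelemetry Python API, that reports how full a bounded work queue is; the queue stands in for whatever resource can actually saturate in your service (thread pools, connection pools, disk), and the metric and attribute names are illustrative rather than any required convention.

```python
# A sketch of one explicit saturation signal: how full a bounded work queue is.
# The queue, metric name, and servicename value are all illustrative.
import queue

from opentelemetry import metrics
from opentelemetry.metrics import CallbackOptions, Observation

work_queue: queue.Queue = queue.Queue(maxsize=100)

meter = metrics.get_meter("checkout-service")

def observe_queue_saturation(options: CallbackOptions):
    # Report queue fullness as a fraction between 0 and 1.
    yield Observation(
        work_queue.qsize() / work_queue.maxsize,
        {"servicename": "checkout"},
    )

meter.create_observable_gauge(
    "work_queue.saturation",
    callbacks=[observe_queue_saturation],
    description="Fraction of queue capacity currently in use",
)
```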
So the answer to “what should we be monitoring?” is simple: L.E.T.S.! Look at the points of Latency, Errors, and Traffic between microservices, between data centers, even between individual software components. Applying these methods across microservices that share common infrastructure patterns (e.g., JVMs running on EC2 and using DynamoDB, Python-based Cloud Functions with a Cloud SQL datastore, or any other repeatable combination) will also allow you to minimize dashboard and alert bloat. Imagine a single dashboard containing L.E.T.S. charts for each piece of commonly used infrastructure. By including a dimension like `servicename` across all of those metrics, that single dashboard can be filtered to quickly view a large swath of your microservices footprint (see the sketch below). Alerts can be minimized similarly by focusing on the L.E.T.S. fundamentals and repeatable infrastructure patterns. But let’s save that story for another time.
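To make the shared-dimension idea concrete, here is one possible shape for common instrumentation: every service records the same metric names and only the `servicename` attribute value changes, so a single set of L.E.T.S. charts can be filtered per service. The helper and metric names below are assumptions for illustration, not a convention that Splunk APM requires.

```python
# A sketch of shared instrumentation: identical metric names everywhere, with
# `servicename` as the only per-service dimension. All names are illustrative.
from opentelemetry import metrics

meter = metrics.get_meter("shared-instrumentation")

request_count = meter.create_counter("requests.count")            # traffic
error_count = meter.create_counter("requests.errors")             # errors
request_latency = meter.create_histogram("requests.duration_ms")  # latency

def record_request(servicename: str, duration_ms: float, failed: bool) -> None:
    attrs = {"servicename": servicename}
    request_count.add(1, attrs)
    request_latency.record(duration_ms, attrs)
    if failed:
        error_count.add(1, attrs)

# The same helper serves every service in the fleet, so one dashboard
# filtered on `servicename` covers them all.
record_request("checkout", duration_ms=42.0, failed=False)
record_request("payment", duration_ms=250.0, failed=True)
```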
Whether you’re a seasoned Splunk Observability user, just starting a trial, or only thinking about getting your feet wet, keep these principles in mind and you’ll quickly be on your way to greater observability into your software and infrastructure!
You can sign up to start a free trial of the Splunk Observability Cloud suite of products today!