Last October, Splunk Observability Evangelist Jeremy Hicks wrote a great piece here about the Four Golden Signals of monitoring. Jeremy’s blog comes from the perspective of monitoring distributed cloud services with Splunk Observability Cloud, but the concepts of Four Golden Signals apply just as readily to monitoring traditional on-premises services and IT infrastructure. Since a large number of our customers — especially in the national defense establishment — still rely heavily on such IT infrastructure environments, I thought it would be a good idea to address modern approaches to infrastructure monitoring in this context using Splunk Enterprise or Splunk Cloud Platform and Splunk IT Service Intelligence.
Questions like “Where do I start?” and “What do I monitor?” are among the first challenges people face when building out their monitoring and observability capabilities. Some things, like host and OS metrics, might be obvious, but they tell only part of the story.
A good place to start, as Jeremy lays out well in his article, is with the Four Golden Signals. So what are they? According to Google’s SRE Book:
“The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.” (source)
I prefer to think of them as the four golden signal categories since that’s what they really describe, but we’ll stick with Google’s (and Jeremy’s) terminology so as not to muddy the waters.
Splunk IT Service Intelligence (ITSI) is an AIOps, analytics and IT management solution that helps teams predict incidents before they impact customers.
Using AI and machine learning, ITSI correlates data collected from monitoring sources and delivers a single live view of relevant IT and business services, reducing alert noise and proactively preventing outages.
Latency

Simply put, latency is the amount of time it takes to service a request. This can be something very apparent to the end user, such as web page load time, or something further back in the stack that contributes to the user experience but isn’t directly visible to the end user, such as database response time.
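As a concrete (and hedged) example, if your web access logs were already in Splunk, a search like the following would chart average and 95th-percentile page response times. The index name (web), sourcetype (access_combined), and the response_time field (in milliseconds) are assumptions here; substitute whatever your environment actually uses.

```
index=web sourcetype=access_combined
| timechart span=5m avg(response_time) AS avg_latency_ms perc95(response_time) AS p95_latency_ms
```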
Traffic

Traffic is any measure of demand placed on a system. It could be the number of HTTP requests per second, the number of concurrent user sessions, the number of database transactions per second, etc.
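Using the same hypothetical web index as above, a traffic signal can be as simple as a request count per time bucket:

```
index=web sourcetype=access_combined
| timechart span=1m count AS requests
| eval requests_per_sec=round(requests/60, 2)
```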
Errors

Errors are the rate of requests that fail: the number of HTTP 500 errors per second, dropped packets on a network interface, I/O errors reported by a disk device, and so on. These can be policy errors, too. For example, if you have a service level objective (SLO) for an average page load time of one second, each page load exceeding one second is an error.
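Sticking with the hypothetical web access logs from the latency example, one sketch that captures both kinds of errors counts HTTP 5xx responses alongside page loads that blow a one-second SLO (again assuming a response_time field in milliseconds):

```
index=web sourcetype=access_combined
| eval is_http_error=if(status>=500, 1, 0)
| eval is_slo_error=if(response_time>1000, 1, 0)
| timechart span=5m sum(is_http_error) AS http_errors sum(is_slo_error) AS slo_errors count AS total
| eval error_rate_pct=round(100 * http_errors / total, 2)
```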
Saturation

Very simply, saturation is how “full” your system is. Some indicators of saturation are clear-cut, such as file system or physical or logical disk utilization. Others might require experiential knowledge of the environment, such as CPU utilization or memory pressure.
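For a clear-cut saturation indicator like disk utilization, a metrics-style search might look like the sketch below. Both the index name (infra_metrics) and the metric name (storage.used_percent) are placeholders; the actual names depend entirely on which add-on is collecting the data.

```
| mstats avg(storage.used_percent) AS used_pct WHERE index=infra_metrics span=5m BY host
| where used_pct > 90
```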
So, what does this all mean from a Splunk Enterprise or Splunk Cloud Platform perspective? Well, as with nearly every other effort with Splunk, it starts with identifying the data sources needed to solve the problem. It also helps at this point to look beyond the infrastructure and consider the applications and services that rely on that infrastructure. This is also a good time to look at the service level agreement (SLA) you have with your end user (you do have one of those, right? RIGHT?) and see what service level objectives (SLOs) have been established. Your SLOs should drive prioritization for what you’re going to monitor and what data sources you’ll need to do that. If you don’t have established SLOs, don’t let that stop you from pressing on.
The table below contains some suggested data sources, and describes where they align with the Four Golden Signals:
| Data Source | Latency | Traffic | Errors | Saturation |
| --- | --- | --- | --- | --- |
| Web Server Logs | X | X | X | |
| Application Logs | X | | X | |
| Host Metrics | | X | X | X |
| DB Server Logs/Metrics | X | X | X | X |
| Network Logs/Metrics | X | X | X | X |
| Virtual Infrastructure Logs/Metrics | X | X | X | X |
The sources in the table aren’t meant to be exhaustive, but they do cover the majority of use cases in traditional IT systems monitoring. You may, for example, be monitoring the components of a manufacturing control system, in which case you’ll likely have some combination of the data sources listed in the table, with OT/IoT sensor data added to the mix.
Let’s take a closer look at the data sources listed above.
Web Server Logs
Web servers are commonly found at the top of a multi-tiered application stack, and they offer a wealth of information about both the end-user experience and the overall health of the application stack. Latency metrics like page load times can often be found here, as can error metrics like 4xx and 5xx status codes. Traffic metrics such as requests per second can also be gathered from this source.
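If you want a quick feel for the error side of that data, a hedged starting point (same hypothetical web index and access_combined sourcetype as earlier) is simply trending 4xx and 5xx responses over time:

```
index=web sourcetype=access_combined status>=400
| timechart span=5m count BY status
```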
Some apps or add-ons from Splunkbase that might assist include:
Application Logs
The list of applications that might need monitoring is virtually endless, and trying to list even common examples here would be futile. But at a minimum, you should be able to collect application error events that give some insight into an application’s health. With luck, your application’s logs will also include some latency metrics. The best approach here is to collect a couple of days’ worth of application logs in Splunk, then search them for keywords and messages to see which event logs provide the most relevant metrics.
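One way to do that initial survey, as a sketch: pull a couple of days of a hypothetical application index and rank sourcetypes by how often common error keywords appear.

```
index=app_logs earliest=-2d
| rex field=_raw "(?i)(?<level>error|warn|fatal|exception)"
| stats count BY sourcetype, level
| sort - count
```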
Host Metrics
Whether running on bare metal or on virtual infrastructure, metrics around CPU, memory, network, and disk/file system utilization are critical to understanding how heavily loaded - or saturated - the systems hosting your applications are. Splunk makes it easy to collect these metrics when using the add-on appropriate for the OS:
(See the note in the “Virtual Infrastructure Logs and Metrics” section later in this blog for a word of caution on collecting host metrics from virtual hosts.)
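Once those host metrics are being collected, a basic CPU saturation view might look like the sketch below. It assumes the event-based cpu sourcetype from the Splunk Add-on for Unix and Linux with its pctIdle field, and an index named os; Windows hosts and metrics-index deployments will use different names.

```
index=os sourcetype=cpu
| eval cpu_used_pct = 100 - pctIdle
| timechart span=5m avg(cpu_used_pct) BY host
```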
Database Server Logs and Metrics
Databases form the underpinnings of many applications, and performance issues there can have a negative impact on the entire application stack. Fortunately, most database solutions expose plenty of performance metrics that can be collected, covering all four of the Golden Signals.
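Even before you wire up dedicated performance metrics, the database’s own logs can yield an error signal. This sketch assumes a hypothetical db_logs index and simply trends a few common failure keywords; the index name and keywords are placeholders to adapt to your database product.

```
index=db_logs ("deadlock" OR "timeout" OR "connection refused")
| timechart span=15m count AS db_errors
```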
Some apps or add-ons for databases from Splunkbase include:
Network Logs and Metrics
Network devices comprise the “plumbing” infrastructure our applications rely on, and monitoring their performance is key to understanding the root causes of outages and bottlenecks. Apps and add-ons for the leading network vendors’ hardware exist in Splunkbase to assist in collecting data from those sources. Among them are:
SNMP, of course, is also an excellent if sometimes voluminous source of information on network health. SNMP polling and trap information can easily be collected using tools like Netflow and SNMP Analytics for Splunk or Splunk Connect for SNMP.
At a minimum, event logs from network devices can be sent to Splunk Connect for Syslog for easy ingestion into Splunk.
Virtual Infrastructure Logs and Metrics

While collecting host metrics is obviously important, those metrics will only tell part of the story when the host is running as a guest on virtual infrastructure. Host metrics might tell you a host is running at just 10% CPU utilization, but they won’t tell you how much time that host has been waiting for physical CPU resources. For metrics that are unique to virtual infrastructure, you need to collect metrics from the virtual infrastructure itself (see the sketch after the caution note below). Once again, Splunk add-ons are here to help:
A word of caution when collecting host metrics from virtual hosts: If you’re collecting virtual infrastructure metrics, you’ll get all the relevant metrics for individual virtual hosts from the virtual infrastructure, so DON’T also collect host metrics via the Windows or Unix/Linux add-on. This will result in duplication if you use the Splunk IT Essentials Work or Splunk IT Service Intelligence apps.
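To make the CPU-ready example above concrete, here’s a hedged sketch of the kind of search you might run once virtual infrastructure metrics are flowing in. The index name (vmware_metrics) and the metric name (cpu.ready.summation, VMware’s CPU ready counter in milliseconds) are assumptions to verify against what your add-on actually writes; dividing ready milliseconds by 200 converts a 20-second sample interval into a percentage, and the 5% threshold is just an illustrative cutoff.

```
| mstats avg(cpu.ready.summation) AS cpu_ready_ms WHERE index=vmware_metrics span=5m BY host
| eval cpu_ready_pct = round(cpu_ready_ms / 200, 2)
| where cpu_ready_pct > 5
```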
Often when implementing a monitoring solution, there’s not enough thought given to what’s truly relevant and as a result, not enough signals — or worse, too many — are included. It’s my hope that this blog will help you focus your attention on relevant, high-value signals.
Armed with this information, now’s a good time to take a look at what data you’re collecting and see if you’re truly getting what you need for an accurate view of your critical infrastructure and applications. If you think you might want to get another set of eyes on it, reach out to your Splunk account team and have a chat with your assigned solutions engineer or customer success manager. They’ll be glad to help!