Plus, putting custom metrics into action - observing my NAS with OpenTelemetry and the Splunk O11y platform
In my role here at Splunk, I get the opportunity to speak to many of our customers about the challenges they face in building and running the complex platforms that power today’s environments. What’s more, the challenges are largely the same. In a nutshell, it is difficult to understand when something has gone wrong and to quantify its impact, so that the issues - there is usually more than one - can be prioritised, troubleshot quickly and assigned to the right team to fix. Having the right visibility and information to hand, cutting through alert storms and false positives, and getting the right information to the right team quickly has always been a challenge in the monitoring space. It is even more so today, as the digital platforms we build are constantly changing, evolving and adopting innovative tech. This is why observability, and the approach it encompasses, is so important in providing visibility into these platforms.
The other key challenge is the explosion in telemetry data across all three categories of observability: metrics, traces and logs. The volume of data that can be collected from these platforms is huge, and this presents a number of issues - from collecting the data to processing and analysing it so that it delivers maximum value from an observability perspective.
Focusing on metrics, a common theme throughout these discussions is the ability not only to ingest the common, typical metrics that describe the platform - usually the back-end infrastructure and applications - but also to ingest custom metrics, so that a more detailed and complete picture of how the platform is working, performing and behaving can be built. This complete picture allows SREs, DevOps engineers, developers and others to quickly understand whether something is going wrong and why. I frequently hear that if teams had this metric or that metric, they would be better placed to understand whether an issue was occurring and affecting their platforms, and so be able to respond to it faster.
Observability has its roots in engineering and, as a concept, is over forty years old. It is all about observing the elements of a system that will quickly tell you about its state, and therefore whether there is a problem that needs fixing. This is why it differs from traditional monitoring and why it matters for modern digital platforms: observability is not a new name or branding for monitoring, but a different approach to managing today's complex platforms - and being able to select key metrics, and to create custom ones, is necessary to manage and run these digital platforms.
The types of metrics needed vary hugely and depend on multiple factors, including what the platform does, what needs to be reported on, which areas lack visibility, the tech stack, hybrid platforms and so on. They may also be the key indicators of past issues that need to be observed so those issues do not happen again, or they may provide visibility into other parts of the platform and tech stack so that no component is left out of the overall picture. Collecting custom metrics means you can do really cool stuff too - check out this blog from my colleagues, who leveraged OpenTelemetry - OTel for short - to collect data from a charity motorcycle ride across the UK. More on how OTel fits into collecting metrics later on.
Another issue affecting teams today is multiple tooling - or tool sprawl. This is not a new problem either; it has been around the monitoring space for years. However, today’s large, complex platforms, combined with the significant growth of innovative tech, have made it much worse. The use of more technologies, for example, has increased the number of tools: an additional tool can easily be adopted to observe a new technology, or perhaps it ships with it.
The challenge with multiple tooling is that it typically delivers visibility in isolated silos that are neither correlated nor contextualised, so it does not provide that complete-picture view of the platform - and, more to the point, it still leaves gaps. It also contributes to DIY approaches to monitoring, as developers can easily spend their development time building monitoring tooling to provide visibility into the specific areas they feel need it. Whilst this might give immediate visibility into a key area, it adds to the problem of uncorrelated data, multiple tools and visibility gaps that prevent a complete picture from being built. Furthermore, this DIY tooling can end up restricted to the developers themselves, with no access for the SREs, DevOps engineers and others involved in managing or running the platform.
The other key area gaining significant momentum in the observability of these platforms is the requirement to have complete control over which metrics are actually collected. The tech explosion powering these platforms creates huge volumes of metrics - even the typical infrastructure ones have increased significantly due to the number of containers and microservices in use today. The ability to control not only what is collected but also how it is collected has become a key focus when evaluating suitable O11y solutions.
The organisational make-up of companies has also changed dramatically. Like the platform itself, which has been broken down into smaller components through containers and microservices, the engineering organisation has followed suit, with many smaller teams - commonly known as tribes and scrum teams - responsible for smaller components. This in turn has fuelled the drive for these teams to control their own observability and monitoring data, covering both what is observed and monitored and how that telemetry data is collected. The traditional ‘one size fits all’ approach - a central team choosing the monitoring tooling and ‘supplying’ it to these teams - no longer works. What is needed today is the ability to have complete control over what is collected, to standardise a data-collection approach that fits the goals of the organisation and the teams within it, and to be able to change it quickly if and when needed.
Traditional monitoring approaches are typically rooted in the principle of the vendor deciding the key metrics they think are needed to monitor a platform, wrapping that logic into a proprietary, heavyweight monitoring agent that is deployed into the platform, and using various polling and sampling techniques to gather the monitoring data. The rationale is that this is scalable and provides ‘enough’ data to identify an issue when it happens. However, whilst that might have worked in the age of the monolith, it does not work in today’s large, complex, microservices-based, multi-technology environments.
These approaches lead to visibility gaps, making it difficult to identify a problem, quantify its impact and prioritise accordingly; to harder troubleshooting, because the sampling approach means data is missed; and to a lack of control over what data is collected and how. Organisations then try to standardise on the vendor’s approach to collecting data, which makes it difficult to move to a different vendor later, and they need other tooling to supplement the monitoring if that vendor’s agent doesn’t support a technology they are using.
Here at Splunk, we use OpenTelemetry - OTel - to collect the telemetry data that powers the observability of these platforms. This industry-standard, open source approach separates data collection from the vendor that processes the data, and gives you complete control over both what you collect from a platform and how you collect it. OTel lets your teams build their own standard around telemetry data collection and define the ‘standard’ metrics that are specific to your platform; in effect, it gives organisations one common language for building metrics. Generating and sending custom metrics is easy with OTel and the Splunk platform. Check out this blog on why OTel is needed in observability.
Now, let’s put this theory into action! Like most home networks today, mine has a NAS attached to it for storing a wide variety of media, including photos, videos and music. And like most NAS devices, it runs a number of background processes, including indexing services for the media files. I have noticed that these processes can occasionally slow down or get stuck, with the net result that the NAS’s memory gets exhausted and/or there is not enough memory left for other apps to start or be used. Memory metrics are therefore a key set of metrics to observe to ensure the NAS is performing well and to give me early warning of an issue, particularly an indexing issue. So, I used the approach below to extract this information from my NAS and send it into the Splunk O11y platform:
We need to collect the metrics from the NAS device and then construct a JSON file that we can send into the Splunk O11y platform via the OTel Collector. This is made even easier because the JSON file follows the OpenTelemetry protocol (OTLP) specification - please click here for more details about the protocol.
The OTel Collector was downloaded and deployed onto a small VM, and it uses an access token from my Splunk O11y tenant to authenticate with the platform. With its default settings, the collector automatically starts observing and monitoring the VM itself, which can easily be viewed in the Splunk O11y UI as shown below:
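For reference, getting the collector onto a Linux VM is a short scripted install. This is a rough sketch only - the realm and access token values below are placeholders, and your tenant’s setup wizard provides the exact command to use:

# Download Splunk's OpenTelemetry Collector installer script and run it,
# passing the realm and access token from your own Splunk O11y tenant.
curl -sSL https://dl.signalfx.com/splunk-otel-collector.sh > /tmp/splunk-otel-collector.sh
sudo sh /tmp/splunk-otel-collector.sh --realm <REALM> -- <ACCESS_TOKEN>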
In the UI, we can see straight away an early warning of an issue: the VM is running low on disk space, with disk utilisation expected to reach the threshold limit shortly. A quick change to the VM config will resolve this.

For the NAS memory status, I created a short script that connects to the NAS, runs a standard Linux memory command to pull back the memory metrics, and then selects the key ones to report on. A JSON file is created for each metric and sent into the Splunk O11y platform via the OTel Collector. An example of the script-created JSON file is shown below:
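For illustration, a minimal version of such a script might look like the sketch below - the NAS hostname, SSH access, the use of the Linux free command and the nas.memory.free metric name are all assumptions for the sketch, so adapt them to your own device and naming scheme:

#!/bin/sh
# Illustrative sketch only: hostname, metric name and output path are assumptions.
NAS_HOST="nas.local"                  # hypothetical NAS hostname
OUT_FILE="/tmp/nas_mem_free.json"     # where the OTLP/JSON payload is written

# Pull memory stats from the NAS over SSH and keep the free memory, in bytes
MEM_FREE=$(ssh "$NAS_HOST" free -b | awk '/^Mem:/ {print $4}')

# Current time in nanoseconds since the epoch, as OTLP expects
TIME_NS=$(date +%s%N)

# Write an OTLP/JSON metrics payload containing a single gauge data point,
# tagged with the 'nas-storage' service name used later to find it in the UI
cat > "$OUT_FILE" <<EOF
{
  "resourceMetrics": [{
    "resource": {
      "attributes": [
        { "key": "service.name", "value": { "stringValue": "nas-storage" } }
      ]
    },
    "scopeMetrics": [{
      "metrics": [{
        "name": "nas.memory.free",
        "unit": "By",
        "gauge": {
          "dataPoints": [{
            "asInt": "${MEM_FREE}",
            "timeUnixNano": "${TIME_NS}"
          }]
        }
      }]
    }]
  }]
}
EOF

The key parts are the service.name resource attribute, the metric name and unit, and the gauge data point with its value and nanosecond timestamp.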
Once the JSON file has been created, it just needs to be sent to the collector, which then forwards it to my Splunk tenant. In my example, I send the JSON with a curl command at the command line, passing it the JSON file as shown below. I have complete control over how frequently these metrics are collected and sent into the platform.
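As a sketch, assuming the collector’s OTLP/HTTP receiver is enabled on its default port (4318) on the same VM and the file path matches the script above, the send looks something like this:

# Post the OTLP/JSON payload to the local collector's OTLP HTTP receiver;
# the collector then forwards it on to the Splunk O11y tenant.
curl -s -X POST http://localhost:4318/v1/metrics \
     -H "Content-Type: application/json" \
     --data-binary @/tmp/nas_mem_free.json

Wrapping the script and this curl call in a cron job is one simple way to control how often the metrics are collected and sent.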
In step one, when creating the JSON metric data file, I defined a service name of ‘nas-storage’. All the metrics I have sent to the tenant are part of this service, so you can quickly search for them in the UI as shown below:
I now have custom visibility into the NAS and can keep an eye on its performance. With performance being tracked, any blips, outliers or outages are easily visible. I can also create detectors and alerts to tell me automatically when there is an issue, as well as configure remedial actions for when a problem occurs.
Leveraging custom metrics, and having complete control over how all metrics are collected and sent into your O11y platform, is key to managing the complex modern platforms of today (and those of tomorrow!).
Try Splunk O11y for yourself by signing up for a free trial. Check out the links below for some great further reading:
My thanks to John Murdoch, our local OTel SME, for his valuable input into this blog.