Splunk’s recent update to its Machine Learning Toolkit (MLTK) is a good reason to spend a few paragraphs thinking through the links between Observability and machine learning. First, let us quickly review what Observability and machine learning are. Observability is the name given to a set of technologies intended to gather granular telemetry from digital environments and applications and, on the basis of that telemetry, to generate alerts, predictions, and root cause diagnoses for digital system performance problems. Machine learning, on the other hand, is the name given to a family of algorithms that, on their own or with input from external sources, discover patterns in large and evolving data sets. Even from these abstract descriptions, it is clear that machine learning and Observability are mutual value multipliers. Observability technology provides the data on which machine learning algorithms thrive, while those algorithms return the patterns or models that Observability practitioners need to drive the alerting, prediction, and root cause diagnosis that justify Observability in the first place.
At the moment, the market has been seized by something close to hysteria regarding the promise of AI, particularly in the form of Large Language Models (LLMs). While LLMs are certain to have an impact on Observability systems and other software associated with the management of digital environments and application portfolios, it is important to be clear about the boundaries and relationships between AI, machine learning, LLM algorithms, and other related technologies. AI is a general term for algorithms (and occasionally specialised hardware) whose design takes inspiration from human cognitive processes or the biology that supports them. This is not, of course, to say that these algorithms are meant to imitate human cognitive processes in any precise way; indeed, they are often deployed precisely because they are intended as enhanced versions of what humans can do on their own. The critical point, however, is that, in one way or another, they are meant to act in ways that resemble how we perceive, think, and decide.
AI algorithms then subdivide into two major types, one of which is the class of pattern discovery algorithms.
ML, then, is a subtype of that class - a subtype whose members learn how to extract patterns only after many interactions with varying data sets. There are many distinct types of ML algorithm but, for a variety of reasons (not the least of which has been very effective academic marketing), commercial interest in ML has largely focused on neural networks (NNs), whether flat or multi-layered. Neural networks, inspired by a rough model of the evolving state of synaptic connections among neurons in the brain, work by adjusting the weights given to values passed from one ‘neuron’ to another until a desired classification outcome is achieved. While the actual ‘training’ of a neural network is almost impossible to map and interpret mathematically, the claim has been that the results are empirically impressive, and, in any case, most commercially successful machine learning implementations are based on NNs of one sort or another. Finally, LLMs are an NN subtype in which the data sets consist largely of texts, and the patterns of question and answer discovered within those texts are used to generate responses to further questions. In the rest of this note, then, we will look at the interaction between Observability and ML as a whole; some remarks about LLMs will be made at the end, but the particular case of LLMs will be discussed in greater depth in a later note.
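To make the weight-adjustment idea concrete, here is a minimal sketch in plain Python (not drawn from MLTK or any other product; the layer sizes, learning rate, and step count are all invented for illustration) of a tiny two-layer neural network learning the classic XOR classification task by repeatedly nudging its weights toward the desired outcome:

```python
import numpy as np

# Illustrative sketch only: a tiny two-layer neural network learning XOR.
# All hyperparameters (layer sizes, learning rate, step count) are invented.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR labels

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> 4 hidden 'neurons'
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output 'neuron'

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(20000):
    # Forward pass: values flow from 'neuron' to 'neuron', scaled by weights.
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)

    # Backward pass: nudge each weight to shrink the classification error.
    grad_out = (out - y) * out * (1 - out)
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 -= 0.5 * hidden.T @ grad_out
    b2 -= 0.5 * grad_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ grad_hidden
    b1 -= 0.5 * grad_hidden.sum(axis=0, keepdims=True)

print(np.round(out, 2))  # should approach [[0], [1], [1], [0]] as weights settle
```

The point of the sketch is simply that the ‘learning’ consists of nothing more mysterious than iterated weight adjustment - which is also why the resulting model is so hard to interpret after the fact.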
Keeping these definitions in mind, let’s now look at the relationship between ML and Observability in a bit more depth. A close look at how Observability systems give practitioners the means to understand what is going on in the digital environments and applications under their care reveals that these technologies function on two levels. First, they ingest data directly from the environments and applications themselves - usually in the form of metrics, traces, or logs. Traditional monitoring systems were content to work with very sparse samples and usually confined themselves to metrics. Observability systems, however, recognising that modern digital environments and applications are increasingly modular and loosely coupled, try to work with as much data as possible while also expanding the range of data types. (Splunk and a handful of other vendors go a step further and provide technologies that ingest all of the data available.) The ingested data is extremely granular and signals digital states only indirectly, in much the same way that the symptoms of an illness only indirectly signal what is going on in the patient’s body. Consequently, the first step must allow the practitioner to move from the symptomatic signals carried by the granular data to the events or state changes actually taking place.
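As a deliberately simplified, hypothetical illustration of that first step (the service name, threshold, and data structures below are invented, and real Observability pipelines are vastly more sophisticated), consider how a stream of granular metric samples - the ‘symptoms’ - might be reduced to the state-change events they indirectly signal:

```python
from dataclasses import dataclass

# Hypothetical illustration: granular metric samples are 'symptoms';
# an event is only emitted when the inferred state actually changes.
@dataclass
class MetricSample:
    service: str
    ts: int          # timestamp (seconds)
    cpu_pct: float   # granular telemetry reading

def state_of(sample):
    # Symptom -> inferred state (toy rule; real systems infer far more).
    return "saturated" if sample.cpu_pct > 90.0 else "healthy"

def events_from(samples):
    last_state = {}
    for s in samples:
        state = state_of(s)
        if last_state.get(s.service) != state:   # state change detected
            yield {"service": s.service, "ts": s.ts, "new_state": state}
        last_state[s.service] = state

samples = [
    MetricSample("checkout", 1, 42.0),
    MetricSample("checkout", 2, 95.5),  # symptom of saturation
    MetricSample("checkout", 3, 97.1),  # same state, no new event
    MetricSample("checkout", 4, 38.2),  # recovery -> state change
]
for event in events_from(samples):
    print(event)
```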
While this move can, in some cases, be executed by means of relatively straightforward, repetitive processes (e.g., the assembly of a trace out of tags and span metrics), in other cases complex pattern discovery will be involved (e.g., the discovery of anomalies in time-series metrics). Here, ML can and increasingly will play a critical role in making sense of what the Observability system is ingesting. Indeed, even in the case of trace construction, once the scope of the trace expands beyond a self-contained environment (e.g., a Kubernetes cluster), the accuracy of the result is likely to be vastly improved by a good dose of ML. In short, during this first step, ML supports the conversion of low-level granular data into proper event or state change signals.
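As a sketch of what such pattern discovery might look like in the time-series case, the following example uses scikit-learn’s IsolationForest - one of many possible techniques, and not necessarily what any particular Observability product uses - to flag anomalous points in a synthetic latency metric:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Sketch only: a synthetic latency metric with a few injected spikes.
rng = np.random.default_rng(1)
latency_ms = rng.normal(loc=120.0, scale=10.0, size=500)
latency_ms[[100, 250, 400]] = [480.0, 510.0, 450.0]  # the 'anomalies'

# Give the model a little temporal context: each point plus a
# short rolling mean of the window around it.
window_mean = np.convolve(latency_ms, np.ones(5) / 5, mode="same")
features = np.column_stack([latency_ms, window_mean])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(features)  # -1 marks a suspected anomaly

print("suspected anomalies at indices:", np.flatnonzero(labels == -1))
```

In production, a model of this kind would be retrained continuously as the telemetry evolves - exactly the ‘many interactions with varying data sets’ that defines ML.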
But that, of course, is not the end of the story with Observability technologies. In part 2, we will explore what services ML performs once the practitioner knows what events or state changes are actually taking place in the environment of concern.