Splunk’s recent update to its Machine Learning Toolkit (MLTK) is a good reason to spend a few paragraphs thinking through the links between Observability and machine learning. First, let us quickly review what Observability and machine learning are. Observability is the name given to a set of technologies intended to gather granular telemetry from digital environments and applications and, on the basis of that telemetry, to generate alerts, predictions, and root cause diagnoses for digital system performance problems. Machine learning, on the other hand, is the name given to a family of algorithms that, on their own or with input from external sources, discover patterns in large and evolving data sets. Even from these abstract descriptions, it is clear that machine learning and Observability are mutual value multipliers. Observability technology provides the data on which machine learning algorithms thrive, while those algorithms return the patterns or models that Observability practitioners need to drive the alerting, prediction, and root cause diagnosis that justify Observability in the first place.
At the moment, the market has been seized by something close to hysteria regarding the promise of AI, particularly in the form of Large Language Models (LLMs). While LLMs are certain to have an impact on Observability systems and other software associated with the management of digital environments and application portfolios, it is important to be clear about the boundaries and relationships between AI, machine learning, LLM algorithms, and other related technologies. AI is a general term for algorithms (and occasionally specialised hardware) whose design takes inspiration from human cognitive processes or the biology that supports them. This is not, of course, to say that these algorithms are meant to imitate human cognitive processes in any precise way; indeed, they are often deployed precisely because they are intended as enhanced versions of what humans can do on their own. The critical point, however, is that, in one way or another, they are meant to act in ways that resemble how we perceive, think, and decide.
AI algorithms then subdivide into two major types, one of which is the class of pattern discovery algorithms.
ML, then, is a subtype of that class - a subtype whose members learn how to extract patterns only after many interactions with varying data sets. There are many distinct types of ML algorithm but, for a variety of reasons (not the least of which has been very effective academic marketing), commercial interest in ML has largely focused on neural networks (NNs), whether flat or multi-layered. Neural networks, inspired by a rough model of the evolving state of synaptic connections among neurons in the brain, work by adjusting the weights given to values passed from one ‘neuron’ to another until a desired classification outcome is achieved. While the actual ‘training’ of a neural network is almost impossible to map and interpret mathematically, the claim has been that the results are empirically impressive, and, in any case, most commercially successful machine learning implementations are based on NNs of one sort or another. Finally, LLMs are an NN subtype in which the data sets consist largely of texts, and the patterns of question and answer discovered within those texts are used to generate responses to further questions. In the rest of this note, then, we will look at the interaction between Observability and ML as a whole; some remarks about LLMs will be made at the end, but the particular case of LLMs will be discussed in greater depth in a later note.
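To make the weight-adjustment idea concrete, here is a minimal sketch in plain Python (not drawn from MLTK or any other product; the layer sizes, learning rate, and step count are all invented for illustration) of a tiny two-layer neural network learning the classic XOR classification task by repeatedly nudging its weights toward the desired outcome:

```python
import numpy as np

# Illustrative sketch only: a tiny two-layer neural network learning XOR.
# All hyperparameters (layer sizes, learning rate, step count) are invented.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR labels

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))  # input -> 4 hidden 'neurons'
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))  # hidden -> output 'neuron'

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(20000):
    # Forward pass: values flow from 'neuron' to 'neuron', scaled by weights.
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)

    # Backward pass: nudge each weight to shrink the classification error.
    grad_out = (out - y) * out * (1 - out)
    grad_hidden = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 -= 0.5 * hidden.T @ grad_out
    b2 -= 0.5 * grad_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * X.T @ grad_hidden
    b1 -= 0.5 * grad_hidden.sum(axis=0, keepdims=True)

print(np.round(out, 2))  # should approach [[0], [1], [1], [0]] as weights settle
```

The point of the sketch is simply that the ‘learning’ consists of nothing more mysterious than iterated weight adjustment - which is also why the resulting model is so hard to interpret after the fact.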
Keeping these definitions in mind, let’s now look at the relationship between ML and Observability in a bit more depth. A close look at how Observability systems give practitioners the means to understand what is going on in the digital environments and applications under their care reveals that these technologies function on two levels. First, they ingest data directly from the environments and applications themselves - usually in the form of metrics, traces, or logs. Traditional monitoring systems were content to work with very sparse samples and usually confined themselves to metrics. Observability systems, however, recognising that modern digital environments and applications are increasingly modular and loosely coupled, try to work with as much data as possible while also expanding the range of data types. (Splunk and a handful of other vendors go a step further and provide technologies that ingest all of the data available.) The ingested data is extremely granular and signals digital states only indirectly, in much the same way that the symptoms of an illness only indirectly signal what is going on in the patient’s body. Consequently, the first step must allow the practitioner to move from the symptomatic signals carried by the granular data to the events or state changes actually taking place.
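As a deliberately simplified, hypothetical illustration of that first step (the service name, threshold, and data structures below are invented, and real Observability pipelines are vastly more sophisticated), consider how a stream of granular metric samples - the ‘symptoms’ - might be reduced to the state-change events they indirectly signal:

```python
from dataclasses import dataclass

# Hypothetical illustration: granular metric samples are 'symptoms';
# an event is only emitted when the inferred state actually changes.
@dataclass
class MetricSample:
    service: str
    ts: int          # timestamp (seconds)
    cpu_pct: float   # granular telemetry reading

def state_of(sample):
    # Symptom -> inferred state (toy rule; real systems infer far more).
    return "saturated" if sample.cpu_pct > 90.0 else "healthy"

def events_from(samples):
    last_state = {}
    for s in samples:
        state = state_of(s)
        if last_state.get(s.service) != state:   # state change detected
            yield {"service": s.service, "ts": s.ts, "new_state": state}
        last_state[s.service] = state

samples = [
    MetricSample("checkout", 1, 42.0),
    MetricSample("checkout", 2, 95.5),  # symptom of saturation
    MetricSample("checkout", 3, 97.1),  # same state, no new event
    MetricSample("checkout", 4, 38.2),  # recovery -> state change
]
for event in events_from(samples):
    print(event)
```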
While this move can, in some cases, be executed by means of relatively straightforward, repetitive processes (e.g., the assembly of a trace out of tags and span metrics), in other cases complex pattern discovery will be involved (e.g., the discovery of anomalies in time-series metrics). Here, ML can and increasingly will play a critical role in making sense of what the Observability system is ingesting. Indeed, even in the case of trace construction, once the scope of the trace expands beyond a self-contained environment (e.g., a Kubernetes cluster), the accuracy of the result is likely to be vastly improved by a good dose of ML. In short, during this first step, ML supports the conversion of low-level granular data into proper event or state change signals.
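As a sketch of what such pattern discovery might look like in the time-series case, the following example uses scikit-learn’s IsolationForest - one of many possible techniques, and not necessarily what any particular Observability product uses - to flag anomalous points in a synthetic latency metric:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Sketch only: a synthetic latency metric with a few injected spikes.
rng = np.random.default_rng(1)
latency_ms = rng.normal(loc=120.0, scale=10.0, size=500)
latency_ms[[100, 250, 400]] = [480.0, 510.0, 450.0]  # the 'anomalies'

# Give the model a little temporal context: each point plus a
# short rolling mean of the window around it.
window_mean = np.convolve(latency_ms, np.ones(5) / 5, mode="same")
features = np.column_stack([latency_ms, window_mean])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(features)  # -1 marks a suspected anomaly

print("suspected anomalies at indices:", np.flatnonzero(labels == -1))
```

In production, a model of this kind would be retrained continuously as the telemetry evolves - exactly the ‘many interactions with varying data sets’ that defines ML.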
But that, of course, is not the end of the story with Observability technologies. In part 2, we will explore what services ML performs once the practitioner knows what events or state changes are actually taking place in the environment of concern.