Observability has been tied up with causality from its origins in the mathematical realm of control theory in the early 1960s. A system (of any kind, hardware or software, natural or engineered) was deemed to be ‘observable’ if it generated self-descriptive data from which it was possible to infer how states of the system were causally related to one another. Even now, long after the term has lost the mooring of its original context and has come to be applied to a broad range of technologies and practices in the worlds of DevOps and IT operations, many professionals associate observability with the ability to understand the root causes of system behaviour, particularly when that behaviour is associated with outages and poor performance.
But what precisely does the term ‘causality’ denote? It is surprisingly hard to nail that denotation down, despite the fact that determining the root causes of performance issues is almost always cited as one of the top asks of any monitoring or observability platform. In fact, it is not just an issue for IT professionals. Mathematicians, statisticians, physicists, economists, medical researchers, and lawyers all have trouble articulating what has proven, through the ages, to be a very slippery concept. We all know that causality is not the same thing as correlation, but what is the added spice that causality brings to the table over mere correlation?
We are not seeking absolute truth here. Some of the greatest works of Western philosophical literature (e.g. Kant's Critique of Pure Reason) have been devoted to understanding what is special about causality and whole religions (e.g. the various Buddhisms) have turned on the answer to what it means for one thing or event to cause another. Instead, let us ask the question another way. In what ways does pure correlation fall short of what we need when we are trying to manage the behaviour and performance of a complex IT system that is meant to supply continuous service to a user community?
Correlation is essentially about prediction. A data set is composed of elements - measurements, texts, images, etc. - and each of these elements possesses a certain number of attributes which can themselves be instantiated in many different ways. The background of an image, for example, could take on one out of a selection of colours, while the foreground of that image could be a toy, an animal, or a human being. As you go through the data set, you notice that most of the time the background is blue and the foreground exhibits a toy - not always, but most of the time. Now, someone hands you a new image. The background is blue, but the foreground is covered. You are asked to predict what is under the cover and, based on your past experience, you announce that there will likely be a toy in the foreground. The cover is removed and, lo and behold, your prediction is confirmed. For all the simplicity of the example, this is how correlation, and predictions based on correlation, work. Data is surveyed. A function associating the appearance of a value of one attribute with the appearance of a value of another attribute is constructed, and then the function is used to make predictions.
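To make the mechanics concrete, here is a minimal sketch of that procedure in Python. The toy data, the attribute names, and the prediction rule (pick the foreground value that has co-occurred most often with the observed background) are illustrative assumptions, not a description of any particular product.

```python
from collections import Counter, defaultdict

# Toy data set: (background colour, foreground object) pairs.
observations = [
    ("blue", "toy"), ("blue", "toy"), ("blue", "animal"), ("blue", "toy"),
    ("green", "human"), ("blue", "toy"), ("green", "animal"), ("blue", "toy"),
]

# Survey the data: for each background value, count how often
# each foreground value appears alongside it.
co_occurrence = defaultdict(Counter)
for background, foreground in observations:
    co_occurrence[background][foreground] += 1

def predict_foreground(background):
    """Predict the foreground value most frequently correlated with this background."""
    counts = co_occurrence.get(background)
    if not counts:
        return None  # no past observations, so no prediction
    return counts.most_common(1)[0][0]

print(predict_foreground("blue"))  # 'toy' - the covered-image prediction above
```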
In the realm of AIOps, predictive analytics is often discussed as if it were the holy grail of the industry. Many discussions are had as to whether or not one vendor or another is able to provide ‘true’ predictive analytics, and debates about the meaning of ‘prediction’ are staples of IT operations management-themed conferences. These discussions are, however, often poorly framed. The truth of the matter is that most commercial technologies in this space have poor success rates, predicting events that never occur or failing to predict events that do, in fact, occur. Rough estimates that I obtained during my time as an analyst and then as an AIOps vendor CTO showed that of all the alerts generated by a typical AIOps platform, only 25% were accurate. False positives (predictions of events that never materialised) constituted nearly 75% of the total, while many events that did occur in the relevant time period were missed entirely by the platform’s algorithms. This abject failure, however, is not a consequence of the ‘paradigm’ itself (statistical correlation is how prediction is done); it is, instead, the result of consistently poor execution across the industry. In other words, our technologies already have ‘true’ predictive capability. They just do not work very well.
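To put those rough figures in the standard evaluation terms, here is a small sketch. The counts are hypothetical and chosen only to match the proportions cited above; the false-negative count in particular is purely illustrative, since no number for missed events was given.

```python
# Hypothetical alert counts, chosen to illustrate the rough proportions above.
true_positives = 250    # alerts that corresponded to real events
false_positives = 750   # alerts for events that never materialised
false_negatives = 400   # illustrative count of real events the platform missed

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)

print(f"Precision (share of alerts that were accurate): {precision:.0%}")   # 25%
print(f"Recall (share of real events that were alerted on): {recall:.0%}")  # ~38%
```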
At this juncture, it should be noted that there is a large school of thought centred in the business strategy theorist community (as opposed to the IT practitioner or computer science academic communities) that interprets all of commercial AI, at least, as being about prediction. Titles like ‘Prediction Machines’ by Agrawal, Gans, and Goldfarb are typical of this school. Interestingly, this kind of thinking matches up neatly with an increasingly popular school of thought in the world of cognitive neuroscience which claims that all cognitive processes, even perception, are actually predictions followed by adjustments made to minimise the error of future predictions. What both schools are saying, in their own distinctive ways, is that no matter how complex the algorithm or how exotic the data structure, in the end, AI is mostly about predictions based upon established correlations.
I actually think that, however insightful, both schools of thought are wrong. There are very important reasoning processes that cannot be reduced to cycles of prediction, error correction, and further prediction. In fact, as will be discussed below, causal analysis is one such process. But what these remarks should indicate is that prediction is not the problem. Prediction comes, in a sense, for free from the most basic form of machine learning from data: correlation.
It might be replied, however, that, even if there are other types of reasoning processes, prediction (assuming that good correlations are made from good data) is precisely what is required for AIOps and Observability. Let us see why that is definitely not the case.
Let us say that three events occur in succession: A, B, and C. Furthermore, every time an event like C has been observed, it has been preceded by A and B in just that order. So can we assume that B causes C? Of course not! It could very well be the case that A causes both B and C, but B does not cause C. B is highly - indeed perfectly - correlated with C, but it does not cause C.
Let’s unpack what this means. Assume that C is a kind of event that we do not want to have occurring - an actual outage, let’s say. Now, correlation allows us to use the occurrence of B as a predictor of C. We are notified that B has occurred and we can prepare the world to deal with the consequences of C occurring. That is a good thing without a doubt. But what if we want to stop C from occurring? If we were to take action to put some kind of ring fence around B so that its impact would not be felt in other parts of the system being monitored, that would not prevent C from occurring. On the other hand, if we acted on A in some way, we might stand a chance of preventing C. Moreover, looking towards the future, by blocking A from occurring in the first place, we can ensure that C does not occur. (We will also ensure the non-occurrence of B, but that is of no interest to us in this context.) In other words, for our purposes, the difference between a causal relationship between two events - A and C, in this case - and a correlational relationship between two events - B and C, in this case - is that the causal relationship shows how we can intervene in a system to change an outcome, whereas the correlational relationship simply allows us to predict an outcome. We can summarise this in a slogan that can also serve as a guide for evaluating software that purports to provide the user with ‘root cause analysis’ - there is no causation without the possibility of intervention.
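To make the distinction tangible, here is a minimal simulation in Python. The causal structure (A causes both B and C; B causes nothing) and the 30% event rate are assumptions made purely for illustration, not claims about any real system.

```python
import random

def run(trials=10_000, block_a=False, block_b=False):
    """Simulate a world in which A causes both B and C, and B does not cause C."""
    outcomes = []
    for _ in range(trials):
        a = (random.random() < 0.3) and not block_a  # A fires roughly 30% of the time
        b = a and not block_b                        # B occurs only because A occurred
        c = a                                        # C also occurs only because A occurred
        outcomes.append((a, b, c))
    return outcomes

baseline = run()
# Perfect correlation: in the undisturbed system, C occurs exactly when B occurs.
assert all(b == c for _, b, c in baseline)

# Intervention 1: ring-fence B. C still occurs just as often as before.
print(sum(c for *_, c in run(block_b=True)))  # roughly 3,000 occurrences of C

# Intervention 2: block A. C never occurs.
print(sum(c for *_, c in run(block_a=True)))  # 0 occurrences of C
```

Run it a few times and the pattern is stable: suppressing the correlate B changes nothing about C, while suppressing the cause A eliminates it. That is the what-if test a genuine ‘root cause analysis’ feature has to pass.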
There are further subtleties that a full account of causation needs to take into account. For example, A may cause C but may also block another event that would itself cause C if A were prevented (since the blocker would no longer be active). Another complication is the possibility that a causal relationship is probabilistic. It makes perfect sense to say that A causes C, say, 75% of the time, and once one opens up that avenue of consideration, questions of how to combine the probabilities of various causes to get the probability of a resulting event need to be addressed. The critical element, however, that runs through any account of causality - no matter how complex it gets - is the recognition that causality is a modal or adverbial concept: it tells us in what way a connection is operative. Correlation, by contrast, is an extensional concept: it only tells us that a connection is operative. Put another way, to demonstrate causality one must be able to apply some kind of what-if or counterfactual operation to the network of correlations one has at hand.
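On the probabilistic point, one common (though by no means the only) way to combine independent probabilistic causes is a noisy-OR model: the effect occurs if at least one of its active causes succeeds in producing it. The sketch below, with made-up probabilities and an assumed second cause D, is only meant to show the arithmetic.

```python
# Assumed, illustrative figures: A produces C 75% of the time, D produces C 40% of the time.
p_c_from_a = 0.75
p_c_from_d = 0.40

# Noisy-OR combination (assumes the two causes act independently):
# C fails to occur only if both causes fail to produce it.
p_c_when_both_occur = 1 - (1 - p_c_from_a) * (1 - p_c_from_d)
print(f"P(C | both A and D occur) = {p_c_when_both_occur:.2f}")  # 0.85
```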
Stay tuned for part 2 where we will discuss implementation issues and make some predictions about the market.