In part 1, we looked at how the design of traditional monitoring technologies depended heavily on properties of the systems they were intended to monitor, and then showed how those properties began to be undermined by an increase in complexity, an increase that can ultimately be captured by the concept of entropy. In this part, we will explore how increased entropy forces us to rethink what is required for monitoring.
As entropy increases, the amount of information borne by each distinct information item increases, with the consequence that sampling becomes a liability when one is trying to understand what is actually going on within a system. Put simply, the more entropy a data set has, the more likely it is that the information items passed over by a sampling mechanism contain important dimensions of system configuration and behaviour. One could, of course, compensate by increasing the sampling mechanism’s degree of sophistication, but this tactic has two flaws. First, the increased sophistication of the sampling mechanism increases the cost and complexity of the monitoring technology, undercutting, if not completely eliminating, the advantages attributable to sampling in the first place. Second, high entropy means that almost every information item contains at least some non-predictable revelation, so even if sophistication reduces the amount of data discarded from, say, 90% to 50%, the discarded items may very well contain crucial information about issues in the digital environment. In summary, over the past 15 years, as the self-descriptive data sets generated by systems for the purposes of monitoring and observation have become more entropic, the efficacy of sampling has declined precipitously. In fact, in high entropy situations, sampling provides a completely misleading picture of system configuration and behaviour and is worse than useless as a basis for analysis.
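One way to make this concrete is to compare the Shannon entropy of an event stream with what a fixed-rate sampler throws away. The sketch below is purely illustrative (the synthetic streams and the 10% sampling rate are invented for the example): in the low-entropy stream sampling loses essentially nothing, while in the high-entropy stream it discards the vast majority of distinct information items.

```python
import math
import random
from collections import Counter

def shannon_entropy(events):
    """Shannon entropy (in bits) of the distribution of event types."""
    counts = Counter(events)
    total = len(events)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def distinct_items_lost(events, rate=0.1, seed=0):
    """Fraction of distinct event types that a uniform sampler never sees."""
    random.seed(seed)
    sample = [e for e in events if random.random() < rate]
    return 1 - len(set(sample)) / len(set(events))

# Low-entropy stream: a handful of event types repeated cyclically.
low = [f"heartbeat-{i % 5}" for i in range(10_000)]
# High-entropy stream: nearly every event is distinct (unique traces, ephemeral containers).
high = [f"span-{i}" for i in range(10_000)]

for name, stream in [("low entropy", low), ("high entropy", high)]:
    print(f"{name}: {shannon_entropy(stream):.2f} bits per item, "
          f"{distinct_items_lost(stream):.0%} of distinct items lost at 10% sampling")
```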
Let us elaborate on another point made earlier, one which touches more directly on the kinds of analysis that are possible and necessary for modern systems. Low entropy, as indicated, is a function of both the constitution of the data sets and the configuration and behaviour of the systems themselves. When configuration is repetitive and behaviour is cyclical, the occurrence of an anomaly is highly significant. An environment that is not expected to surprise anyone raises very loud alarms when something surprising does occur. In such a setting, anomaly detection moves to the forefront as an effective method for analysing data and a launchpad for important predictions. Furthermore, anomaly detection is a relatively easy class of algorithm to implement, so not only is it effective, it is also relatively inexpensive and certainly does not require much in the way of distinctive IP.
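To illustrate why this is cheap, here is a minimal sketch of the kind of detector that suffices in such a setting; the trailing-window z-score approach, the threshold, and the synthetic cyclical signal are all assumptions made for the example, not a description of any particular product.

```python
import math
import statistics

def zscore_anomalies(values, window=60, threshold=3.0):
    """Flag points that deviate strongly from the trailing window's mean.

    In a low-entropy, cyclical environment the baseline is stable, so
    anything this simple test flags really is a significant surprise.
    """
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9
        if abs(values[i] - mean) / stdev > threshold:
            anomalies.append((i, values[i]))
    return anomalies

# A synthetic cyclical metric (e.g. requests per second) with one injected spike.
signal = [100 + 10 * math.sin(i / 10) for i in range(500)]
signal[400] = 250

print(zscore_anomalies(signal))  # flags only the injected spike at index 400
```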
In a highly modular, dynamic, distributed, and ephemeral system, however, the nature and significance of anomalies change. First, anomalies themselves are not unusual. Indeed, if the data set is highly entropic then, in a sense, almost every data item signals an anomaly. Second, in this situation, the occurrence of an anomaly, in and of itself, does not serve as a signal or predictor of problems. Interestingly, these two issues track the two biggest complaints heard from DevOps and IT Ops practitioners regarding their monitoring tools: a) they seem to generate inordinately high volumes of alerts (most monitoring tools emit alerts when an anomaly is recognised, so a high anomaly frequency is bound to generate high alert volumes) and b) an inordinate percentage of alerts turn out to be false positives (in a modern system almost all behaviours contain elements of novelty, which does not, in and of itself, signal that something problematic is taking place now or will soon take place).
So with anomalies removed from their historical pride of place as the key source of alerts and signals of existing or pending outages, what can replace them? Let us look at how processing actually occurs in a highly modular environment. While a certain amount of computation and data transformation takes place within the various components that together constitute that environment, most of the execution of business logic occurs through the passing of messages from one component to another. Indeed, one can accurately visualise the execution as a branching flow of messages starting from one node and spreading out to other nodes, which in turn send an array of messages to multiple nodes beyond them. It is also possible for messages to be sent to nodes already passed through. This kind of structure of nodes and paths, where nodes can be visited multiple times during the course of the execution, is called a directed graph and has well-understood and analysable ‘topological’ as well as metrical properties.
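A hypothetical sketch of how such a graph can be assembled from observed message passes follows. The service names and the simple (source, target) message representation are invented for illustration; real trace data (OpenTelemetry spans, for example) carries the equivalent parent/child information.

```python
from collections import defaultdict

def build_call_graph(messages):
    """Build a directed graph (adjacency list) from observed message passes.

    Each message is a (source, target) pair. Nodes may appear many times,
    and later messages may return to nodes already visited earlier in the
    execution, so the result is a directed graph rather than a simple tree.
    """
    graph = defaultdict(set)
    for source, target in messages:
        graph[source].add(target)
        graph.setdefault(target, set())  # keep leaf nodes in the graph too
    return dict(graph)

# Illustrative execution: one request fanning out across services, including
# a message back to a node that has already been passed through (checkout -> cart).
messages = [
    ("gateway", "cart"), ("cart", "pricing"), ("cart", "inventory"),
    ("pricing", "checkout"), ("inventory", "checkout"), ("checkout", "cart"),
]
for node, targets in build_call_graph(messages).items():
    print(node, "->", sorted(targets))
```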
If an outage actually occurs, then the efficacy of such directed graphs is obvious. One finds the nodes where the outage has manifested itself (the manifestation usually taking the form of the node failing to emit any further messages or emitting them at a very slow rate) and then moves in the reverse direction along the graph until one gets to the first node that could be the source of the problem. Beyond that, however, one can do many different kinds of predictive analysis on directed graphs. For example, if a survey of history shows that a particular node tends to accept and emit more messages than others, it is likely to become a weak point and a long-term source of significant outages. Steps can then be taken in advance to distribute the message traffic more evenly across multiple nodes. And this is only the beginning. The new mathematical discipline of computational topology (or topological data analysis) has already been shown to have many fruitful applications to observability, as long as computations can be described in terms of directed graphs. In summary, the centrality of anomalies is being replaced by the centrality of analysing the metrical and topological properties of traces to generate directed graphs describing execution in a highly modular environment.
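As a sketch of both ideas (the graph, the node names, and the ranking heuristic below are illustrative assumptions, not a particular vendor’s algorithm): walking the graph in reverse from the node where the outage manifests yields the candidate sources, and counting messages handled per node points to likely weak points.

```python
from collections import Counter, defaultdict, deque

def candidate_sources(graph, failing_node):
    """Walk upstream from the failing node, nearest candidates first."""
    reverse = defaultdict(set)
    for source, targets in graph.items():
        for target in targets:
            reverse[target].add(source)
    seen, order, queue = {failing_node}, [], deque([failing_node])
    while queue:
        node = queue.popleft()
        for upstream in reverse.get(node, ()):
            if upstream not in seen:
                seen.add(upstream)
                order.append(upstream)
                queue.append(upstream)
    return order

def busiest_nodes(messages):
    """Rank nodes by total messages handled - candidates for long-term weak points."""
    load = Counter()
    for source, target in messages:
        load[source] += 1
        load[target] += 1
    return load.most_common()

# The same illustrative call graph as in the previous sketch, written out directly.
graph = {
    "gateway": {"cart"},
    "cart": {"pricing", "inventory"},
    "pricing": {"checkout"},
    "inventory": {"checkout"},
    "checkout": {"cart"},
}
messages = [
    ("gateway", "cart"), ("cart", "pricing"), ("cart", "inventory"),
    ("pricing", "checkout"), ("inventory", "checkout"), ("checkout", "cart"),
]

print(candidate_sources(graph, "checkout"))  # upstream candidates, e.g. ['pricing', 'inventory', 'cart', 'gateway']
print(busiest_nodes(messages))               # 'cart' and 'checkout' handle the most traffic
```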
Consider what this shift means for the five traditional dimensions of APM. The first dimension - end user experience monitoring - retains its importance. Indeed, if anything, it becomes even more important because the actual course of an end user’s experience is far more difficult to predict in a modular, dynamic, distributed, and ephemeral environment than it was in the more coarse-grained environments of 10-15 years ago. The second dimension, however - application topology discovery - has a value approaching zero, since it is unclear that the idea of a static topology (as opposed to a dynamically generated topology that tracks execution traces) means anything at all. The third dimension - deep dive byte code level tracking - retains some importance because critical events can still take place within the various components constituting the application, but since so much logic is executed via message passing, it is much less significant than before. It is, of course, the fourth and fifth dimensions - transaction tracing and analytics - that emerge as the most central to APM at this point in the history of commercial IT, and their importance is only likely to increase (and the relative importance of deep dive byte code based monitoring to diminish further) as we move from the hundreds of components that characterise a microservice-based application to the thousands of components likely to characterise most function-based serverless application architectures.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.