“You’re living in the past, man! You’re hung up on some clown from the 60s, man!”
- Eric The Clown
Application Performance Monitoring (APM) as a discipline and as a collection of supporting technologies has evolved rapidly since a distinct recognisable market for APM products first emerged in the 2007 - 2008 time frame. While there are many who would argue that APM has mutated into or been replaced by Observability, it makes more sense to see APM as one of many possible use cases now able to exploit the functionalities that Observability brings to the table - particularly when combined with AI. In fact, the failure to distinguish APM from Observability has, unfortunately, allowed vendors of more archaic forms of APM products and services to ‘Observability-wash’ their offerings. After all, if APM has evolved into Observability then, by a marketing sleight of hand, everything that was APM last year is now Observability, almost by definition!
Of course, the main problem here is not the fact that vendors are making dubious claims; it is, instead, that there is genuine confusion in the user community, which leads to the inappropriate deployment of technology and a consequent undermining of an enterprise’s ability to deliver business value to its customer base. To dispel some of this confusion, let us try to get clear on the actual relationship between APM and Observability and articulate a set of questions one can ask in order to determine whether a prospective solution provides APM genuinely enhanced by Observability or an older form of APM dressed up in the language of Observability.
It is possible to analyse the history of APM into four distinct stages. You might say that APM has endured four paradigm shifts, if you like that terminology. The first stage or paradigm could be called ‘proto APM.’ As digital business and customer-facing applications began to assume importance in the noughties (and became a source of anxiety in the wake of the 2007 contraction), businesses demanded that IT focus on application as well as infrastructure performance and availability. At the time, specialised APM tooling was not available, so the vendors of the day, in trying to meet customer demand, repurposed existing event management platforms dating back to the 1960s in terms of basic design and capabilities. Although the object of concern was now the application portfolio rather than the server or network infrastructure, the Big Four (IBM, HP, CA, and BMC, as they were collectively called at the time) simply extended their infrastructure-oriented ‘Frameworks’. Now, it is important to remember that the infrastructures of the day were topologically static, architecturally monolithic, and behaviourally predictable. Unanticipated events were rare and invariably an indication that something had gone terribly wrong, solely by virtue of the fact that they were unanticipated. Hence, the availability and performance systems built to deal with such infrastructures relied heavily on relatively simple predefined (and hence unchanging) models, sampling, and spotting the rare exception. Applications were, it was believed, only mildly more volatile and modular than the infrastructures they ran on, so targeting them with the Framework approach seemed like a plausible way forward. It should also be noted, however, that despite similarities in structure and behaviour, there were sharp boundaries between the application realm and the infrastructure realm, and this will become an important driver of our history going forward.
Users and vendors shifted to the second stage or paradigm during the 2012/2013 time frame, when the first technologies designed from the bottom up specifically to monitor and manage end-to-end application behaviour came to market. Hence, we can justly call this stage ‘APM 1.0’. The great migration to the cloud had begun across industries in North America, and at a much slower although equally sure pace in EMEA and the Asia Pacific region, and, while infrastructure management was initially seen as one of the main tasks of would-be cloud service providers, the management of cloud-based applications was widely considered to be the proper preserve of the businesses themselves. This emerging division of labour was, of course, made possible by the sharp lines separating infrastructure and application mentioned above. In fact, it was that division of labour that made the cloud move palatable to the many enterprises that had concerns about loss of control over the increasingly central digital channel through which they interacted with their customers.
Once in the cloud, however (or in anticipation of moving to the cloud), application architectures started to change shape. At first, they more or less retained the monolithic, topologically rigid features of on-premises applications, but gradually they became more modular, distributed, and topologically dynamic. As a result, not only did application behaviour become less predictable, but the ability to infer end user or customer experience on the basis of knowledge of state changes within the application degraded considerably. Consequently, the returns from proto-APM technologies diminished rapidly, at first for cloud-resident applications but then, as cloud-inspired architectures came to dominate the developer mindset, across the corporate application portfolio.
In response to the changing requirements, a spontaneous collaboration among users, industry analysts, and entrepreneurs resulted in the definition/proclamation of a ‘five-dimensional model’ for APM. In order to deal with the realities of cloud-driven application architectures, APM products needed to support and loosely coordinate five distinct types of functionality:
1. End user experience monitoring;
2. Discovery and display of the logical topology of the application;
3. Deep-dive monitoring of application components, typically via byte code instrumentation of application servers;
4. Transaction tracing across components;
5. Analytics applied to the data gathered by the other four dimensions.
(It is worth mentioning that, despite the novelty of the five functional types, the underlying technology closely resembled that of the Big Four’s repurposed event management systems - data was sampled and packaged into short event records that were compared against a predefined model, and alerts were generated following a mismatch between model and event.)
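To make that resemblance concrete, here is a minimal sketch of the predefined-model approach, written in Python with invented metric names and thresholds rather than those of any actual product: sampled readings are packaged into short event records and checked against a static model, and an alert is raised on any mismatch.

    from dataclasses import dataclass
    import random
    import time

    # The predefined 'model': static thresholds that never change at runtime.
    # Metric names and limits are invented for illustration.
    MODEL = {
        "cpu_utilisation_pct": 90.0,
        "heap_used_pct": 85.0,
        "response_time_ms": 2000.0,
    }

    @dataclass
    class EventRecord:
        timestamp: float
        metric: str
        value: float

    def sample_metrics():
        # Periodic sampling: package a handful of readings into short event records.
        now = time.time()
        return [
            EventRecord(now, "cpu_utilisation_pct", random.uniform(10, 100)),
            EventRecord(now, "heap_used_pct", random.uniform(20, 95)),
            EventRecord(now, "response_time_ms", random.uniform(50, 3000)),
        ]

    def check_against_model(events):
        # Generate an alert whenever an event record does not fit the model.
        for ev in events:
            limit = MODEL.get(ev.metric)
            if limit is not None and ev.value > limit:
                print(f"ALERT: {ev.metric}={ev.value:.1f} exceeds predefined limit {limit}")

    if __name__ == "__main__":
        check_against_model(sample_metrics())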
Following the dictates of the five-dimensional model, a new crop of vendors (most notably AppDynamics, New Relic, Dynatrace, and, eventually, Datadog) brought APM 1.0 products to market and, in very short order, displaced the Big Four in enterprises across the global economy.
APM 1.0 proved to be extraordinarily successful, so that, by 2015, approximately a quarter of all enterprise applications in North America, cloud-resident or not, were being monitored by APM 1.0 products, and global spending on APM cracked the $10 billion mark. But applications did not stop changing shape, and the very success of the five-dimensional model turned it into a straitjacket for many users, particularly those charged with application development. Their frustrations with the products provided by the market and their creativity in seeking end runs around APM 1.0’s shortcomings led to the emergence of a new paradigm or stage: APM 2.0.
First, let us take a look at what was happening with regard to application evolution; it will then be easy to see why DevOps practitioners first, and then many IT Operations practitioners as well, staged a revolt against the five-dimensional model. The rate of application modularisation increased as object orientation gave way to a more eclectic approach that, while not abandoning objects, introduced more and more functional constructions, wreaking havoc on the typing conventions of languages like Java. The focus shifted to the concept of micro-services - relatively small packages of functions and methods that interacted with one another in a loosely coupled manner against an ever-changing topological backdrop. The components themselves had ever shorter lifetimes, and applications were expected to be continually modified by development teams, with the number of changes increasing by an order of magnitude in many large enterprises. Finally, once a microservice frame of mind had been adopted, the rigid borders between application and infrastructure began to break down. Yes, there were micro-services closer to the ‘surface’, where the application interacted with users or other applications, which would unquestionably be recognised as ‘application-like’ components, but those services called upon other services below the ‘surface’, and those services in turn called yet more deeply placed services. At what point did these components become infrastructure components? And remember, they are all changing all the time, flitting in and out of existence with life spans, in many cases, measured in microseconds.
Returning to the five-dimensional model, it is now easy to see why its attraction faded rapidly. Its call for logical topologies delivered little value when topologies would shift structure in seconds. Deep dives based on byte code instrumentation provided a very limited perspective on application behaviour. They might provide insight into what was happening within a micro-service (if that service was ‘big enough’ to support the invasive instrumentation required), but when most of the application ‘action’ was taking place in the spaces between micro-services, where messages were being passed, they were more likely to generate oversights than insights when it came to understanding end-to-end application behaviour. Transaction tracing functionality (dimension 4 of APM 1.0) might have come to the rescue here, but popular implementations of this functionality involved the injection of rather large tokens into the code base, piling an invasive procedure on top of the invasiveness already required for deep-dive application server monitoring, so that, by 2015, this type of functionality was rarely deployed in practice. Finally, the analytics that came packaged with most APM suites were woefully inadequate to the vast explosion in the size and complexity of the self-descriptive data sets generated by applications and harvested by the APM tooling.
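For concreteness, the following is a minimal sketch, in Python and using only a hypothetical header name rather than any particular product’s convention, of the context propagation that transaction tracing depends on: a correlation token is minted at the edge and copied onto every downstream call so that the records emitted by each service can later be stitched into a single trace.

    import uuid
    import urllib.request

    # Hypothetical header name; real tracing systems each define their own.
    TRACE_HEADER = "X-Trace-Id"

    def start_trace():
        # Mint a correlation token at the edge of the system.
        return uuid.uuid4().hex

    def call_downstream(url, trace_id):
        # Copy the token onto every downstream call so that the records emitted
        # by each service can later be stitched together into a single trace.
        request = urllib.request.Request(url, headers={TRACE_HEADER: trace_id})
        with urllib.request.urlopen(request) as response:
            return response.read()

    # Each service logs its work keyed by the same trace_id; a tracing backend
    # joins those records after the fact to reconstruct the transaction path.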
Indeed, for both DevOps and IT Ops practitioners it became more and more apparent that APM should be treated as a big data problem and, as a consequence, these communities began to turn to technologies that gave them direct access to the underlying telemetry, without layers of intervening functionality intended to provide contexts for that telemetry. If a developer could easily capture and display all of the metrics or logs generated by an application, he could, perhaps with the support of effective visualisation, make sense of what was happening and spot troubling anomalies. Furthermore, since we were just talking about streams of telemetry, there were no concerns about topologies getting outdated or even about the application/infrastructure divide. Faced with an amorphous, ever-shifting population of interacting micro-services and functions, practitioners could just follow wherever the data led them.
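As an illustration of this raw-telemetry style of working, here is a minimal sketch in Python, operating on a synthetic metric stream rather than any particular product’s data, that flags readings falling more than three standard deviations from a rolling baseline - the kind of simple check a practitioner might apply directly to ingested metrics.

    from collections import deque
    from statistics import mean, stdev
    import random

    def detect_anomalies(stream, window=60, k=3.0):
        # Yield (index, value) for readings far outside a rolling baseline.
        history = deque(maxlen=window)
        for i, value in enumerate(stream):
            if len(history) >= 10:  # wait until a minimal baseline exists
                mu, sigma = mean(history), stdev(history)
                if sigma > 0 and abs(value - mu) > k * sigma:
                    yield i, value
            history.append(value)

    if __name__ == "__main__":
        # Synthetic latency stream (milliseconds) with occasional spikes injected.
        stream = [random.gauss(200, 20) if random.random() > 0.01 else 2000.0
                  for _ in range(1000)]
        for i, v in detect_anomalies(stream):
            print(f"anomalous reading at sample {i}: {v:.0f} ms")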
And so, without the sanction of analysts or vendors, by 2016 a new stage in the history of APM had been reached: an APM 2.0 dominated by big data capture and analysis technologies specialised according to different types of telemetry, most usually metrics or logs, and by the growing perception on the part of users that vendors focused on metrics and visualisation (e.g., Grafana) or on logs (e.g., Splunk) should play a central role in the management of applications. (It is important to stress that the rise of APM 2.0 was almost entirely a user-driven paradigm shift. Many of the providers of APM 2.0 technology only recognised the role that their products were playing in retrospect, if at all.)
If the APM 2.0 vendors were blissfully unaware of the revolution they were enabling, many of the APM 1.0 vendors were painfully aware of what was taking place on the ground. Fearing a reenactment of the almost overnight market shift that had allowed them to displace the Big Four, they began to patch telemetry ingestion and visualisation capabilities into their APM product portfolios. As a result, many of the APM 1.0 players morphed into what might be called an ‘APM 1.5’ status, offering their users what remained a fundamentally five-dimensional technology with some metrics and log management capability attached at the edges.
Terminology began to shift at this point, and users, opinion makers, and some vendors revived an old Optimal Control Theory term - Observability - to describe what they were doing with their telemetry ingestion and visualisation software. And there was definitely justice in using language that did not suggest that their activities were restricted to applications. As indicated above, the line between application and infrastructure had become blurred. Furthermore, cloud-based infrastructures themselves were becoming increasingly modular, dynamic, distributed, and ephemeral, while users increasingly took back control of infrastructure configuration and management. In other words, infrastructure also had come to require (and receive) a big data-style treatment.
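For reference, the standard control-theoretic definition being borrowed (not anything specific to APM tooling) can be stated as follows. For a linear time-invariant system with internal state $x$, input $u$, and measured output $y$,
\[
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t),
\]
the system is observable if and only if the observability matrix
\[
\mathcal{O} = \begin{bmatrix} C \\ CA \\ CA^{2} \\ \vdots \\ CA^{\,n-1} \end{bmatrix}
\]
has full rank $n$; that is, the internal state can be reconstructed from the outputs alone. The analogy practitioners draw is that a software system is ‘observable’ to the extent that its internal condition can be inferred from the telemetry it emits.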
The pandemic arrived and, while many aspects of the economy slowed down, the evolution of APM did not. DevOps and IT Ops practitioners were now wallowing in huge volumes of telemetry unfiltered by finicky APM (or infrastructure management) technologies and, while this was definitely seen as an improvement, the truth of the matter is that the need to understand and work with data sets so large and fine-grained brought a whole new array of challenges. First of all, the data sets proved resistant to sampling, due both to the uncoupled and rapidly changing nature of the underlying systems generating the data and to the high dimensionality of the data, which rendered many traditional statistical techniques ineffective. That meant that to work meaningfully with telemetry, one needed to work with ALL the telemetry available. Toil levels went through the roof on account of volume alone. Secondly, and perhaps more importantly, the volume and granularity of the data sets made it almost impossible for human beings to see the patterns governing the data and hence threatened to render the data sets useless for the more difficult problems pressing the practitioner.
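One way to see the point about sampling is a toy calculation - the error rate and sampling rate below are invented purely for illustration - showing how easily a sampled view loses the rare behaviour that matters most.

    # Toy numbers, invented purely for illustration.
    error_rate = 0.0005        # 0.05% of requests fail
    sample_rate = 0.01         # only 1% of telemetry is retained
    requests = 1_000_000

    expected_errors = requests * error_rate                   # ~500 failing requests
    expected_sampled_errors = expected_errors * sample_rate   # ~5 survive sampling
    p_single_failure_dropped = 1 - sample_rate                # chance any one failure is lost

    print(f"expected failing requests:            {expected_errors:.0f}")
    print(f"expected failures visible in sample:  {expected_sampled_errors:.0f}")
    print(f"chance a given failure is dropped:    {p_single_failure_dropped:.0%}")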
The solution to the intelligibility and toil issues came in two steps: first, the consolidation of metrics, logs, and traces into Observability platforms capable of ingesting and retaining telemetry at full fidelity; and, second, the application of AI and machine learning to that telemetry in order to surface the patterns human beings could no longer see for themselves and to automate away much of the toil.
In summary, then, when deciding on an APM technology adequate for modern environments, one should ask the following six questions:
An answer of yes to all of them will just about assure the practitioner that he is working with an APM 3.0 solution adequate to current and (at least near-future) requirements. An answer of no to any of them suggests that he is working with an APM 2.0 or, worse, an APM 1.5 solution, adequate to the legacy portion of an application portfolio but unlikely to meet the demands of a modern digital business.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.