For the past 20 years, the various stakeholder communities that together constitute the IT industry have attempted to address sustainability. The original efforts grew out of the realisation that, even as far back as 2005, the hardware and software underlying the digital world were responsible for approximately 5% of overall energy consumption, and that both the percentage and the absolute amount of energy required were growing at double-digit rates. Whether their concern was cost or carbon, economic and political decision-makers agreed that this was a situation that needed to be addressed.
General technology trends were not helping. It was not just that governments, businesses, and users were becoming more digital; architectures, programming languages, and software development practices encouraged ever more profligate levels of energy consumption. Decades of Moore’s Law had convinced hardware manufacturers that worrying about energy costs was a fool’s game, since the arc of history bent towards ever lower levels of energy being required to perform a given operation. The emergence of the Web and then the Cloud led simultaneously to a massive increase in energy consumption at the edge of the global IT infrastructure and to the concentration of compute power into vast server farms scattered across the surface of the planet. Furthermore, general acceptance of modular architectural principles meant that these server farms bought the ability to scale on demand at the price of the excess energy consumption that comes with a high level of redundancy. Finally, while the first generation of higher-level programming languages was designed to match the operational possibilities of Von Neumann architectures precisely in order to minimise energy consumption and hence costs, growing pressure from business to accelerate the software development cycle encouraged language designs that made it easy to write algorithms and let energy consumption considerations be damned. The net result is that now, in 2023, the global IT infrastructure accounts for approximately 30% of global energy consumption, and the growth has not tailed off one bit. (To make matters worse, Moore’s Law appears to have ceased to be operative about 10 years ago.)
Of course, given the current state of technology, energy consumption, problematic in itself from a cost perspective, correlates closely with carbon emissions. Carbon is the metric by which governments, investors, and activists have chosen to measure the likelihood and intensity of a climate apocalypse, so the need to measure, report on, and analyse carbon emissions has become an overriding imperative. And given that energy consumption is probably the best proxy a business has for determining how much carbon its IT habits are injecting into the atmosphere, demand is beginning to emerge for technologies and practices that systematically observe, analyse, and allow for the management of IT-related energy consumption.
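To make the proxy relationship concrete, a common (though simplified) conversion multiplies the energy consumed by the carbon intensity of the electricity grid supplying it. The sketch below is purely illustrative; the intensity figure is an assumption, not a value from the text, and real intensities vary by grid and by hour.

```python
# Illustrative sketch: converting measured energy consumption into an
# estimate of operational carbon emissions. The carbon-intensity figure
# is an assumed average for illustration only.

GRID_CARBON_INTENSITY_KG_PER_KWH = 0.4  # assumed grid intensity (kgCO2e per kWh)

def estimated_emissions_kg(energy_kwh: float,
                           intensity: float = GRID_CARBON_INTENSITY_KG_PER_KWH) -> float:
    """Estimate carbon emissions (kgCO2e) from energy consumption (kWh)."""
    return energy_kwh * intensity

# Example: a service that consumed 1,200 kWh over a reporting period.
print(estimated_emissions_kg(1200))  # -> 480.0 kgCO2e under the assumed intensity
```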
Standards-body-sanctioned measures of data centre energy consumption have been available for more than a decade, but in a world of modular, dynamic, cloud-based, and ephemeral architectures, the salience of those measures is highly questionable. Instead, what businesses and governments need is a set of measures or indicators based on how the utilisation of an entire digital service contributes to the accumulation of carbon in the atmosphere. Interestingly enough, the technologies and practices that have come onto the market to observe, analyse, and support the remediation of digital services from an operational performance and availability perspective can, with minor modifications, be used to monitor the energy consumption and carbon emissions attributable to the holistic behaviour of a digital service.
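As an aside, the best known of those facility-level measures (not named in the text) is Power Usage Effectiveness, the ratio of total facility energy to the energy delivered to IT equipment. The sketch below shows why such a metric says nothing about any individual digital service; it is an illustration, not part of any standard's reference implementation.

```python
# Illustration of a facility-level measure such as PUE (Power Usage
# Effectiveness): total facility energy divided by IT equipment energy.
# A value of 1.0 would be perfect; the ratio describes the building,
# not the behaviour of any digital service running inside it.

def pue(total_facility_energy_kwh: float, it_equipment_energy_kwh: float) -> float:
    return total_facility_energy_kwh / it_equipment_energy_kwh

print(pue(1_500_000, 1_000_000))  # -> 1.5
```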
Let’s do a quick review of the basic observability technology design plan: telemetry in the form of transaction traces, metrics, and logs (supplemented, where deployed, by code profiling) is ingested from every layer of the stack; noise reduction, correlation, and causal analysis are applied to that data; and alerts are raised when predefined thresholds are transgressed.
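A purely illustrative skeleton of that design plan follows; every name in it is hypothetical and stands in for no particular product's API.

```python
# Illustrative skeleton of the generic observability design plan:
# ingest telemetry, reduce noise, correlate, and alert on thresholds.
from dataclasses import dataclass

@dataclass
class Signal:
    source: str       # component that emitted the trace span, metric, or log line
    kind: str         # "trace" | "metric" | "log"
    value: float
    timestamp: float

def ingest(raw_events: list[dict]) -> list[Signal]:
    return [Signal(**event) for event in raw_events]

def reduce_noise(signals: list[Signal]) -> list[Signal]:
    # Placeholder heuristic: drop data points that carry no information.
    return [s for s in signals if s.value != 0.0]

def correlate(signals: list[Signal]) -> dict[str, list[Signal]]:
    # Group signals by originating component as a stand-in for correlation.
    groups: dict[str, list[Signal]] = {}
    for s in signals:
        groups.setdefault(s.source, []).append(s)
    return groups

def alert(groups: dict[str, list[Signal]], threshold: float) -> list[str]:
    # Return the components whose aggregated signal breaches the threshold.
    return [src for src, sigs in groups.items()
            if sum(s.value for s in sigs) > threshold]
```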
So how do we tweak this design plan to serve the goal of sustainability?
A digital service is essentially the ability to execute algorithms in response to events originating outside the service, and the execution of an algorithm is nothing more than a sequence of state changes, each of which consumes a certain amount of space (actual physical space) and takes a certain amount of time. The analysis of algorithms with regard to how much space and time they require for computation is a core computer science topic which, although many mysteries remain (the famous P=NP problem, for example), has been thoroughly worked over by both the academic and industrial research communities.
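As a concrete, if simplified, illustration of what the space and time consumed by an algorithm means in practice, the sketch below uses Python's standard time and tracemalloc modules to measure both for an arbitrary function. It illustrates the idea only; real instrumentation works at the level of traces rather than single function calls.

```python
import time
import tracemalloc

def measure_space_and_time(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_bytes).

    A simplified stand-in for the kind of space/time accounting the
    text describes.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes

# Example: sorting a million integers.
_, seconds, peak = measure_space_and_time(sorted, range(1_000_000, 0, -1))
print(f"time: {seconds:.3f}s, peak memory: {peak / 1e6:.1f} MB")
```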
Only rarely, however, has that work exploited the fact that energy consumption correlates almost exactly with the combined consumption of space and time. If one could track those two trajectories of consumption in real time, one would have a pretty good sense of the amount of energy being consumed (and the carbon being emitted) as it happens. Furthermore, tracking these space and time units in the context of the algorithm itself obviates the need to examine closely what is going on at the various architectural layers that support its execution.
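A rough sketch of how tracked space and time might be converted into an energy estimate is shown below. The per-resource power coefficients are illustrative assumptions, not measured values or anything prescribed by the text; in a real deployment they would come from hardware vendors or direct measurement.

```python
# Rough sketch: estimating energy from tracked space and time consumption.
# The power coefficients below are assumptions chosen for illustration.

CPU_WATTS_PER_CORE = 15.0   # assumed active power per busy CPU core
DRAM_WATTS_PER_GB = 0.4     # assumed power per GB of resident memory

def estimated_energy_kwh(cpu_core_seconds: float, gb_seconds: float) -> float:
    """Estimate energy (kWh) from time (core-seconds) and space (GB-seconds)."""
    joules = (cpu_core_seconds * CPU_WATTS_PER_CORE
              + gb_seconds * DRAM_WATTS_PER_GB)
    return joules / 3_600_000  # 1 kWh = 3.6 million joules

# Example: a request that used 0.8 core-seconds of CPU and held
# 2 GB of memory for 0.5 seconds.
print(estimated_energy_kwh(cpu_core_seconds=0.8, gb_seconds=2 * 0.5))
```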
Almost every piece of data captured and analysed by an observability platform of the kind just described contributes to a detailed picture of how an algorithm executing within the context of a digital service consumes space, time, and hence energy. At the core of this measurement process is the ingestion of transaction traces. A transaction trace is, after all, nothing more than the path of an algorithm as its execution moves from component to component in a system. Furthermore, if code profiling is deployed, the execution steps within a component can likewise be captured. The trace itself tells us how much time is involved, while logs and metrics can fill out the spatial consumption picture. Noise reduction, correlation, and causal analysis are as relevant to energy consumption analysis as they are to the analysis of performance and availability. In fact, there are, in the end, only two major tweaks required to convert an observability technology into a sustainability technology. First, accumulated total space and time consumption must be reported continuously, or at predefined intervals, alongside whatever other data is displayed; and second, the concept of an alert must be extended to the transgression of energy consumption thresholds. Otherwise, it is less a question of new technology than of a new way of thinking about technology already deployed.
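A minimal sketch of what those two tweaks might look like follows. The span fields, coefficients, threshold, and reporting interval are all assumptions made for illustration and do not describe any particular vendor's schema.

```python
# Illustrative sketch of the two tweaks described above: (1) accumulate
# space/time-derived energy per reporting interval alongside other
# telemetry, and (2) alert when an energy consumption threshold is
# transgressed. All names and values are assumptions.
from dataclasses import dataclass

CPU_WATTS_PER_CORE = 15.0   # assumed coefficients, as in the earlier sketch
DRAM_WATTS_PER_GB = 0.4

def estimated_energy_kwh(cpu_core_seconds: float, gb_seconds: float) -> float:
    joules = cpu_core_seconds * CPU_WATTS_PER_CORE + gb_seconds * DRAM_WATTS_PER_GB
    return joules / 3_600_000

@dataclass
class Span:
    service: str
    cpu_core_seconds: float   # "time" consumed by this step of the trace
    gb_seconds: float         # "space" held while the step executed

ENERGY_ALERT_THRESHOLD_KWH = 0.5   # assumed per-interval energy budget

def interval_report(spans: list[Span]) -> dict[str, float]:
    """Accumulate estimated energy (kWh) per service for a reporting interval."""
    totals: dict[str, float] = {}
    for span in spans:
        totals[span.service] = totals.get(span.service, 0.0) + estimated_energy_kwh(
            span.cpu_core_seconds, span.gb_seconds)
    return totals

def energy_alerts(totals: dict[str, float]) -> list[str]:
    """Return the services whose accumulated energy breached the threshold."""
    return [svc for svc, kwh in totals.items() if kwh > ENERGY_ALERT_THRESHOLD_KWH]
```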
Splunk, a leader in Gartner’s recent APM and Observability MQ, has built an observability platform that is effectively infinitely scalable and designed from the ground up to handle all of the data streams and analytics central to sustainability. Furthermore, the technology’s flexibility and ease of use mean that practitioners with a background in energy and carbon emission management should be able to master the relevant elements of Splunk functionality quickly. As sustainability becomes increasingly central to the evaluation of business performance, Splunk provides tools that assist organisations in their efforts to monitor their environmental sustainability.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. It offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.