The Digital Operational Resilience Act (DORA) is an EU regulation that entered into force on 16 January 2023 and will apply as of 17 January 2025.
DORA harmonises the rules relating to operational resilience for the financial sector, applying to 20 different types of financial entities as well as to ICT third-party service providers. While press discussion (and even the press releases from the EU itself) has emphasised the security dimension of operational resilience, a close reading of the texts associated with the regulation demonstrates an equal if not greater focus on the need for intelligent (i.e. AI-enhanced) observability.
The term ‘Operational Resilience’ lies at the heart of DORA’s model of an optimal relationship between business and IT and of the challenges of functioning effectively as a financial services concern in a rapidly digitalising world. Discussions of operational resilience, however, have often been inconclusive, or worse, misleading, because they fail to acknowledge the extent of the transformation which society, business, and technology have undergone over the last five years, and particularly over the course of the Covid pandemic. The basic idea behind Operational Resilience is that private and public sector organisations engage in a periodically repeating sequence of operations but that, at times, disruptions occur either within the organisation itself or in the environment in which it operates. Such a disruption requires rapid and effective remedial action so that some kind of equilibrium can be restored and the normal operational sequence resumed. Put another way, a world without disruptions is treated as the norm and, since disruptions are rare, the structures, processes, and technologies that support and enable operations are optimally designed and deployed around the assumption that disruptions, for all intents and purposes, do not happen. To borrow terminology from the realm of machine learning, for an enterprise or a government function to design and deploy for frequent disruption would result in overfitted structures, processes, and technologies that may prove optimal for one circumstance but will likely crumble in the face of the next, novel disruption.
But what if disruption is not the exception? What if, on the contrary, trends across social, economic, and technological dimensions have converged to land our organisations in a world where disruption is not only a regular occurrence but for all intents and purposes continuous? In that case, does the notion of resilience even mean anything at all? If the central point of any effort is to ensure that when disruptions occur, equilibrium is restored, what happens to this effort when there is nothing that one can reasonably describe as an equilibrium?
Perhaps the answer is to abandon the term ‘Operational Resilience’ and its associated concepts altogether and simply meet each challenge as it comes, hoping for the best. We think not. Instead, we will argue that a modification rather than a rejection of these concepts is required and will christen the modified concepts with the term ‘Agile Resilience.’ It will be shown that there are indeed structures, processes, and technologies well tuned to environments in a state of continual flux and that they can be appropriately adapted to guide enterprises and governments as they seek to maintain resilience in the modern digital world. It will be further explained that these are derived from the design principles underlying modern data-intensive computing and communications systems, which in turn, increasingly, draw inspiration from human cognitive processes and their neural underpinnings. So, in a way, our organisations will come to exhibit ‘Agile Resilience’ the more they come to resemble ourselves as observing, analysing, and acting individuals required to deal with the fluctuating challenges of everyday life.
In what follows, we will first make the case that the world has passed through a kind of complexity singularity, with the result that disruption is not only a constant factor but that continual disruption of the surrounding environment is the only way in which a business or government can be successful. Our organisations, then, are not just victims of disruption. Insofar as they are successful, they promote disruption. Second, we will show that achieving resilience in a world of continuous disruption is primarily a matter of how an organisation modifies its ability to observe signals, analyse them, and respond to what has been observed. We will critique many of the observation, analysis, and response processes currently in place, making the case that, being designed for a world in which disruption was the exception, they simply do not work in a world where disruption is the norm. Third, and finally, we will propose a structure and a process, and suggest technologies, that will enable genuine agile resilience. Some suggestions on how to get started on the road to agile resilience will also be made.
It is generally conceded that the world has become more complex over the past ten years, but the concept of complexity is itself a fuzzy one. There are a number of mathematical notions that try to give the concept a more precise quantitative grounding. There is, for example, the idea of computational or time-space complexity, which classifies a process as complex if the rate at which it consumes time and/or space accelerates rapidly with the size of its input. There is also the idea of Kolmogorov or descriptive complexity, which judges the complexity of a process or a structure on the basis of the size of its shortest description. There are, of course, many more such ideas, but a thread which passes through all of them is the notion of predictability or, inversely, the inherent ability of a system to surprise anyone who is observing it. In general, the more complex a system, process, or structure is, the more difficult it is to infer the state of the whole from the state of a small number of its parts. That difficulty can be measured by time-space complexity or by descriptive complexity, but its most convenient measure is entropy: the degree to which an observer cannot predict the state of the parts not yet observed from the parts currently observed. So, if our discussion is on the mark, we can say that over the past ten years the entropy of the world, its sheer unpredictability from one moment to the next, from one place to the next, has increased significantly.
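To make the notion concrete, the short sketch below (our own illustration in Python, not drawn from the regulation or from any particular product) computes the Shannon entropy of two hypothetical streams of observed component states: the more evenly the observations spread across the possible states, the less any one observation tells us about the next.

```python
# Illustrative sketch only: Shannon entropy as a measure of unpredictability.
# The 'stable_system' and 'volatile_system' observation lists are hypothetical.
import numpy as np
from collections import Counter

def shannon_entropy(observations):
    """Bits of surprise per observation in a stream of discrete states."""
    counts = np.array(list(Counter(observations).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

stable_system = ["ok"] * 98 + ["degraded"] * 2                  # almost always the same state
volatile_system = ["ok", "degraded", "failed", "scaling"] * 25  # any state is equally likely next

print(shannon_entropy(stable_system))    # ~0.14 bits: low entropy, highly predictable
print(shannon_entropy(volatile_system))  # 2.0 bits: high entropy, hard to predict
```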
The pandemic itself is, of course, an example of the workings of this entropy. Whatever its origins, there is no doubt that the world’s interconnectedness drove its spread and its impact. There is also no doubt that, outside of think tanks that are paid to speculate on future events (and that, in the sheer multiplicity of their speculations, undermine their own predictive credibility), the idea that a pandemic would cause the world’s economy to be shut down overnight was not anticipated by most of those whose job it was to ensure resilience. Put another way, Covid is a prime example of what it means to live in a world that has attained current levels of complexity.
So what are these levels of complexity? There are three of them: 1) market complexity; 2) technological system complexity; and 3) data complexity. Let us look at each of them in turn.
Each level of complexity can be interpreted as a collection of causes that enhance the entropy of the overall environment faced by an organisation. With regard to the market, there are four such causes:
1) Almost all markets for goods and services have become increasingly specialised and fragmented. For multiple reasons, including what technology makes possible and the breakdown of traditional social structures, products and services are developed and delivered to ever smaller groupings of customers.
2) Markets have become increasingly dissociated from geography. While geography and local history retain some hold, particularly outside of Europe and North America, the appeal of a given product or service is increasingly a function of individual tastes and requirements that are independent of where the customers happen to be located and, at least until the pandemic, that location itself was becoming increasingly ephemeral.
3) Markets have become increasingly prone to sudden transformation by exogenous factors. This is a joint result of the ability of single events to rapidly spread their effects across the globe and the increasing power that governments and transnational entities exert over market forces. Thanks to both, drastic changes in market process and structure can occur almost overnight.
4) Markets are individually increasingly short-lived. This is arguably just another aspect of the first cause but, nonetheless, it deserves special notice as the prospect of a product capable of generating value for an enterprise for only short periods of time (say six months) promises to significantly reshape the way in which enterprises and governments invest capital and structure their delivery motions.
Information technology is indirectly responsible for many of the causes mentioned in the previous paragraph. However, it is important to note that, even as it has driven change in the world, even as it has increased market entropy and complexity, system architecture and attributes have themselves undergone a profound fourfold transformation that, in some respects, mirrors the changes in the market structure and process just described. In fact, one could argue that a feedback loop has been inadvertently put in place where the technology systems drive market transformation which, in turn, drives a similar technology system transformation and so on and so forth. With that noted, we can list the four causes of information technology system transformation:
1) Systems are becoming increasingly modular. Historically, IT systems or IT-driven operational systems were composed of a small number of elements, each monolithic in structure and usually highly differentiated from one another. The last ten years, however, have witnessed an extraordinary increase in the number of components out of which such systems are built. Furthermore, differentiation in the shape and design of the system components is diminishing: while the components are all coming to possess a similar range of behaviors, that range of behaviors is growing, and it is increasingly difficult to predict in advance which behavior out of that range any given component will choose to perform at any given time.
2) Systems are becoming increasingly distributed. Even before the cloud became almost omnipresent, the components out of which IT systems provided their services were distributed over larger and larger segments of the globe. With the cloud and GPS infrastructure, what the user or customer sees as a single integrated stack of functionality often in fact involves calls to elements not only situated in remote corners of the planet but circling the upper atmosphere as well.
3) The relationships among system components are changing at an ever-accelerating rate. Historically, the basic paths over which data generated by individual system components travelled were more or less rigid. In fact, it was usually possible to assume that an information technology system possessed an unchanging topology which served as a frame within which all computational and communication activities took place. Now, however, these paths of connection change so rapidly that the very concept of a fixed physical and logical message communication topology has begun to lose its descriptive utility.
4) Finally, and in some ways this is the most important cause of increased complexity at the level of IT and Operational Technology systems, the components themselves have become increasingly ephemeral. Whereas a virtual server may have had a lifetime measured in months or even weeks, modern systems are being built out of microservices and containers, a good many of which have lifetimes measured in seconds or less. (And that is not the end of the story: as the world transitions to serverless and function-based system architectures, one will need an atomic clock to measure the lifetime of at least some of the components out of which they are built.)
Managing resilience in the face of the mutually reinforcing behaviors of modern markets and modern information technology requires data about both. In fact, the increased entropy for which both are responsible makes data all the more important. Historically, when market dynamics were broadly predictable or, alternatively, when IT systems generally functioned in regular behavioral cycles, data was certainly helpful, but most of the heavy lifting was done by pre-defined models. These models may have been tested, from time to time, against sampled data, but the information driving interventions overwhelmingly came from inferences based upon static model properties. The modularity, distributedness, dynamism, and ephemerality of both markets and information technologies, however, mean that static models and sampled data are unlikely to serve as reliable guides to action. Therefore, the world has come to accept the need for vastly expanded data sets and has sought out new methodologies to replace those that rely on predefined models. This can be summarised as a shift from a model-driven to an evidence-driven way of coping with the world.
Data itself has evolved as well. This evolution has been a necessary consequence of the kinds of objects and situations (e.g., the markets and the IT and Operational Technology systems) about which the data must convey information. Nonetheless, the new data properties bring with them their own complexities and their own entropy, which managers of resilience must additionally take into account:
Gartner has captured three of the most significant new properties which modern data sets possess in a series of reports written over the past ten years.

1) Data sets have grown in volume. In fact, research carried out by one of the co-authors of this paper when he was at Gartner showed that the number of self-descriptive data types generated by markets and IT systems has grown by an order of magnitude every five years for the last five years.

2) Data sets have grown in variety. The number of distinct types of data whose instances populate our sources of information has multiplied and will likely continue to multiply as digitalisation reaches further and further into the daily lives of people and businesses.

3) The velocity with which data sets change has also increased appreciably. This is partly because the environment generating the data is itself changing ever more rapidly, as discussed above, but it is also partly due to an acceleration in the data-changing processes themselves.

So much for Gartner, but there is a fourth factor as well, a fourth V to be added to the list: vectoriality, or dimensionality. The number of dimensions, or the number of attributes that need specifying in order to pick out one piece of data rather than another, has likewise grown very rapidly, so that it is now not unusual to be dealing with data sets in which the number of dimensions exceeds the number of data items in the sample. Of course, this makes reading and understanding the data more difficult simply because there is more work to do, but it has another, more profound impact. Most statistical techniques traditionally deployed presuppose that the dimensionality or vectoriality of the data is small relative to the sample size. If that no longer holds, then those techniques yield nothing but noise, as the sketch below illustrates. Indeed, it is this last V that has driven the demand for machine learning and, more importantly, for the mathematical methodologies that underlie these computer-generated analyses.
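As a hedged illustration of this last point (a minimal sketch in Python with NumPy, using synthetic data rather than any real market or telemetry set), the snippet below shows how, once the number of dimensions exceeds the sample size, purely random features will appear strongly correlated with a purely random target, which is exactly the kind of noise that traditional techniques end up reporting.

```python
# Illustrative sketch: why classical statistics mislead when dimensions >> samples.
# All numbers below are synthetic; nothing is measured from a real system.
import numpy as np

rng = np.random.default_rng(seed=0)
n_samples, n_features = 50, 500                 # vectoriality exceeds sample size

X = rng.normal(size=(n_samples, n_features))    # random, meaningless "features"
y = rng.normal(size=n_samples)                  # random, meaningless "target"

# Pearson correlation of every feature with the target
correlations = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])

print(f"Strongest spurious correlation: {np.abs(correlations).max():.2f}")
# Typically prints a value in the region of 0.4 to 0.5 even though every signal
# here is pure noise, which is why sampling plus traditional correlation analysis
# breaks down at this dimensionality.
```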
Beyond the markets, the IT systems, and the data, there are the methodologies that have been brought to bear on restoring equilibrium in the face of disruption. These methodologies can be grouped under two headings and, as we shall see, the transformations discussed previously will force a change on each of them. The first set of methodologies involves the way in which an organisation captures data about its environment. As indicated above, these methodologies have tended to rely heavily on sampling and on traditional statistical techniques to ‘fill in the gaps’ in the data which are the inevitable result of sampling. Unfortunately, both the high degree of entropy in the underlying markets and systems and the high vectoriality of the data elements themselves mean that sampling and traditional statistics simply do not work. Hence, a new approach to resilience will require a new approach to data collection. The second set of methodologies involves the way in which an organisation adds context to and interprets the data it has collected. As discussed above, the main tools for interpretation are predefined models of an organisation’s environment against which the data collected is compared. If the data and the model match, then all is well, but if there is a mismatch then further investigation and model revision are required. Unfortunately, given the speed of change in both the data sets and the underlying realities, most predefined models are out of date before they are compared with any data whatsoever. The result is an almost continuous stream of mismatches which purport to indicate issues but are, in fact, just noise. Hence a new approach to resilience will require a new approach to model generation and interpretation.
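The contrast between the two sets of methodologies can be made concrete with a small sketch (again our own illustration, in Python with NumPy, on synthetic data and with hypothetical threshold values): a predefined static model keeps firing ‘mismatches’ once the environment drifts away from the assumptions baked into it, whereas a baseline continuously re-learned from recent evidence flags only genuine departures from current behaviour.

```python
# Illustrative sketch: static predefined model versus continuously re-learned baseline.
# The metric, the drift, and the thresholds are all synthetic assumptions.
import numpy as np

rng = np.random.default_rng(1)
# A metric whose 'normal' level drifts upward halfway through the observation window
signal = np.concatenate([rng.normal(100, 5, 500), rng.normal(140, 5, 500)])

# Methodology 1: a static model fixed when the mean was ~100 (alert above 115)
static_alerts = int(np.sum(signal > 115))

# Methodology 2: a rolling baseline that re-learns what 'normal' currently looks like
window, adaptive_alerts = 50, 0
for t in range(window, len(signal)):
    recent = signal[t - window:t]
    if abs(signal[t] - recent.mean()) > 3 * recent.std():
        adaptive_alerts += 1

print(static_alerts)    # hundreds of 'mismatches' that are really just drift (noise)
print(adaptive_alerts)  # far fewer, concentrated around the genuine change point
```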
So given the breakdown of traditional approaches to operational resilience in the face of the new realities just outlined, how should those charged with ensuring resilience respond?
The response will require, at its core, some kind of cognitive enhancement of normal human observational capabilities. Data, as discussed above, comes in many forms, but from a digital perspective it can be reduced to four fundamental types or telemetry streams. First, there is numerical data in the form of time-stamped metrics. This data typically provides basic information about the occurrence and timing of events taking place in a digital environment. Second, there is data that traces the flow of such events across a digital infrastructure, providing, in other words, locational or topological information about the events taking place. Third, there are structured event records, which provide structured information about, and unique names for, the events taking place. And finally, there are logs, which are essentially repositories of unstructured information about those digital events. Given the complexity and volumes involved, AI becomes a crucial element in making observations based on metrics, traces, event records, and logs actionable.
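To make the four streams concrete, the sketch below models them as simple Python data structures; the field names are hypothetical choices of our own, not any vendor’s or standards body’s schema.

```python
# Illustrative sketch of the four telemetry streams; field names are hypothetical.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Metric:
    """Time-stamped numerical measurement: when something happened and how much."""
    name: str
    timestamp: float
    value: float

@dataclass
class Span:
    """One hop of a trace: where an event travelled across the infrastructure."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start: float
    duration_ms: float

@dataclass
class Event:
    """Structured record with a unique name for something that happened."""
    event_id: str
    kind: str
    timestamp: float
    attributes: dict = field(default_factory=dict)

@dataclass
class LogRecord:
    """Unstructured (or semi-structured) free text describing a digital event."""
    timestamp: float
    message: str
```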
Now, AI should not, here, be taken to mean some kind of attempt to replicate or automate human cognition. It should rather be seen as a class of algorithms that are perhaps inspired by actual human cognitive function but are intended ultimately to enhance and extend the ability of humans to observe, understand, and act upon continuously evolving environments. This class breaks down into seven sub-classes, with each sub-class serving as an algorithmic filter on the flow from signal to response.
So, in order to achieve the goal of operational resilience, and in order to deliver what the managers of financial services concerns require, a seven-step approach is needed that explicitly accords a central role to observability, automation, and AI.
The Splunk platform removes the barriers between data and action, empowering observability, IT and security teams to ensure their organizations are secure, resilient and innovative.
Founded in 2003, Splunk is a global company with over 7,500 employees, more than 1,020 patents to date, and availability in 21 regions around the world. Splunk offers an open, extensible data platform that supports shared data across any environment so that all teams in an organization can get end-to-end visibility, with context, for every interaction and business process. Build a strong data foundation with Splunk.