Fault incidents are inevitable. They occur in any large-scale enterprise IT environment, especially when:
In fact, research indicates that more than half (50%) of the leaders in tech and business organizations consider the complexity of their data architecture a significant pain point.
From an end-user perspective, businesses must overcome complex architecture in order to ensure service delivery and continuity. While fault incidents may be unavoidable, a fault tolerant system goes a long way toward achieving this objective.
Let’s take a look at fault tolerance, including core capabilities of a fault tolerant system. Then, we’ll turn to a new topic: how AI can help ensure fault tolerance in your systems.
Fault tolerance is the term for continuity of operations in the event of a fault, failure, error or disruption. Put simply, fault tolerance means that service failure is avoided in the presence of a fault incident.
To ensure continuous and dependable operations, IT systems and software are designed for fault tolerance. It’s a dependability that is introduced by capabilities that actively overcome disruptive and anomalous events, including:
Fault tolerant system design is tested and measured against dependability and reliability metrics (aka failure metrics) such as Mean Time to Failure (MTTF) and Mean Time to Repair (MTTR), among others.
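As a rough illustration of how these failure metrics can be computed, here is a minimal sketch. The `Incident` record and the sample figures are hypothetical, and the availability formula used (MTTF divided by MTTF plus MTTR) is the common steady-state approximation rather than anything specific to a particular tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """One failure event (hypothetical record layout)."""
    uptime_before_failure_h: float  # hours the system ran before it failed
    repair_time_h: float            # hours needed to restore service


def mttf(incidents: List[Incident]) -> float:
    """Mean Time to Failure: average uptime preceding each failure."""
    return sum(i.uptime_before_failure_h for i in incidents) / len(incidents)


def mttr(incidents: List[Incident]) -> float:
    """Mean Time to Repair: average time spent restoring service."""
    return sum(i.repair_time_h for i in incidents) / len(incidents)


def availability(incidents: List[Incident]) -> float:
    """Steady-state availability approximated as MTTF / (MTTF + MTTR)."""
    up, down = mttf(incidents), mttr(incidents)
    return up / (up + down)


# Made-up sample data, for illustration only.
incidents = [
    Incident(uptime_before_failure_h=700.0, repair_time_h=2.0),
    Incident(uptime_before_failure_h=500.0, repair_time_h=4.0),
    Incident(uptime_before_failure_h=900.0, repair_time_h=3.0),
]
```

With this sample data, MTTF is 700 hours, MTTR is 3 hours, and availability comes out at roughly 99.57%.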
So how do you develop a system that is more fault tolerant? A fault tolerant system uses the following key capabilities:
Architectural abstraction is defined as the generalization that obscures the complex inner workings of an IT system.
Subsystems operate in isolation, yet communicate with other system components via strong integration, interfacing and interoperability. The abstraction layer guides how the dependent subsystems behave in response to a fault. If one subsystem fails, the architecture can adequately interface with a redundant subsystem, realizing the missing functionality.
The layer of abstraction means that developers and system design architects can, in the event of a service outage or disruption, programmatically move workloads and guarantee fault tolerance.
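To make the idea concrete, here is a minimal sketch of an abstraction layer that hides which subsystem actually serves a request and falls back to a redundant one on failure. All names (`Subsystem`, `Backend`, `FailoverFacade`, `SubsystemUnavailable`) are hypothetical and chosen only for illustration. A real system would add timeouts, retries and health reporting.

```python
from typing import List, Protocol


class SubsystemUnavailable(Exception):
    """Raised when a subsystem cannot serve a request."""


class Subsystem(Protocol):
    """The shared interface that callers depend on."""
    name: str

    def handle(self, request: str) -> str: ...


class Backend:
    """A toy subsystem whose health can be toggled to simulate a fault."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise SubsystemUnavailable(self.name)
        return f"{self.name} handled {request}"


class FailoverFacade:
    """Abstraction layer: callers never see which backend did the work."""

    def __init__(self, backends: List[Subsystem]) -> None:
        self._backends = list(backends)

    def handle(self, request: str) -> str:
        # Try each backend in priority order; move on when one is down.
        for backend in self._backends:
            try:
                return backend.handle(request)
            except SubsystemUnavailable:
                continue
        raise SubsystemUnavailable("all redundant subsystems failed")


primary, standby = Backend("primary"), Backend("standby")
service = FailoverFacade([primary, standby])
```

Callers only ever invoke `service.handle(...)`. If `primary.healthy` is set to `False`, the same call is served by `standby` without any change on the caller's side.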
Frameworks and design principles such as Service-Oriented Architecture (SOA) and microservices allow users to create fault tolerant systems for high service dependability.
Load balancing enables fault tolerance by automatically distributing network traffic over multiple servers, containers and cloud instances. Load balancing systems optimize resource utilization in response to changing network traffic demands and usage spikes.
A service-oriented architecture and microservices-based environment can be designed to run workloads on different resources depending on:
The load balancer constantly monitors the health of its target resource entities and can be configured to route mission critical workloads to specific targets when the health of an IT system deteriorates below an acceptable threshold.
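The health-aware routing described above can be sketched as follows. This is a simplified, in-memory illustration and not the behavior of any particular load balancer product. The `Target` fields, the `min_health` and `degraded_below` thresholds, and the round-robin policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    name: str
    health: float           # assumed health score from 0.0 (down) to 1.0 (fine)
    reserved: bool = False  # designated target for mission critical workloads


class LoadBalancer:
    def __init__(self, targets: List[Target],
                 min_health: float = 0.5,
                 degraded_below: float = 0.7) -> None:
        self.targets = list(targets)
        self.min_health = min_health          # per-target cutoff
        self.degraded_below = degraded_below  # system-wide threshold
        self._counter = 0

    def healthy(self) -> List[Target]:
        """Targets whose health score passes the per-target cutoff."""
        return [t for t in self.targets if t.health >= self.min_health]

    def system_health(self) -> float:
        """Average health score across all targets."""
        return sum(t.health for t in self.targets) / len(self.targets)

    def route(self, critical: bool = False) -> str:
        pool = self.healthy()
        # When overall health drops below the threshold, pin critical
        # workloads to the reserved targets (if any are still healthy).
        if critical and self.system_health() < self.degraded_below:
            pool = [t for t in pool if t.reserved] or pool
        if not pool:
            raise RuntimeError("no healthy targets available")
        # Plain round-robin over the selected pool.
        target = pool[self._counter % len(pool)]
        self._counter += 1
        return target.name
```

In normal conditions, requests rotate across every healthy target. Once the average health falls below `degraded_below`, calls made with `critical=True` go only to targets marked `reserved`.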
(Learn about load balancing for microservices.)
Modularization and isolation allow users to contain the impact of a fault and limit damage to network performance. Here’s how…
When one subsystem fails, load balancing technologies and a service-oriented architecture will:
The isolation may span the control plane, data plane and management plane, depending on the nature of the fault. By default, such isolation may not be embedded into your systems.
Cloud vendors may offer logical isolation and modularity for fault tolerance within their own systems. Multi-cloud systems, however, require additional tools and customizations that eliminate all circular dependency between such subsystems.
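One way to check for circular dependencies between subsystems is a depth-first search over a dependency map. The sketch below assumes you can express dependencies as a simple dictionary. The subsystem names are hypothetical, and real multi-cloud tooling would need to discover this graph first.

```python
from typing import Dict, List, Optional

WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if acyclic."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for nxt in deps.get(node, []):
            if state.get(nxt, WHITE) == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state.get(nxt, WHITE) == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical subsystems: auth needs db, db needs cache, cache needs auth.
cyclic = {"auth": ["db"], "db": ["cache"], "cache": ["auth"]}
acyclic = {"auth": ["db"], "db": ["cache"], "cache": []}
```

Here `find_cycle(cyclic)` reports the loop through `auth`, `db` and `cache`, while `find_cycle(acyclic)` returns `None`. A check like this could run as part of a deployment pipeline to flag isolation problems before they reach production.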
Information redundancy supports two key objectives of fault tolerant behavior in an enterprise IT system: integrity and availability.
These objectives may be instrumented via information redundancy, where information is replicated and stored across multiple isolated and disparate network zones. Instead of actively preventing a fault incident, the impact zone is simply isolated and the system components are configured to access the redundant data workloads.
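The following sketch shows both objectives in miniature: data is replicated to several zones (availability) and stored with a checksum so a damaged copy can be detected and skipped (integrity). The `ReplicatedStore` class and the zone names are hypothetical, and the zones are plain in-memory dictionaries standing in for real storage.

```python
import hashlib
from typing import Dict, Iterable, Set, Tuple


class ReplicatedStore:
    """Toy key-value store that keeps a checksummed copy in every zone."""

    def __init__(self, zones: Iterable[str]) -> None:
        self._zones: Dict[str, Dict[str, Tuple[bytes, str]]] = {
            zone: {} for zone in zones
        }
        self._isolated: Set[str] = set()

    def write(self, key: str, value: bytes) -> None:
        digest = hashlib.sha256(value).hexdigest()
        # Replicate to every zone that is currently reachable.
        for zone, data in self._zones.items():
            if zone not in self._isolated:
                data[key] = (value, digest)

    def isolate(self, zone: str) -> None:
        """Fence off a zone affected by a fault."""
        self._isolated.add(zone)

    def read(self, key: str) -> bytes:
        for zone, data in self._zones.items():
            if zone in self._isolated or key not in data:
                continue
            value, digest = data[key]
            # Integrity check: only return a copy that matches its checksum.
            if hashlib.sha256(value).hexdigest() == digest:
                return value
        raise LookupError(f"no intact replica of {key!r} is reachable")


store = ReplicatedStore(["zone-a", "zone-b", "zone-c"])
store.write("order-42", b"paid")
store.isolate("zone-a")  # simulate a fault in one zone
```

After `zone-a` is isolated, `store.read("order-42")` is still answered from one of the remaining zones.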
(Related reading: site reliability engineering.)
Modern enterprise IT architecture is designed to handle large volumes of real-time information streams, with:
All of this is critical, and yet it makes the IT architecture schemes and design workflows inherently complex.
Realistically, it is challenging, if not impossible, to design an architecture that anticipates every single fault incident type, designing redundant failover subsystems with standardized integrations and interactions across every component and workload.
Instead, to introduce robustness and digital resilience into a complex system architecture, you can learn a dependability model. Such a model can capture the evolving fault situations and failure distributions across all incident types and categories.
Enterprises already have access to vast volumes of network logs and metrics data. From this information, you can enable a probabilistic model that can learn the distribution of failures. Here, you could analyze
With such a model, you’d be able to make informed assumptions and decisions on redundancy, reliability and availability. An AI model can develop a custom failure risk profile for all system components and subsystems.
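As a very small example of learning a failure distribution from historical data, the sketch below fits an exponential model to the hours observed between failures and turns it into a per-component risk estimate. The exponential assumption (a constant failure rate) is a deliberate simplification chosen for illustration, and the component names and figures are made up. A production model would likely use richer features from logs and metrics.

```python
import math
from statistics import mean
from typing import Dict, List


def fit_failure_rate(hours_between_failures: List[float]) -> float:
    """Maximum-likelihood rate for an exponential model: 1 / mean interval."""
    return 1.0 / mean(hours_between_failures)


def failure_probability(rate_per_hour: float, horizon_h: float) -> float:
    """P(at least one failure within the horizon) under the exponential model."""
    return 1.0 - math.exp(-rate_per_hour * horizon_h)


def risk_profile(history: Dict[str, List[float]],
                 horizon_h: float) -> Dict[str, float]:
    """Failure risk per component over the given time horizon."""
    return {
        name: failure_probability(fit_failure_rate(gaps), horizon_h)
        for name, gaps in history.items()
    }


# Hypothetical history: hours between consecutive failures per component.
history = {
    "db": [100.0, 200.0, 300.0],
    "cache": [1000.0, 1400.0],
}
risk = risk_profile(history, horizon_h=24.0)
```

With these numbers, the `db` component has roughly an 11% chance of failing in the next 24 hours versus about 2% for `cache`, which suggests where extra redundancy would pay off first.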
In addition to learning the dynamics of a system failure, these models can also learn to capture the evolution of fault risks. This allows users to proactively plan for checkpoints, graceful failure and recovery, storage management, redundancy and dynamic resource provisioning.