If you held a competition to identify a term that describes the state of the world, VUCA would probably emerge as one of the leading contenders. The acronym, short for volatility, uncertainty, complexity, and ambiguity, is a great description of how unpredictable life is, and how disruption can overturn the stability we have become used to.
Natural or manmade disasters can occur at any time, resulting in damage, loss, or impairment that prevents organizations from meeting their objectives and satisfying the needs of their stakeholders. Building resilience against such disruption is an essential competence that can both:
Enter: the service continuity management practice. As outlined in the ITIL® 4 framework, the practice of service continuity management helps to ensure a service provider’s readiness to respond to all kinds of disruptive events that may impact core activities — and your credibility.
In this article, we will look at the concepts, processes, and measures that should be well understood by any student of service continuity.
A framework for building organizational resilience, service continuity management helps an organization to:
Primarily a proactive measure, service continuity management is designed to prepare and organize the people, infrastructure, systems and resources required to predict and counter the negative effects resulting from a disaster.
(Related reading: business continuity vs. business resilience & how Splunk delivers business continuity, so you can go from disruption to resilience in no time.)
Disruptions come in many shapes and forms. An earthquake in Japan bringing down mobile communication services. A COVID-19 outbreak infecting air traffic control staff leading to cancellation of flights at London Gatwick Airport. An outage on the FAA’s NOTAM system resulting in thousands of US flights being canceled or delayed.
No matter the source of a disruption or its magnitude, users and other stakeholders expect a service provider to continue providing services at acceptable predefined levels. Time is of the essence when it comes to recovery and resumption of operations, so the service provider is expected to put in place mechanisms to ensure the enterprise is ready to swiftly respond to any incident or disaster once it occurs.
Service continuity supports overall business continuity management from the perspective of operational risks. The ISO 22301 standard for business continuity management systems outlines two main processes that serve as the basis for planning for service continuity: business impact analysis (BIA) and risk assessment.
The information from these two processes helps in informing the service continuity requirements which are usually outlined as target timelines. These include:
Recovery Time Objective (RTO): The maximum period of time following a service disruption that can elapse before the lack of business functionality severely impacts the organization. This is the maximum agreed time within which a product or an activity must be resumed, or resources must be recovered.
Maximum Acceptable Outage (MAO): The time it would take for adverse impacts, which might arise as a result of not providing a product/service or performing an activity, to become unacceptable. The MAO is longer than the RTO by an amount which accounts for the organizational risk appetite.
Recovery Point Objective (RPO): The point to which the information used by an activity must be restored to enable the activity to operate effectively upon resumption. It is expressed as the span of time before the disruption for which information loss is acceptable.
Service Continuity Requirements
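To make the three targets concrete, here is a minimal sketch of how they might be checked after a disruption. The `ContinuityTargets` class, the `assess_recovery` function, and the sample timestamps are all hypothetical names and values chosen for illustration; they do not come from ITIL or ISO 22301.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ContinuityTargets:
    """Hypothetical container for one service's continuity targets."""
    rto: timedelta  # Recovery Time Objective
    rpo: timedelta  # Recovery Point Objective
    mao: timedelta  # Maximum Acceptable Outage

    def __post_init__(self):
        # The MAO is described above as longer than the RTO.
        if self.mao <= self.rto:
            raise ValueError("MAO must be longer than RTO")


def assess_recovery(targets, last_good_copy, disrupted_at, resumed_at):
    """Compare an actual recovery against the agreed targets."""
    downtime = resumed_at - disrupted_at
    # Information created after the last good copy is lost on restore.
    data_loss_window = disrupted_at - last_good_copy
    return {
        "rto_met": downtime <= targets.rto,
        "mao_exceeded": downtime > targets.mao,
        "rpo_met": data_loss_window <= targets.rpo,
    }


# Illustrative values only.
targets = ContinuityTargets(
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=15),
    mao=timedelta(hours=6),
)
result = assess_recovery(
    targets,
    last_good_copy=datetime(2024, 1, 1, 9, 50),
    disrupted_at=datetime(2024, 1, 1, 10, 0),
    resumed_at=datetime(2024, 1, 1, 13, 30),
)
print(result)
```

In this example the service was down for 3.5 hours and lost 10 minutes of information, so both the RTO and the RPO are met and the MAO is not exceeded.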
The continuity requirements inform the service continuity strategies.
Business stakeholders would prefer that their IT systems have the lowest possible RTO and RPO (e.g. 10 seconds or less), but they should be well informed that faster recovery with lower data loss requires additional resources and configurations. For example: maintaining a disaster recovery site that has real-time replication of all information in a primary site or cloud can cost millions of dollars, depending on the continuity requirements.
Therefore, set your continuity targets on an application-by-application basis, since the targets chosen for each application directly drive operational complexity and implementation cost. (For instance, cloud providers such as AWS provide guidance on setting resilience policies, including RTO/RPO targets per application.)
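The trade-off between tighter targets and higher cost can be sketched as a simple selection problem: for each application, pick the cheapest option that still meets its RTO and RPO. Everything in the sketch below is an assumption made for illustration, including the strategy names, the achievable recovery figures, the relative cost units, and the two sample applications.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    """Hypothetical recovery option with made-up capabilities and cost."""
    name: str
    achievable_rto: timedelta
    achievable_rpo: timedelta
    relative_cost: int  # arbitrary units, for comparison only


# Illustrative figures, not vendor or standards data.
STRATEGIES = [
    Strategy("backup and restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("warm standby", timedelta(hours=1), timedelta(minutes=15), 5),
    Strategy("real-time replicated site", timedelta(seconds=10), timedelta(seconds=10), 20),
]


def cheapest_strategy(target_rto, target_rpo, strategies=STRATEGIES):
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in strategies
        if s.achievable_rto <= target_rto and s.achievable_rpo <= target_rpo
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.relative_cost)


# Per-application (RTO, RPO) targets; both applications are hypothetical.
APP_TARGETS = {
    "payments": (timedelta(seconds=30), timedelta(seconds=10)),
    "intranet wiki": (timedelta(hours=48), timedelta(hours=24)),
}

plan = {app: cheapest_strategy(rto, rpo) for app, (rto, rpo) in APP_TARGETS.items()}
for app, strategy in plan.items():
    print(app, "->", strategy.name if strategy else "no strategy meets targets")
```

With these sample numbers, only the most expensive option satisfies the payments targets, while the wiki can use the cheapest one. This is the reason for not applying one blanket target to every application.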
Service continuity strategies should take both proactive and reactive postures that ensure that the enterprise’s service delivery mechanisms are adequately protected, and mitigation mechanisms can respond to and manage impacts of disruptive events.
A strategy must be supported by at least one solution which includes approaches, arrangements, methods, procedures, treatments, and actions to be carried out to implement the strategy.
Examples of continuity strategies outlined within the Business Continuity Institute’s Good Practice Guidelines include:
Once the enterprise has decided the preferred service continuity strategies, the relevant operational teams document the service continuity plan. This plan contains the detailed guidance to:
The continuity plan facilitates timely warning and communication to relevant stakeholders, and provides the information required to effectively respond to a disruption. The ISO 22301 standard states that the contents of the plan should be specific, flexible, focused, effective in minimizing impact, and have clear assignment of roles and responsibilities.
According to the EU Agency for Cybersecurity (ENISA), there are four stages that are covered in an IT service continuity plan:
Information to include. Some of the information contained in the service continuity plan includes continuity requirements, IT architecture, roles and responsibilities, invocation and damage assessment procedures, communication approach, escalation matrixes, recovery and fail-back procedures, test plans, contact details, dependencies, resources, and reporting requirements.
Regular review cycles. The service continuity plan should be tested and reviewed at least annually to ensure that it remains relevant in supporting the organization’s continuity objectives.
Employees, contractors, and any other stakeholder who is directly involved in the delivery of services should be trained on the continuity plans based on their role-specific competence requirements.
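As a rough illustration of how the content and review points above could be tracked, the sketch below checks a plan for missing sections and an overdue review. The section names are taken from the list above; the `audit_plan` function, the 365-day threshold used to represent "at least annually", and the sample dates are assumptions made for the example.

```python
from datetime import date, timedelta

# Sections named in the "Information to include" item above.
REQUIRED_SECTIONS = {
    "continuity requirements",
    "IT architecture",
    "roles and responsibilities",
    "invocation and damage assessment procedures",
    "communication approach",
    "escalation matrixes",
    "recovery and fail-back procedures",
    "test plans",
    "contact details",
    "dependencies",
    "resources",
    "reporting requirements",
}

# Assumption: "at least annually" is approximated as 365 days.
MAX_REVIEW_AGE = timedelta(days=365)


def audit_plan(sections, last_reviewed, today):
    """Return (sorted missing sections, whether the review is overdue)."""
    missing = sorted(REQUIRED_SECTIONS - set(sections))
    review_overdue = (today - last_reviewed) > MAX_REVIEW_AGE
    return missing, review_overdue


# Hypothetical plan that lacks two sections and was last reviewed in 2023.
missing, overdue = audit_plan(
    sections=REQUIRED_SECTIONS - {"test plans", "contact details"},
    last_reviewed=date(2023, 1, 10),
    today=date(2024, 3, 1),
)
print("Missing sections:", missing)
print("Review overdue:", overdue)
```

A check like this does not replace testing the plan itself; it only flags documentation gaps and stale reviews.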
Implementing and maintaining service continuity plans is a significant strategic investment for any enterprise that wants to demonstrate to its stakeholders that it is resilient and can be trusted to continue delivering services in the face of devastation. Solutions to mitigate unacceptable risks and single points of failure should be carefully chosen to ensure they meet the service continuity requirements, while also being cost-effective and practical and not introducing unnecessary complexity within the IT environment.
Service continuity management is not an easy undertaking and requires continued support across all management levels within the enterprise.