The question isn't whether an incident will happen: it's when it will happen. Systems will crash. Software will fail. Vendors will suffer an outage of their own. It's your job to be prepared for these problems, and incident severity levels are one of the tools you need.
Incidents have varying impacts on your business and customers. Incident severity levels are how you classify their impact and manage your response. When you use severity levels properly…
In this article, let's look at what incident severity levels are, how to use them and how they differ from priority levels.
A vital part of the incident management practice, severity levels measure how acutely an event impacts your business. Whether an event is internal, such as equipment or software failures, or external, such as a security breach or a vendor outage, it has a specific effect on your ability to serve your clients. The severity level reflects that impact.
(Manage security incidents and events better with these SIEM features.)
Depending on the organization, severity levels commonly range from one to three, four or five, with one (SEV 1) being the most severe and the highest number in your system (3, 4 or 5) being the least severe.
There's no universal definition for severity levels. How you define them depends on what's important to your organization and your users. For some companies, only three levels make sense. For others, dividing incidents into five may be a better idea. Here are definitions for five levels:
Severity | Description |
SEV 1 | A critical incident that affects a large number of users in production. |
SEV 2 | A significant problem affecting a limited number of users in production. |
SEV 3 | An incident that causes errors, minor problems for users, or a heavy system load. |
SEV 4 | A minor problem that affects the service but doesn't have a serious impact on users. |
SEV 5 | A low-level deficiency that causes minor problems. |
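If your tooling needs to store or compare these levels, the table can be written down as a small enumeration. This is a minimal sketch in Python, not a standard schema: the names `Severity`, `DESCRIPTIONS` and `is_more_severe` are assumptions made for illustration, and the descriptions are copied from the table above.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five-level scale from the table; a lower number means a more severe incident."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


# Descriptions taken from the table; adapt the wording to your organization.
DESCRIPTIONS = {
    Severity.SEV1: "A critical incident that affects a large number of users in production.",
    Severity.SEV2: "A significant problem affecting a limited number of users in production.",
    Severity.SEV3: "An incident that causes errors, minor problems for users, or a heavy system load.",
    Severity.SEV4: "A minor problem that affects the service but doesn't have a serious impact on users.",
    Severity.SEV5: "A low-level deficiency that causes minor problems.",
}


def is_more_severe(a: Severity, b: Severity) -> bool:
    """Return True when `a` outranks `b` (SEV 1 outranks SEV 2, and so on)."""
    return a < b
```

Using an integer-backed enum keeps the "lower number is worse" convention explicit, so sorting a list of incidents by severity puts the most serious ones first.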
When an incident occurs, your teams need to know:
For example, when an outage occurs that affects all users, a typical response is "All hands on deck!" But having everyone focus on a single problem isn't productive. It's usually counter-productive and leads to duplicated or even contradictory efforts and confusion. Defining a severity level and attaching processes to it leads to a better response. (Even better: Designate an Incident Commander so you already know who's calling the shots.)
Defining severity levels should be a part of your incident management plan. They go a long way toward answering these questions in advance and save your team time, since everyone knows what to do as soon as an incident is assigned a level.
(Check out these incident review best practices.)
Using our questions above, let’s see what the answers to a SEV 1 incident might be:
For a SEV 5 outage, the answers are very different:
Severity levels are a common reference for everyone involved in responding to incidents. With an assigned level and a clear set of procedures, the right teams get to work on clearing the issue. Without them, you'll either lose time working out the rules of engagement or create more issues by not having them.
(See how Splunk solutions support the entire incident management practice.)
From a distance, severity and priority look like the same thing. If you have a SEV 1 incident, it's obvious that you're going to clear it before a SEV 2, so what's the difference between severity and priority?
Priority and severity often match up perfectly. An outage that prevents all users from using a service is both high priority and SEV 1. This is an example of technical issues and business priorities being in alignment. But sometimes these priorities don't align:
Even though these classifications can be at odds, they're both important methods of communication. Severity tells stakeholders how serious an issue is. Priority tells technology staff what they need to work on next.
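One way to keep the two ideas separate in a ticketing or paging tool is to store them as independent fields. The sketch below is an assumption-laden illustration, not how any particular product works: the `Incident` class, its fields and the sample incidents are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    severity: int  # 1 = most severe; what stakeholders are told
    priority: int  # 1 = work on first; what drives the engineering queue


def work_queue(incidents: list[Incident]) -> list[Incident]:
    """Order incidents by priority, using severity only to break ties."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


# Hypothetical incidents where severity and priority disagree.
incidents = [
    Incident("Heavy load on an internal reporting system", severity=3, priority=3),
    Incident("Minor display error on a high-visibility page", severity=4, priority=1),
]

for incident in work_queue(incidents):
    print(f"P{incident.priority} / SEV {incident.severity}: {incident.title}")
```

Here the lower-severity incident is worked on first because the business ranked it higher, while status reports can still describe it accurately as SEV 4.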
(Track more incident response metrics.)
Incident severity levels are a simple enough concept. Unfortunately, simple doesn’t mean easy to implement. You can't copy them from a blog post or white paper and immediately put them into use. You need to adapt them to your organization by taking several factors into consideration, such as:
Still, these best practices can help your organization define (and adhere to) incident severity levels.
Best practice: Adopt a unified set of levels and descriptions for your entire company.
Using different incident severity levels for different applications or software stacks, especially if you're in a large organization, might look like a good idea. But it will undermine one of the biggest benefits of creating the levels in the first place: clear communication about incidents. Different levels or definitions will make it hard for stakeholders to understand what an incident means. It may even confuse engineers and developers who work on different applications.
Best practice: Use the smallest number of severity levels you can. No more, no less.
Too many levels will quickly become confusing. One reason incident severity levels exist is so that when an incident occurs, you can assign it a level and get to work. Too many levels will slow this down. Too few will lead to lumping incidents together. Subtle (or even not so subtle) differences between incidents will disappear when they're forced into the same category.
How do you get it right? Get the stakeholders together and come up with a plan. Go over past incidents and see how they fit into a proposed framework. Examine previous root cause analyses. Try it out and don't be afraid to change your scheme if you need to.
Best practice: Make it easy to assign severity levels.
If your organization can't quickly assign the right severity level to an incident, you won't reap the advantages of having a system in place. So, you need specific rules that make the right level not only easy to assign, but self-evident. You don't want to waste time arguing over the severity of an incident.
You need to designate the level and get to work. So, create rules that rely on measurable impact, such as:
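Rules based on measurable impact can be encoded so that assigning a level is mechanical rather than a debate. The following is a minimal sketch under stated assumptions: the `Impact` fields and every threshold are hypothetical placeholders, not recommended values, and your own rules should come from your past incidents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Impact:
    """Measurable facts about an incident (hypothetical fields)."""
    users_affected_pct: float  # share of production users affected, 0-100
    core_service_down: bool    # True if a customer-facing service is unavailable
    error_rate_pct: float      # share of requests failing, 0-100


def assign_severity(impact: Impact) -> int:
    """Return a level from 1 (most severe) to 5, checking the worst case first.

    All thresholds below are placeholders chosen for illustration.
    """
    if impact.core_service_down and impact.users_affected_pct >= 50:
        return 1
    if impact.core_service_down or impact.users_affected_pct >= 10:
        return 2
    if impact.error_rate_pct >= 5:
        return 3
    if impact.users_affected_pct > 0 or impact.error_rate_pct > 0:
        return 4
    return 5


print(assign_severity(Impact(users_affected_pct=80, core_service_down=True, error_rate_pct=40)))
```

Because the checks run from most to least severe and each returns immediately, every incident gets exactly one level, and two responders looking at the same numbers reach the same answer.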
Now you've got a great understanding of incident severity levels and how to use them. Effectively, these levels are communication tools, so you can share the impact of a problem and quickly get the right teams engaged to solve it. Of course, severity and priority are related in incidents, but they are still very different.
(For the latest in all things security, check out these Cybersecurity and InfoSec Events & Conferences.)