If you’ve ever seen the funny Abbott and Costello bit called “Who’s On First,” you’ll recall the hysterical banter between the two as Costello asks Abbott who is playing on the St. Louis baseball team. Abbott explains how peculiar ballplayer names have become, sharing St. Louis ballplayer names like ‘Who,’ ‘What’ and ‘I Don’t Know.’ An excerpt goes a little like this:
Abbott: I’m telling you. Who’s on first, What’s on second, I Don’t Know is on third.
Costello: You know the fellows’ names?
Abbott: Yes.
Costello: Well, then who’s playing first?
Abbott: Yes.
Costello: I mean the fellow’s name on first base.
Abbott: Who.
Costello: The fellow playin’ first base.
Abbott: Who.
Costello: The guy on first base.
Abbott: Who is on first.
Costello: Well, what are you askin’ me for?
It goes on for a bit until it ends with a frustrated Costello emphatically stating, “I DON’T CARE” to which Abbott replies, “Oh, that’s our shortstop!”
On-call rotations can be a little like Abbott and Costello’s Who’s On First routine, at least if you don’t have the right incident management software. That’s where Splunk On-Call comes in: to demystify on-call and make incident routing a breeze! Who’s on-call? Let’s find out.
Let’s start with a quick high-level overview of Splunk On-Call so that we are all on the same playing field. (Ha! See what I did there?) Splunk On-Call is incident response software that allows teams to maintain a culture of high availability without slowing down the innovation process. Setting up Splunk On-Call requires just a few steps, and in this article, I’m going to walk you through a few of the ins and outs (did it again!) of setup, highlighting how On-Call can help you simplify your response workflows.
To set up On-Call, we need to do a couple of things:
And, if you really want to show off your Gandalf wizard-like skills, you can also enable the Rules Engine, which allows you to trigger custom actions, such as annotating alerts with images, links and notes, overwriting alert fields or even adding new fields when certain conditions are met.
Perhaps you are an IT Ops leader in your organization, tasked with overseeing IT Infrastructure and Operations, IT Shared Services or IT Service Delivery, and you are seeking to streamline processes while reducing operational costs. In this blog, we are going to show you how Splunk On-Call features can help you achieve these goals, focusing on steps three and four: setting Escalation Policies and configuring Routing Keys. And, I will even pitch in (again, with the baseball puns) a few best practices because, well, who doesn’t love best practices?
An Escalation Policy governs which incidents are routed, to whom incidents are routed, and how incidents are escalated. Escalation Policies live under Teams, along with the team schedule and on-call rotations. Because Escalation Policies live under Teams, every team must have at least one Escalation Policy. To access Teams, click the Teams navigation item in the top navigation menu.
You may be thinking, “Wait, doesn’t the Routing Key govern the routing of incidents?” You are correct! However, Escalation Policies and Routing Keys work hand-in-hand, similar to road signs and traffic lights. The traffic light will tell you when to go (Escalation Policies) and the road signs will tell you which way to go (Routing Keys).
Best practice: Set a minimum of three escalation paths: the on-call user, previous/next user in a rotation, and a manager/team lead.
To add an Escalation Policy, you will navigate to the team for which you are creating the policy. From the Escalation Policy tab on the Team’s profile, simply click the Add Escalation Policy button.
For the Policy Name, you will want to name the policy so that it is easy to find when assigning it to a Routing Key. (A little ‘Best Practice’ gem hidden right there!) For example, let’s say I have a Routing Key called ‘IOS’ that is used to receive IOS-related alerts. In my Mobile Applications Team, I should name the Escalation Policy “IOS” to easily map my IOS Routing Key to the IOS Escalation Policy for the Mobile Applications Team:
Moving down the page, just below the Policy Name field, you will see a checkbox titled, “Ignore Custom Paging Policies.” This checkbox allows the Escalation Policy to bypass any custom paging policies a user may have set in their profile. For example, let’s say one of your engineers has set a custom paging policy in their profile so that they only receive push notifications from midnight through 6 AM, rather than an SMS or a phone call, because they don’t want to be woken up for non-urgent alerts in the middle of the night. (Can you blame them?)
However, let’s say you work at a nuclear power plant and the Escalation Policy is for handling incidents for the plant’s cooling system, something pretty important at a nuclear power plant. (Although, I sort of feel like EVERYTHING is important at a nuclear power plant). If an incident comes in for the cooling system, it’s an all-hands-on-deck situation regardless of any custom paging policies the engineers have set up in their profile.
Therefore, you might want to select the Ignore Custom Paging Policies option for this type of Escalation Policy. Use with caution, because if every alert is ‘critical’, developers might just start ignoring alerts.
The next step in creating an Escalation Policy is setting up Steps. (Best Practice forthcoming!) When setting up Escalation Policy steps with Splunk On-Call, it’s a good idea to select “immediately” for the first step unless you want a waiting-room-type policy. Also note that with each step you add, the time delay is cumulative. For example, if you set a time delay of five minutes for Step two and a time delay of 10 minutes for Step three, the Escalation Action performed for Step three will actually occur 15 minutes after the incident arrives in Splunk On-Call.
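To make the cumulative timing concrete, here is a minimal sketch of the arithmetic. The `step_delays` list and the `escalation_times` helper are hypothetical names used only for illustration; they are not part of Splunk On-Call.

```python
from itertools import accumulate

# Hypothetical per-step delays in minutes, in the order the steps appear
# in the Escalation Policy. Step one is set to "immediately" (0 minutes).
step_delays = [0, 5, 10]


def escalation_times(delays):
    """Return how many minutes after incident arrival each step fires."""
    # Each step's delay is added on top of all the delays before it.
    return list(accumulate(delays))


for step, fires_at in enumerate(escalation_times(step_delays), start=1):
    print(f"Step {step} fires {fires_at} minutes after the incident arrives")
```

With these delays, Step three fires at the 15-minute mark, not the 10-minute mark, which is worth keeping in mind when deciding how quickly an incident should reach a manager or team lead.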
In the Step Actions drop-down, you will notice quite a few options. For this article, we will use “Notify the on-duty user(s) in rotation.” You can read more about the other Escalation Policy Actions in our Splunk On-Call documentation.
Great! Once you’ve set up your Escalation Policy, you need to attach it to a Routing Key.
The best way to think of routing keys is like a triaging system at the ER. As patients arrive at an ER, they are routed for care after being assessed. Routing keys are essentially the same thing. An incident comes into Splunk On-Call and the Routing Key tells the incident where to go. Sometimes at an ER, just one doctor will attend to the patient, perhaps an orthopedist. And other times, multiple doctors may attend to the patient, like an orthopedist and a plastic surgeon. Similarly, Routing Keys can have one or more Escalation Policies assigned to them.
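There are only three things you need to do when creating a Routing Key in Splunk On-Call: name the Routing Key, decide whether to enable Multi-Responder, and assign the Routing Key to one or more Escalation Policies.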
So let’s look at each of these steps in detail.
Naming your Routing Key
Integrations that send incidents to Splunk On-Call use the Routing Key as the identifier for the On-Call API. And we all know how picky URL syntax can be, so when naming your Routing Key, it’s best to keep it to just letters, numbers, hyphens and underscores. And even though the old Abbott and Costello bit is hilarious, please don’t name your routing key Who, What, or I Dunno.
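If you want to check names before creating them, the naming advice can be sketched as a small validator. This is only an illustration: `is_safe_routing_key`, `alert_url` and the `alerts.example.com` base URL are hypothetical placeholders, not the actual On-Call API or endpoint.

```python
import re

# Letters, numbers, hyphens and underscores only, per the naming advice above.
ROUTING_KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def is_safe_routing_key(name: str) -> bool:
    """Return True if the whole name uses only URL-friendly characters."""
    return ROUTING_KEY_PATTERN.fullmatch(name) is not None


def alert_url(base_url: str, routing_key: str) -> str:
    """Append the Routing Key to a (placeholder) integration URL."""
    if not is_safe_routing_key(routing_key):
        raise ValueError(f"Routing Key {routing_key!r} is not URL-friendly")
    return f"{base_url.rstrip('/')}/{routing_key}"


# Placeholder base URL for illustration only.
BASE_URL = "https://alerts.example.com/integrations/alert/API_KEY"

print(alert_url(BASE_URL, "IOS"))
print(is_safe_routing_key("I Dunno"))  # the space makes this one unsafe
```

A name with a space or punctuation would have to be percent-encoded by every integration that sends alerts, which is exactly the kind of detail that gets missed at 3 AM.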
Multi-Responder Incident Response
When creating your Routing Key, you will notice a Multi-Responder checkbox. If checked, an acknowledgment will be required from a member of each defined Escalation Policy before the incident becomes fully acknowledged. Going back to our ER analogy, this would mean each attending physician would need to sign off on the patient before the patient could be discharged. Remember, in Splunk On-Call, a Routing Key may have multiple Escalation Policies. Checking the Multi-Responder box means each team, via its assigned Escalation Policy, will need to acknowledge the incident before it is fully acknowledged.
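The Multi-Responder rule can be sketched as a tiny model. The `Incident` class below is hypothetical and written only to illustrate the logic; it also assumes that, without Multi-Responder, a single acknowledgment from any assigned policy is enough.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    # Escalation Policies assigned to the Routing Key that received the incident.
    policies: list[str]
    # Mirrors the Multi-Responder checkbox on the Routing Key.
    multi_responder: bool = False
    # Policies whose members have acknowledged so far.
    acked_by: set[str] = field(default_factory=set)

    def acknowledge(self, policy: str) -> None:
        """Record an acknowledgment from a member of the given policy."""
        if policy not in self.policies:
            raise ValueError(f"{policy!r} is not assigned to this incident")
        self.acked_by.add(policy)

    def fully_acknowledged(self) -> bool:
        if self.multi_responder:
            # Every assigned policy must have acknowledged.
            return all(p in self.acked_by for p in self.policies)
        # Assumption: otherwise one acknowledgment is enough.
        return len(self.acked_by) > 0


incident = Incident(["Orthopedics", "Plastic Surgery"], multi_responder=True)
incident.acknowledge("Orthopedics")
print(incident.fully_acknowledged())  # still waiting on Plastic Surgery
incident.acknowledge("Plastic Surgery")
print(incident.fully_acknowledged())
```

In ER terms, the first sign-off alone does not discharge the patient; the incident only counts as fully acknowledged once every assigned policy has responded.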
Assign the Routing Key to Escalation Policies
The last step is to assign one or more Escalation Policies to your Routing Key. To assign an Escalation Policy, simply select it from the drop-down.
And voila! You have set up your very first Incident Response Workflow for On-Call. Now Who’s On Call? Read our product documentation to learn more and get started today!