"So you're telling me you have an employee watching a dashboard at all times? How is that not expensive?"
"So you get these emails from your alerts, but there's no action to take when you get them? How is that not spam and causing you to ignore them all?"
"So everyone emails these searches to each other to run if you want to know if the system is stable? How is that not prone to human error?"
I've come across all of these quirks, and I get why. When you're a member of a technical team, you often do odd things to keep the system up: things that once worked and that your silly human brain now compels you to repeat, even when it's arguably irrational. Sometimes you're in so deep that you don't notice the silliness until someone else comes along and points it out. If you're having trouble relating, recall the Band-Aid™ cron jobs (or Scheduled Tasks, for our Windows cousins) that you've set up.
Having a stronger practice around when and why you use Splunk's searches, alerts, and dashboards can make your Splunk usage dramatically more effective.
"...there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know. ..." - Donald Rumsfeld, 2012
Through my collaborations with Splunk users, we've come to recognize the circumstances that make certain Splunk product features the best practice for a given goal at a given time. Chained together, those goals represent the life-cycle of an incident: symptom → root cause investigation → permanent fix; or, as things really work at an enterprise: symptom → root cause investigation → temporary workaround & monitoring → permanent fix. That "temporary workaround & monitoring" phase may be as simple as restarting a server when a known confluence of symptoms occurs. It's a necessary evil, given that at an enterprise there are change windows, approvals, red tape, and political polish involved in getting any permanent fix created and applied.
As a visual thinker, I realized we had a 2x2 matrix: our knowledge of the root cause in relation to our knowledge that an issue is occurring. Kind of like a Johari Window for incidents!
                      | Root Cause Unknown | Root Cause Known
Issue Exists: Unknown | Q0                 | Q2
Issue Exists: Known   | Q1                 | Q3
This matrix did a great job of capturing this life-cycle! All is quiet (Q0) until you learn of some odd behavior. When these symptoms occur, you don't know the root cause, but you're now aware that an issue exists (see Q1) so you start an investigation to uncover the root cause (still Q1). Once that is known, you enter a cycle of checking if the symptoms present themselves (Q2), and if so, applying a fix (Q3).
So how might Splunk help here? Let's start by labeling each quadrant according to what we've outlined thus far:
                      | Root Cause Unknown | Root Cause Known
Issue Exists: Unknown | Listening          | Monitoring
Issue Exists: Known   | Investigating      | Attacking
By recognizing what we know and don't know, we can identify what action to take in each phase.
Listening: Think of this as the status quo, business as usual. As you go about your everyday activities, you may learn about a confluence of symptoms that is compelling. This discovery could arrive as formally as an incident ticket landing on your desk, or as subtly as noticing patterns or behaviors that, while you could not have anticipated them, you know just aren't right. Think of the latter as noticing a Splunk dashboard or glass table that seems abnormal. The point is that a dashboard or glass table is great for exposing simultaneous patterns and behaviors that individually may be innocuous but that together (combined with your technical background and knowledge of your systems) tell you something is worth investigating.
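For illustration only, here's the kind of SPL that might drive one panel of such a dashboard; the index, sourcetype, and field names are placeholders for whatever your environment actually uses:

    index=web sourcetype=access_combined status>=500
    | timechart span=5m count BY host

On its own, a blip of 5xx responses on one host may be innocuous; sitting next to panels for queue depth, response time, and login failures, it may be the pattern that catches your eye.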
Investigating: So you jump in. Clicking around, exploring the machine data. Pulling in additional evidence. Whatever it might be, you're spelunking now! There's no guidance for this issue since the root cause isn't yet known, so you're flexing your ninja skills and writing your best SPL. Eventually you'll discover the root cause and the specific symptoms that correlate with the issue, and you know you can save that SPL as an alert or some other type of monitoring.
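As a sketch of what that spelunking might look like in SPL (the index, fields, and time range below are hypothetical, not from any real investigation):

    index=app sourcetype=app_logs log_level=ERROR earliest=-4h
    | stats count BY component, message
    | sort - count
    | head 20

You keep narrowing: filter to the suspicious component, widen the time range, line results up against deploy events, until the root cause and its telltale symptoms fall out.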
Monitoring: In parallel to getting a fix going, you can craft your SPL into a clever search to notify you when the symptoms occur. With your scheduled search now in place, you can rest assured that should the issue present itself again, you'll be alerted. And when that happens, you have some instructions on what action to take so you can be attacking!
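As a rough sketch of what that scheduled alert could look like in savedsearches.conf (the stanza name, search string, cron schedule, threshold, and email address are all assumptions for illustration, and the same alert can be built entirely in Splunk Web):

    [Connection pool exhaustion detected]
    search = index=app sourcetype=app_logs log_level=ERROR "connection pool exhausted"
    dispatch.earliest_time = -15m
    dispatch.latest_time = now
    enableSched = 1
    cron_schedule = */15 * * * *
    alert_type = number of events
    alert_comparator = greater than
    alert_threshold = 0
    action.email = 1
    action.email.to = oncall@example.com
    action.email.subject = Connection pool exhaustion detected

The key is that the alert only fires when there's something to do, and the notification carries (or links to) the instructions for doing it.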
Attacking: You've created an actionable alert or even a scripted or automated response. Either way, when it's triggered, you're attacking. Now that you know the issue is occurring AND what the cause is, you can work to resolve it.
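Building on the hypothetical stanza above, an automated response might be as small as adding a webhook action that pokes whatever automation you trust to apply the workaround; the URL here is a placeholder:

    action.webhook = 1
    action.webhook.param.url = https://automation.example.com/hooks/restart-app-pool

Whether it's a human following a runbook or a script doing the restart, the point is the same: the trigger, the cause, and the response are all known, so the reaction can be immediate.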
Now let's say the same thing, but oriented by feature:
Dashboards and glass tables: Great for exposing a confluence of symptoms you never previously considered.
SPL: Great for investigating.
Scheduled searches and actionable alerts: Great for monitoring for known symptoms.
If someone is watching a dashboard for known symptoms, try a scheduled search. If there are alerts that are informational, try using a dashboard. If you're sharing SPL, save it as a report. And lastly, reserve ad hoc SPL for forensics.
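For the emailed-searches habit specifically, a saved report is just the same SPL stored once and shared with the team. A hedged sketch of what that could look like in savedsearches.conf (the name and search are placeholders), though "Save As > Report" in Splunk Web gets you the same result:

    [Web errors by host - last 24h]
    search = index=web sourcetype=access_combined status>=500 | stats count BY host
    dispatch.earliest_time = -24h
    dispatch.latest_time = now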
These are not ultimatums, but rather practices against which you can sanity-check your approach and align features with your current goal.
I'll close with a song that comes to mind, The Splunker by Kinnie Rojyrz:
"You got to know when to search 'em,
Know when to alert 'em,
Know when to dashboard,
And know when to run."