The 1.7 release of the Splunk App for Content Packs comes with a slew of new awesomeness for the Content Pack for ITSI Monitoring and Alerting, designed to bolster your IT operations team’s visibility and AIOps posture! Previous versions of the content pack focused on making it easy for you to create and group Notable Events from ITSI Services and third-party monitoring tools. This new version adds analytics on those alerts and episodes, giving IT operations teams intelligent, comprehensive visibility to help answer challenging questions such as:
Read on to learn more about the key enhancements and features we’ve created in this version of the Content Pack for ITSI Monitoring and Alerting.
The content pack now ships with a prebuilt service tree which proactively monitors incoming alert volumes as well as episode creation volumes giving you at-a-glance visibility into the overall health of systems being monitored. As incoming alert volumes and episode creations rise, the other KPIs within the service tree allow you to slice and dice these increased volumes across several key dimensions. This helps users to quickly triage what may be causing the elevated alert levels. For instance, in the image above, we are viewing which alert signatures are contributing to the increase in alerts, and we can clearly see that the “Automation Agent Status” check has risen suddenly and is producing a large volume of the incoming alerts. The purpose of each service is described below:
| Service Name | Service Purpose |
| --- | --- |
| ITSI Event Analytics Service | This is the parent service of the other two services and serves as the top-level node of the alert and episode monitoring service tree. |
| ITSI Alert Analytics Service | This service tracks incoming alerts and changes to critical status when the volume of incoming alerts rises significantly higher than historical baselines. The service also splits incoming alerts by several key fields to help operations teams quickly identify what values may be contributing to the incoming alert volume. An included ITSI Alert Analytics Template supports greater customization of the ITSI Event Analytics service tree. |
| ITSI Episode Analytics Service | This service tracks newly-created and open episodes. It changes to critical status when the volume of newly-created episodes rises significantly higher than historical baselines, or when the number of open critical episodes rises significantly higher than historical baselines. The service also splits episodes by several key fields to help operations teams quickly identify what values may be contributing to the episode volume. An included ITSI Episode Analytics Template supports greater customization of the ITSI Event Analytics service tree. |
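To make "significantly higher than historical baselines" concrete, here is a minimal Python sketch of one way to compare a current alert count with its history. It is purely illustrative and is not how ITSI computes KPI severity: the function name `storm_status`, the z-score approach, and the `high_sigma` and `critical_sigma` cutoffs are all assumptions made for this example.

```python
from statistics import mean, pstdev


def storm_status(current_count, historical_counts, high_sigma=2.0, critical_sigma=3.0):
    """Classify the current alert volume against a historical baseline.

    Hypothetical helper: the sigma cutoffs are illustrative assumptions,
    not values taken from the content pack.
    """
    baseline = mean(historical_counts)
    spread = pstdev(historical_counts)

    if spread == 0:
        # Perfectly flat history: treat any rise above the baseline as critical.
        return "critical" if current_count > baseline else "normal"

    # Number of standard deviations the current count sits above the baseline.
    z_score = (current_count - baseline) / spread
    if z_score >= critical_sigma:
        return "critical"
    if z_score >= high_sigma:
        return "high"
    return "normal"


# Example: hourly alert counts from earlier periods versus the current hour.
history = [100, 110, 90, 105, 95]
print(storm_status(400, history))
print(storm_status(102, history))
```

In practice the baseline would come from the same hour and day of week in earlier periods, so that normal daily and weekly rhythms are not mistaken for a storm.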
The Alert Storm Detection and Episode Storm Detection KPIs are solely responsible for the detection of alert and episode storms. When these KPIs rise to high and critical status, as seen in the image below, the system proactively identifies and alerts the IT operations team about an active Alert Storm via the action rules of a new aggregation policy called ITSI Alert and Episode Monitoring.
The ITSI Alert and Episode Monitoring aggregation policy was built to provide a rich triage experience for active Alert Storms, as seen in the image below. Within the ITSI Alert and Episode Storm Activity saved episode view, IT operations teams can see heads-up metrics about incoming alerts via the customized episode view dashboard. Active storms appear as new episodes within this view. To triage the cause of a storm further, click the ITSI Alert and Episode Storm Activity detected episode.
After clicking into the active Alert and Episode Storm episode, analysts can open the episode dashboard tab to review the out-of-the-box dashboard that was built specifically to expedite triage of the alert storm cause, as seen below. In this case, the pattern matching ML algorithms have identified a high concentration of “Automation Agent Status” alerts, which make up just over 43% of all alerts in this alert storm. The “Probable Cause of Storm” pattern matching algorithm identifies and surfaces tightly packed clusters of related alerts regardless of how these alerts are being grouped into episodes by aggregation policy rules. You can think of this functionality as “Smart Mode on the Fly!”
If even deeper triage is necessary to identify the cause of the storm, the panels in the episode dashboard drill down to out-of-the-box Splunk dashboards that provide maximum visibility into the alerts and field value distributions in the alert storm. Within these dashboards, unusual and lopsided distributions of field values are easy to spot and will help you focus your subsequent investigation. For instance, in the dashboard below it’s clear that the alerts are heavily focused around “IL-DC-2”, “Automation Agent Status”, and “Website Revenue”.
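The idea of a lopsided field value distribution can be sketched in a few lines of Python. This is a simplified illustration and not the content pack's dashboard logic or its pattern matching algorithm: the function name `field_value_shares`, the field names `signature` and `datacenter`, and the sample alerts are assumptions invented for the example.

```python
from collections import Counter


def field_value_shares(alerts, fields):
    """Return, for each field, its values ranked by share of all alerts.

    Hypothetical helper: `alerts` is a list of dicts, and alerts missing
    a field are counted under the value "unknown".
    """
    shares = {}
    for field in fields:
        counts = Counter(alert.get(field, "unknown") for alert in alerts)
        total = sum(counts.values())
        # most_common() sorts by count, so dominant values come first.
        shares[field] = [(value, count / total) for value, count in counts.most_common()]
    return shares


# Made-up sample: 100 alerts, dominated by one signature and one data center.
sample_alerts = (
    [{"signature": "Automation Agent Status", "datacenter": "IL-DC-2"}] * 43
    + [{"signature": "CPU High", "datacenter": "IL-DC-2"}] * 30
    + [{"signature": "Disk Full", "datacenter": "NY-DC-1"}] * 27
)

for field, ranked in field_value_shares(sample_alerts, ["signature", "datacenter"]).items():
    top_value, top_share = ranked[0]
    print(f"{field}: {top_value} accounts for {top_share:.0%} of alerts")
```

A single value holding a large share of one field is the kind of signal worth chasing first, which is why ranking values by share is a useful starting point for triage.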
Pro Tip: While these dashboards were intended to help triage the cause of alert storms, they are extremely useful for analyzing alert trends over any window of time. Curious how the volume of incoming alerts has been trending over the last 30 days? The ITSI Alert and Episode Volume Trend Analysis dashboard will provide that information with incredible fidelity. Curious which alerts are most commonly occurring in your environment over the last 7 days? The ITSI Alert and Episode Field Values Analysis dashboard can help answer that question too.
In addition to the real-time alert and episode analysis above, IT operations teams and Operations Center managers need to perform historical analysis of incoming alerts and track their team's SLAs. To facilitate this analysis, the content pack ships with the Event and Incident Operations Posture dashboard, as seen below. The dashboard helps IT operations managers answer critical questions such as:
While the alert and episode analytics functionality above steals the show, we have packed several other goodies into this version of the content pack that we’re sure you’ll love. See the Release Notes for a full list of all the new features and bug fixes.
To get this new functionality, head over to Splunkbase for version 1.7 of the Splunk App for Content Packs, and be sure to review our latest documentation for the Content Pack for ITSI Monitoring and Alerting to see how to upgrade to and configure the new functionality. Also, stay tuned for several upcoming Tech Talks where we will review and demo this functionality live.
Enjoy the new features, and as always, Happy Splunking!