Whenever an IT service or component fails to perform as expected, or the users perceive that something isn’t right, there’s a reasonable expectation that the responsible IT team should quickly:
Failure to do so can result in negative impact for both the company and the people who use its services, sometimes with serious consequences.
Each year, approximately 10 to 20 high-profile IT outages or data center events globally cause serious or severe financial loss, business and customer disruption, reputational loss and, in extreme cases, loss of life. Consider AT&T: its network outage on 22 February 2024 affected many customers, including rescue services, and triggered an investigation by the FCC.
The ability to quickly and accurately detect service outages and degradation is priceless: how quickly can your teams recover and return to normal?
The ITIL® 4 service management framework defines an event as:
“An Event is a change of state in a service or associated component that has significance in its operation”
A subset of the ITIL monitoring and event management practice, event management focuses on those monitored changes of state defined by the organization as an “event”. The practice of event management, then, is all about:
Information about events is also recorded, stored and provided to relevant parties. Events are often used in tandem with logs, metrics, and traces: MELT.
Not everything is an event. IT monitoring is necessary for event management to take place; however, not all monitoring results in the detection of an event. Which changes of state are treated as events is determined by thresholds and other criteria.
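To make the distinction between monitoring and events concrete, here is a minimal Python sketch. It assumes a hypothetical two-state model ("normal" and "breached") and a single numeric threshold; the names `Event`, `state_for`, and `detect_events` are illustrative and do not come from ITIL or any particular monitoring tool. Every reading is monitored, but an event is produced only when the state changes.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Event:
    """A change of state worth recording (hypothetical record shape)."""
    component: str
    old_state: str
    new_state: str
    value: float


def state_for(value: float, threshold: float) -> str:
    """Map a raw reading to a coarse state (assumed two-state model)."""
    return "breached" if value >= threshold else "normal"


def detect_events(
    component: str, readings: Iterable[float], threshold: float
) -> Iterator[Event]:
    """Yield an Event only when the state differs from the previous reading."""
    previous: Optional[str] = None
    for value in readings:
        current = state_for(value, threshold)
        if previous is not None and current != previous:
            yield Event(component, previous, current, value)
        previous = current


# Five monitored readings, but only two changes of state.
cpu_readings = [42.0, 55.0, 91.0, 93.0, 60.0]
events = list(detect_events("web-01 cpu", cpu_readings, threshold=90.0))
for event in events:
    print(event)
```

In this sketch the readings 91.0 and 93.0 are both above the threshold, but only the first one marks a change of state, so only it produces an event.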
(Event management is a critical part of cybersecurity, including modern SIEM solutions. Learn how SIEM works.)
Changes of state for services and service components occur continuously in the IT environment.
Monitoring systems may generate alerts or system logs about the status of a service or component reaching a threshold or changing, for example:
To properly handle and respond to the different changes of state, it is necessary to filter and categorize the incoming information.
The ITIL 4 framework categorizes events as follows:
| Event category | Description | Examples |
| --- | --- | --- |
| Informational events 🟢 | They provide the status of a device or service or confirm the state of a task. They signify that regular operation is occurring. They do not require action at the time they are identified. | A user login completed; a transaction is successful |
| Warning events 🟡 | They signify that an unusual, but not exceptional, operation is occurring. They inform the appropriate team or tool to take necessary actions before any negative impact is experienced. | Backups not running; free disk space below 15% |
| Exception events 🔴 | They indicate that a critical threshold for a service or component metric has been reached. They may indicate that a service or component is experiencing a failure, performance degradation, or loss of functionality that impacts business operations. | Network port unreachable; error rate at 100%; unauthorized file access |
Event categorization focuses attention on the events that are truly significant for the management and delivery of IT services. It ensures that operational events are tracked, assessed, and managed appropriately.
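As a rough illustration of how thresholds can drive categorization, the Python sketch below sorts a free-disk-space reading into one of the three categories. The 15% warning threshold echoes the example in the table; the 5% exception threshold, the constant names, and the `categorize_free_disk` function are assumptions made for this example only.

```python
from enum import Enum


class Category(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTION = "exception"


# 15% mirrors the warning example in the table above.
# 5% is an assumed exception threshold, chosen only for illustration.
WARNING_BELOW_PCT = 15.0
EXCEPTION_BELOW_PCT = 5.0


def categorize_free_disk(free_pct: float) -> Category:
    """Categorize a free-disk-space percentage using the thresholds above."""
    if not 0.0 <= free_pct <= 100.0:
        raise ValueError(f"free_pct must be between 0 and 100, got {free_pct}")
    if free_pct < EXCEPTION_BELOW_PCT:
        return Category.EXCEPTION
    if free_pct < WARNING_BELOW_PCT:
        return Category.WARNING
    return Category.INFORMATIONAL


for reading in (40.0, 12.0, 3.0):
    print(reading, categorize_free_disk(reading).value)
```

Checking the stricter threshold first keeps the categories mutually exclusive: a reading below 5% is an exception, not both a warning and an exception.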
The configuration of alerts and their thresholds is a critical activity in supporting event categorization, especially when drawing the fine line between warning and exception events. For instance:
Setting up a standard classification scheme for events allows a common set of actions to be established for each grouping, which in turn helps different IT teams coordinate their responses.
(Related reading: adaptive thresholding.)
An alerting system should be characterized by:
As IT environments grow in scale and complexity, running multiple alerting systems can lead to “over-alerting”, where more alerts are generated than IT can handle. Truly significant alerts may then be lost in the “alert noise”.
Tools with built-in AIOps (artificial intelligence for IT operations) and machine learning (ML) capabilities can mitigate this risk by aggregating, correlating, and filtering the many alerts.
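The aggregation idea can be sketched without any ML at all. The Python example below is a simple rule-based stand-in, not an AIOps implementation: it folds repeated alerts from the same source and check into one group when they arrive within a time window. The `Alert` and `AlertGroup` shapes, the `aggregate` function, and the 300-second window are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since an arbitrary start
    source: str
    check: str


@dataclass
class AlertGroup:
    source: str
    check: str
    first_seen: float
    last_seen: float
    count: int = 1


def aggregate(alerts: Iterable[Alert], window_seconds: float = 300.0) -> List[AlertGroup]:
    """Group alerts sharing (source, check) when each one arrives within
    window_seconds of the previous alert in the same group."""
    latest: Dict[Tuple[str, str], AlertGroup] = {}
    groups: List[AlertGroup] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.source, alert.check)
        group = latest.get(key)
        if group is not None and alert.timestamp - group.last_seen <= window_seconds:
            group.last_seen = alert.timestamp
            group.count += 1
        else:
            # Either the first alert for this key or the window has lapsed.
            group = AlertGroup(alert.source, alert.check, alert.timestamp, alert.timestamp)
            latest[key] = group
            groups.append(group)
    return groups


raw_alerts = [
    Alert(0, "db-01", "disk"),
    Alert(60, "db-01", "disk"),
    Alert(120, "db-01", "disk"),
    Alert(130, "web-02", "latency"),
    Alert(1000, "db-01", "disk"),
]
groups = aggregate(raw_alerts)
for g in groups:
    print(g.source, g.check, g.count)
```

Here five raw alerts collapse into three groups. Real correlation engines consider far more context (topology, service dependencies, learned patterns), so treat this only as a picture of the basic idea.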
The event handling process consists of the following activities:
Detection of events is primarily conducted through monitoring systems, where event information is queried or received from:
Once a change of state crosses pre-set thresholds and criteria related to system and transaction status, an event is generated, which the monitoring system parses in readiness for processing.
(Related reading: application monitoring, infrastructure monitoring & endpoint monitoring.)
Logging involves the generation of the event record in the monitoring system, in order to serve as the information reference point for handling. The record will generally include:
This step, filtering and correlation, is iterative in nature. It involves analyzing the event record alongside other related records and information, with a view to informing the next course of action.
(Related reading: IT event correlation.)
Here, the analyzed event is grouped according to agreed criteria (such as priority or type) in readiness for response. The classification is informed by the categories described earlier, as well as the operational context of the organization.
Based on agreed rules and plans, a pre-defined event response is then chosen. In an automated setup, the response is designed to be triggered by the selection, either:
Finally, the response is communicated to the relevant teams or stakeholders for implementation. Notifications can be sent out via common communication channels such as email, text, collaboration tools, or social media channels.
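The response selection and notification steps described above can be sketched as a small rule table in Python. Everything named here is hypothetical: the action names, the channel lists, the fallback rule, and the `select_response` and `notify` functions are assumptions for illustration, and the `send` callback stands in for whatever email, text, or chat integration an organization actually uses.

```python
from typing import Callable, Dict, NamedTuple, Tuple


class Response(NamedTuple):
    action: str
    channels: Tuple[str, ...]


# Hypothetical pre-defined responses, keyed by event category.
RESPONSE_RULES: Dict[str, Response] = {
    "informational": Response("log_only", ()),
    "warning": Response("open_ticket", ("email",)),
    "exception": Response("raise_incident", ("email", "text", "chat")),
}

# Assumed fallback for categories with no agreed rule.
DEFAULT_RESPONSE = Response("manual_review", ("email",))


def select_response(category: str) -> Response:
    """Pick the pre-defined response for a category, or the fallback."""
    return RESPONSE_RULES.get(category, DEFAULT_RESPONSE)


def notify(summary: str, response: Response, send: Callable[[str, str], None]) -> int:
    """Send one message per channel and return how many were sent."""
    for channel in response.channels:
        send(channel, f"[{response.action}] {summary}")
    return len(response.channels)


# Collect messages in a list instead of contacting real channels.
sent = []
response = select_response("exception")
count = notify(
    "Network port unreachable on sw-07",
    response,
    lambda channel, message: sent.append((channel, message)),
)
print(count, sent)
```

Keeping the rules in a plain table makes them easy to review with the teams who own each response, which matters more than the code itself.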
The response can involve service actions such as:
One of the critical success factors for event management is ensuring that events are detected, interpreted, and if needed acted upon as quickly as possible.
Because warning and exception events could foreshadow a service outage or degradation, sharing the right event information with the appropriate people or technology is crucial to enabling preventive or corrective action.
As part of event analytics, some related metrics you should regularly measure and review include:
Improvement actions to reduce the occurrence of errors, noise, and associated incidents should be tied directly to these metrics. Additionally, encourage regular reviews of tools and procedures to identify opportunities for improving event management.
Fine-tuning correlation mechanisms, filtering rules, and thresholds should be common practice. It keeps the IT monitoring tools optimized so that event detection, filtering, and correlation support the objectives of the event management practice.