If there is anything that frustrates IT users, it is recurring issues that persist without a reasonable explanation of their cause or an effective, permanent resolution. Service disruptions, whether due to slow responsiveness or corrupted data, are an inevitable part of IT. However, when these issues become recurrent, they can leave a lasting bitter impression.
Just ask Microsoft, which experienced repeat issues with its Azure Front Door service on 30th July and 5th August this year. What differentiates mature IT teams from mediocre ones is that mature teams consistently focus on anticipating and preventing such issues before they happen.
Problem management, as defined by ITIL 4 guidance, is the practice responsible for reducing the likelihood and impact of incidents by identifying actual and potential causes, as well as managing workarounds and known errors.
The focus here is not on quickly restoring services to normal—that is the role of incident management. Instead, the emphasis is on investigating the root causes of incidents and implementing measures to contain or eliminate them, a process that may require more time.
The value of this practice comes from:
Problem management is carried out in 3 main phases:
Problems are identified using two different approaches:
This approach is a reaction to incidents that have already happened and involves investigating their symptoms and then unearthing their causes. The main drivers for reactive problem management are contributing to the resolution of open incidents and preventing their recurrence.
For example, repeated instances of API gateway timeouts will be investigated for possible network issues, misconfiguration, or unresponsive servers.
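One way to spot candidates for reactive problem management is to count incidents that share a service and symptom within a time window. The following is a minimal sketch, assuming a hypothetical `Incident` record, made-up sample data, and an arbitrary threshold of three occurrences in 30 days; real incident data and thresholds will vary by organization.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    # Hypothetical incident fields, for illustration only.
    service: str
    symptom: str
    opened_at: datetime


def recurring_symptoms(incidents, now, window_days=30, threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times in the window."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        (i.service, i.symptom) for i in incidents if i.opened_at >= cutoff
    )
    return sorted(pair for pair, n in counts.items() if n >= threshold)


# Made-up sample incidents.
now = datetime(2024, 8, 31)
incidents = [
    Incident("api-gateway", "timeout", datetime(2024, 8, 2)),
    Incident("api-gateway", "timeout", datetime(2024, 8, 9)),
    Incident("api-gateway", "timeout", datetime(2024, 8, 20)),
    Incident("billing", "corrupted data", datetime(2024, 8, 11)),
    Incident("api-gateway", "timeout", datetime(2024, 5, 1)),  # outside the window
]
candidates = recurring_symptoms(incidents, now)
print(candidates)
```

Each pair returned is only a candidate: a practitioner would still decide whether it deserves a problem record and an investigation into network issues, misconfiguration, or unresponsive servers.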
This approach aims to prevent incidents before they occur. It involves analyzing information to identify latent incident causes before they lead to a service disruption, drawing from:
For example, a vendor shares information about a newly discovered vulnerability, or developers unearth a bug while building the next feature update. Once the cause is identified, the risks are analyzed and a response is prepared to minimize the likelihood or impact of an incident.
Reactive problem management is the more common of the two problem identification techniques. However, as organizations mature in their problem management capability, it becomes more desirable to invest in proactive problem management. The challenge lies in quantifying the value to the business, as prevented incidents and intangible resolution actions can be difficult to measure.
Once a problem has been identified, the first step of problem control is to register it, ready for detailed investigation. A problem record is created in the organization’s chosen mechanism (spreadsheet, ticketing system, or case management tool) by a designated problem management practitioner.
Problems should be recorded separately from incidents since their focus is different and the timelines involved are much longer. The general information captured at this step includes:
Once the problem is registered, the main activity of problem control begins: root cause analysis. Information on the IT system and its underlying components is analyzed to trace each cause back to its own cause, until the root of the problem is unearthed. It is important to note that in some cases there may be several root causes rather than just one.
Apart from IT components, root cause analysis should also consider other factors such as:
According to ITSM.express guidance, problem solving is best done by a multidisciplinary team of technical and business experts, since no one person can have all the skills and information needed to look at a problem from multiple angles.
There are many techniques for conducting root cause analysis, and it is crucial that organizations train their tech teams on how to apply them and how to select the right technique for a given situation. The ITIL v3 Service Operation publication provides some guidance on selecting techniques, as seen below:
| Problem situation | Suggested analysis technique |
| --- | --- |
| Complex problems where a sequence of events needs to be assembled to determine exactly what happened | Chronological analysis, Technical observation post |
| Uncertainty over which problems should be addressed first | Pain value analysis, Brainstorming |
| Uncertain whether a presented root cause is truly the root cause | 5-Whys, Hypothesis testing |
| Intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment | Technical observation post, Kepner–Tregoe, Hypothesis testing, Brainstorming |
| Uncertainty over where to start for problems that appear to have multiple causes | Pareto analysis, Kepner–Tregoe, Ishikawa diagrams, Brainstorming |
| Struggling to identify the exact point of failure for a problem | Fault isolation, Ishikawa diagrams, Kepner–Tregoe, Affinity mapping, Brainstorming |
| Uncertain where to start when trying to find root cause | 5-Whys, Kepner–Tregoe, Brainstorming, Affinity mapping |
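Of the techniques listed above, Pareto analysis lends itself to a short illustration. The sketch below ranks cause categories by how often they appear and keeps the few that together account for a chosen share of incidents. The cause labels, the counts, and the 80% cutoff are made-up assumptions for illustration, not figures from any real dataset.

```python
from collections import Counter


def pareto(causes, cutoff=0.8):
    """Rank cause categories by frequency and return those that together
    account for at least `cutoff` of all incidents (the 'vital few')."""
    counts = Counter(causes)
    total = sum(counts.values())
    vital, cumulative = [], 0
    for cause, n in counts.most_common():
        cumulative += n
        # Record the cause, its count, and the cumulative share so far.
        vital.append((cause, n, round(cumulative / total, 2)))
        if cumulative / total >= cutoff:
            break
    return vital


# Made-up cause categories attached to 20 closed incidents.
causes = (
    ["misconfiguration"] * 9
    + ["network"] * 6
    + ["unresponsive server"] * 3
    + ["expired certificate"] * 2
)
for cause, count, share in pareto(causes):
    print(f"{cause:22} {count:3} {share:.0%}")
```

The output is a starting order for investigation rather than a verdict: a rare cause with severe impact may still deserve attention ahead of a frequent but minor one.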
When a problem has been analyzed but is yet to be resolved, it is given the status “known error”. If the investigation reveals that the root cause was already addressed during incident resolution, the problem record is closed at this point. However, if a short-term measure was applied to reduce the impact or likelihood of incident recurrence, it is recorded as a workaround.
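The status changes described above can be sketched as a small state model. This is an illustration only: the `ProblemRecord` shape, the `Status` names, and the sample root cause and workaround are hypothetical and are not taken from ITIL or from any particular ticketing tool.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    OPEN = "open"                # registered, under investigation
    KNOWN_ERROR = "known error"  # analyzed but not yet resolved
    CLOSED = "closed"


@dataclass
class ProblemRecord:
    # Hypothetical record shape, for illustration only.
    summary: str
    status: Status = Status.OPEN
    root_causes: List[str] = field(default_factory=list)
    workaround: Optional[str] = None

    def complete_analysis(self, root_causes, fixed_during_incident=False, workaround=None):
        """Apply the outcome of root cause analysis to the record."""
        if self.status is not Status.OPEN:
            raise ValueError(f"cannot analyze a problem in status {self.status.value!r}")
        self.root_causes = list(root_causes)
        if fixed_during_incident:
            # The root cause was already addressed while resolving the incident.
            self.status = Status.CLOSED
        else:
            self.status = Status.KNOWN_ERROR
            self.workaround = workaround  # short-term measure, if any


# Made-up example values.
problem = ProblemRecord("Recurring API gateway timeouts")
problem.complete_analysis(
    ["misconfigured upstream timeout"],
    workaround="route traffic to a standby gateway",
)
print(problem.status.value, "-", problem.workaround)
```

Keeping the workaround on the record itself means first-level support can find it from the known error without rereading the investigation notes.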
Workarounds are extremely useful in helping to resolve further incidents faster and should be properly documented and communicated to first-level support teams. One example of well-documented workarounds comes from AWS, which lists known issues and their associated workarounds for its IVS real-time streaming Android broadcast SDK.
The last phase of problem management is error control, in which the problem record is eventually closed after one of two options is applied:
The ideal scenario is one in which a permanent solution that eradicates the root cause is identified and implemented. This could involve a range of actions such as system reconfiguration, migration, replacement of modules, patching or upgrades, updates to policies, and enhancement of controls. Depending on the IT problem being addressed, several practices may need to be applied during the resolution:
Some organizations see fit to maintain permanent workarounds as their error control. This choice may be driven by budget, risk, legacy infrastructure, target architecture, vendor advice, and other considerations. However, relying on permanent workarounds to prevent incidents may inadvertently increase technical debt. Known errors should be reviewed regularly to check whether their context has changed in a way that allows a shift from workaround to permanent solution.
Problem management is a practice that many organizations struggle to prioritize, often overshadowed by the fast-paced demands of deploying features and restoring services. Its strength lies in helping IT functions evolve beyond a reactive, firefighting mode to more effective system design and maintenance. Achieving this transformation, however, requires a comprehensive and strategic approach.
To reach higher levels of maturity, leadership must actively promote proactive problem management and integrate its metrics into executive dashboards. Investments in upskilling the technology workforce, guided by frameworks such as SFIA, are essential. Additionally, organizations should deploy technologies that enhance problem investigation, including tools with observability and machine learning capabilities, to support more efficient and effective problem resolution.