When you notice a problem, do you solve the symptom that made you notice it, or do you try to understand, at the root, what caused it?
Root cause analysis (RCA) is the process of identifying the underlying causes of problems in order to prevent those problems from recurring. Instead of merely addressing symptoms, RCA focuses on resolving fundamental issues.
Because it uncovers root causes, RCA is valuable over both the short and the longer term: it mitigates immediate concerns and also prevents similar issues from re-emerging. This approach leads to sustainable solutions across various fields, including IT, manufacturing, and software development.
In this comprehensive article, we'll explore how to conduct RCA, its core principles, best practices, and the tools available to facilitate this process.
(See how to use Splunk ITSI for Root Cause Analysis.)
Splunk IT Service Intelligence (ITSI) is an AIOps, analytics and IT management solution that helps teams predict incidents before they impact customers.
Using AI and machine learning, ITSI correlates data collected from monitoring sources and delivers a single live view of relevant IT and business services, reducing alert noise and proactively preventing outages.
Here are a few ways RCA helps in the real world:
Your car consistently runs low on engine oil, seemingly too fast. Adding oil each time masks the issue, and you end up buying oil more often. Eventually, you take it to the mechanic who, after investigating the problem, discovers a faulty gasket or worn components. With the root cause fixed properly, you aren't dealing with low oil all the time.
Gatwick Airport, outside London, has a single runway that must accommodate up to 55 air traffic movements per hour. Using Splunk cloud solutions including RCA support, Gatwick's IT team has made air traffic control significantly more efficient, with fewer delays and less fallout. Indoors, the IT team identified efficiency improvements to streamline security processes. Result? 95% of passengers clear security in under 5 minutes!
In its shops in Japan and online, Niki Golf prides itself on delivering calm experiences for its customers. To shore up its cybersecurity, the company onboarded SIEM solutions from Splunk. The initial rollout was so successful that Niki Golf now uses these solutions to automate much of the root cause analysis process, too. Today, the company and its customers enjoy 75% faster incident response and 50% manpower savings.
In the finance industry, TransUnion provides consumer reports, risk scores, analytical services, and more for over 1 billion customers. Their IT operations and monitoring team uses Splunk
No matter what industry you're in or what problem you're trying to solve, analyzing the root cause is important for several reasons.
In many industries (IT, healthcare, security and cybersecurity, software development, manufacturing, financial services), one mistake can be costly. It doesn't matter whether the mistake is a bug in new software or downtime that forces an entire software system or website offline.
Incidents like these are resource-intensive to fix, result in wasted spend or lost revenue, and can damage your organization's reputation. (In industries like healthcare and finance, for example, the damage can actually result in human harm or real loss for individuals.)
Once you've fixed something temporarily, performing RCA helps to ensure that you can fix it permanently and that you won't keep dealing with this same error over and over again.
Identifying potential vulnerabilities in overperforming areas helps mitigate risks before they escalate. Understanding what contributes to success allows organizations to reinforce those elements while proactively addressing any weaknesses that could lead to setbacks.
Even in areas where a business is overperforming, it's unlikely that everything is working smoothly. Indeed, conducting RCA can help uncover underlying issues that are not immediately apparent.
For example, let's say your quarterly sales have lately been meeting or exceeding performance targets. There could still be (and probably are) inefficiencies or risks that, if unaddressed, could lead to future problems.
Engaging in RCA fosters a culture of continuous improvement. By analyzing successful outcomes, organizations can identify best practices that can be replicated in other areas.
This helps ensure that overperformance is not just a temporary spike but a sustainable activity.
Involving teams in the analysis of successful outcomes can enhance morale and motivation. It empowers employees by recognizing their contributions and encourages them to share their insights about what works well.
RCA can help businesses prepare for potential challenges by analyzing successful strategies and determining how they can be adapted or strengthened in the face of change. This foresight can be invaluable in a dynamic business environment.
Conducting an RCA involves a structured process that varies across industries. Here’s a basic framework to guide your analysis.
Begin by clearly defining the problem statement and its symptoms. This may include machinery or software malfunctions, process failures, or human errors.
Isolate contributing factors to contain the problem while investigating further. Involve key stakeholders in the problem definition process to gain multiple perspectives.
Ensure that the problem statement is specific, measurable, achievable, relevant, and time-bound (SMART) to provide clear direction for the analysis.
Compile comprehensive data, including:
This information helps establish a timeline of events and identifies adverse actions that led to the issue.
You should also gather quantitative data — such as performance metrics and production levels — to understand the scope of the problem better. Consider external factors that may have influenced the situation, such as market conditions or changes in regulations, to create a more holistic view of the circumstances surrounding the issue.
To identify the root cause, you can approach it in many ways (we'll talk more about these throughout the article):
Ultimately, validate potential root causes through data analysis and evidence to ensure accuracy before proceeding.
After identifying the root cause, propose and implement effective solutions. Develop an action plan that:
Monitor these solutions to ensure they address the underlying issue effectively.
Communicate the solutions to all stakeholders to ensure buy-in and adherence to the new processes. Lastly, schedule regular follow-ups to assess the effectiveness of the implemented solutions and make adjustments as necessary.
Thoroughly document the problem, analysis, and solutions. Include recommendations for future improvements to prevent recurrence.
Create a comprehensive report that details each step of the RCA process, including data collected, root causes identified, and actions taken. (This is often known as an incident review or postmortem.) You can make this documentation accessible to all relevant parties to facilitate knowledge sharing and continuous improvement.
Finally, establish a review process to evaluate the effectiveness of the documentation and update it as needed based on new findings or changing circumstances.
There are several tools and methodologies that can be useful for conducting RCA. Each of these tools offers unique advantages depending on the nature of the problem you're dealing with. Below are some of the most commonly used RCA techniques.
One of the most straightforward and widely used RCA tools is the 5 Whys method. This technique involves asking “why?” repeatedly — often five times — to get to the root cause of a problem.
The idea is similar to how children inquire deeply about a topic, but in this case, it’s applied systematically to uncover underlying issues. This tool works best for problems with a single root cause.
To use the 5 Whys technique:
The Pareto chart is a combination of a bar and line chart, particularly effective when a problem has multiple causes. The chart visually prioritizes these factors by displaying them as bars in descending order, with a line graph plotting the cumulative impact. It’s especially useful for identifying the most significant factors that contribute to defects in quality control or operations.
In practical terms, Pareto charts help you focus on the "vital few" causes of a problem, based on the Pareto Principle (80/20 rule), where 80% of the effects come from 20% of the causes.
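As a rough illustration, the data behind a Pareto chart can be computed in a few lines of Python. This is a minimal sketch, not a prescribed method: the defect categories and counts are invented for the example, and the function names `pareto_table` and `vital_few` are hypothetical helpers.

```python
from collections import Counter


def pareto_table(causes):
    """Return (cause, count, cumulative_percent) rows, largest cause first."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    running = 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, round(100 * running / total, 1)))
    return rows


def vital_few(rows, threshold=80.0):
    """Return the top causes needed to reach the cumulative threshold."""
    selected = []
    for cause, _count, cumulative in rows:
        selected.append(cause)
        if cumulative >= threshold:
            break
    return selected


# Hypothetical incident log: one entry per recorded defect.
defects = (
    ["config drift"] * 42
    + ["expired certificate"] * 38
    + ["disk full"] * 9
    + ["dns timeout"] * 7
    + ["other"] * 4
)

rows = pareto_table(defects)
for cause, count, cumulative in rows:
    print(f"{cause:<20} {count:>3} {cumulative:>6}%")

print("Vital few:", vital_few(rows))
```

The bars of the chart correspond to the counts and the line to the cumulative percentage. The 80% threshold is a convention rather than a rule, so it is exposed as a parameter.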
Change analysis or event analysis is another valuable method for RCA, particularly when a problem seems to occur after a specific event or change. This approach compares what happened before, during, and after an incident to determine what changed and why the problem occurred.
Steps for conducting change/event analysis:
This method is especially useful when you're dealing with complex systems where multiple variables interact and where a particular event is suspected to have triggered the issue.
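One simple way to start a change analysis is to list every change recorded shortly before the incident. The sketch below assumes a change log is available as a list of timestamped entries; the log contents, the 24-hour window, and the function name `changes_before` are all hypothetical choices for illustration.

```python
from datetime import datetime, timedelta


def changes_before(incident_time, change_log, window=timedelta(hours=24)):
    """Return changes made within `window` before the incident, most recent first."""
    start = incident_time - window
    recent = [c for c in change_log if start <= c["time"] <= incident_time]
    return sorted(recent, key=lambda c: c["time"], reverse=True)


# Hypothetical change log entries.
change_log = [
    {"time": datetime(2024, 3, 1, 9, 0), "what": "rotated TLS certificates"},
    {"time": datetime(2024, 3, 4, 14, 30), "what": "deployed checkout service v2"},
    {"time": datetime(2024, 3, 4, 16, 45), "what": "raised connection pool limit"},
    {"time": datetime(2024, 3, 5, 8, 0), "what": "patched load balancer"},
]

incident = datetime(2024, 3, 4, 17, 10)
suspects = changes_before(incident, change_log)
for change in suspects:
    print(change["time"].isoformat(), change["what"])
```

A change that falls inside the window is only a candidate. It still needs to be validated against evidence before it is treated as the root cause.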
A scatter diagram (or scatter plot) helps identify the relationship between two variables, which can clarify whether specific causes affect a problem. This technique uses data points plotted on a graph to check for patterns, often following work done with fishbone diagrams or the 5 Whys.
To create a scatter diagram:
If a clear pattern (like a line or curve) emerges, there is likely a correlation between the variables. If not, the relationship is probably weak or non-existent.
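Alongside the visual check, the strength of a straight-line relationship can be estimated with the Pearson correlation coefficient. The following is a minimal sketch using only the standard library; the two variables (queue depth and latency) and their values are invented for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return covariance / (spread_x * spread_y)


# Hypothetical measurements: suspected cause vs. observed effect.
queue_depth = [10, 20, 30, 40, 50]
latency_ms = [110, 205, 330, 390, 520]

r = pearson_r(queue_depth, latency_ms)
print(f"r = {r:.3f}")
```

Values near 1 or -1 suggest a strong linear relationship, and values near 0 suggest a weak one. This coefficient only measures straight-line patterns, so a curved relationship can still be visible on the plot while scoring low here, and a high score alone does not show that one variable causes the other.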
The Fishbone diagram, also known as the Ishikawa diagram, helps visualize the possible reasons behind a problem, making it easier to identify the root cause. Created by Professor Kaoru Ishikawa in the 1960s, this tool is recognized as one of the seven basic quality tools according to the American Society for Quality.
The diagram resembles a fish skeleton, hence its name! The head of the fish represents the problem, and the ribs illustrate categories of potential contributing factors. From each rib, smaller bones indicate possible causes within those categories, providing a structured approach to identifying the various elements that contribute to the issue.
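The head, ribs, and smaller bones map naturally onto a nested data structure, which can be handy for capturing a brainstorming session before drawing the diagram. This is only a sketch: the problem statement, category names, and causes below are hypothetical, and `render_fishbone` is an illustrative helper rather than part of any tool.

```python
def render_fishbone(problem, categories):
    """Render a fishbone outline as indented text: problem, categories, causes."""
    lines = [f"Problem: {problem}"]
    for category, causes in categories.items():
        lines.append(f"  {category}")
        for cause in causes:
            lines.append(f"    - {cause}")
    return "\n".join(lines)


# Hypothetical categories (ribs) and possible causes (smaller bones).
categories = {
    "People": ["on-call handover skipped", "unclear ownership"],
    "Process": ["no rollback checklist"],
    "Technology": ["stale cache configuration", "missing health check"],
}

outline = render_fishbone("Checkout page intermittently times out", categories)
print(outline)
```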
Root cause analysis (RCA) is a crucial part of improving business processes. A common approach to RCA is found in the Six Sigma methodology. Six Sigma focuses on making processes more efficient and effective by identifying and eliminating defects, minimizing variability, and improving overall consistency.
A key part of Six Sigma is the DMAIC framework, which is used to enhance existing business processes. The steps in DMAIC are Define, Measure, Analyze, Improve, and Control.
In the "Analyze" phase, Six Sigma uses several types of analysis, including source analysis, which involves a simple, perhaps simplistic, three-step RCA process:
Six Sigma techniques are widely used in areas like IT operations and software development. By applying these methods, organizations can identify the causes of system failures, high defect rates, missed deadlines, or other issues that affect product quality and customer satisfaction.
(Related reading: IT failure metrics.)
To conduct root cause analysis effectively, consider these best practices.
RCA should be grounded in data and evidence — not assumptions. Encourage team members to focus on facts, statistics, and historical data to ensure accurate results. Use relevant documentation, such as incident reports and performance metrics, to support findings.
Pro tip: Remind team members that assumptions can lead to misdiagnosis of the problem and ineffective solutions.
A single problem can have multiple root causes or contributing factors. Therefore, it’s important to examine all possibilities over a broad time frame. Utilize techniques like brainstorming sessions and mind mapping to generate a comprehensive list of potential causes. This approach helps in uncovering the true cause and avoids the oversight of less obvious factors that could be contributing to the issue.
Engaging with various stakeholders throughout the organization can also help identify different perspectives on the problem.
Include members from different departments and roles in the RCA process. This diversity ensures that varied perspectives and potential solutions are brought to the table. Diverse teams can challenge conventional thinking and generate more creative and effective outcomes. Additionally, involving team members from various levels of expertise can facilitate knowledge sharing and promote a deeper understanding of the issue at hand.
(Related reading: cybersecurity roles & DevOps roles.)
Effective brainstorming and problem-solving typically happen with small groups — ideally 5-10 people.
To facilitate productive discussions, consider using breakout sessions for larger teams or rotating members in and out for focused brainstorming efforts.
RCA should get more granular with each step of the analysis. Utilize each new piece of evidence to dive deeper into the problem. Employ tools such as the 5 Whys or Fishbone diagrams to encourage in-depth discussions. By systematically peeling back layers of the issue, teams can uncover the actual root cause and not just treat the symptoms. This thorough examination will lead to a more comprehensive understanding of the problem.
Many issues are often rooted in human error, and addressing them requires a non-punitive approach. Ensure that everyone understands that RCA is not about assigning blame but rather about solving the problem collaboratively. This culture of openness fosters full participation and honest feedback, enabling team members to share insights without fear of retribution. To reinforce this environment, leaders should model the desired behavior and communicate that the focus is on process improvement.
After completing RCA, the focus should shift to preventing recurrence. Document the findings clearly and create a detailed action plan that includes recommendations for process changes, training, and updated documentation. Adjust processes based on insights gained during the analysis, and provide necessary training to staff to minimize the likelihood of future issues. Furthermore, establish metrics to monitor the effectiveness of these preventive measures over time, ensuring that the problem does not reoccur.
Performance gaps aren't exactly the opposite of best practices, but they are among the challenges you may face in RCA, so they're worth understanding.
The term "performance gaps" refers to the discrepancies between actual performance and desired performance levels. These gaps often show up as productivity shortfalls, quality defects, missed deadlines, or customer dissatisfaction. Recognizing these gaps is a critical first step in conducting an effective root cause analysis. Here's how:
Identifying areas for improvement. These performance gaps not only signal where processes or outcomes are falling short but also highlight the specific issues that require investigation. By examining these gaps, organizations can effectively direct their RCA efforts toward the most important challenges that are affecting performance.
Driving the RCA process. The existence of a performance gap often catalyzes the RCA process. When a business identifies that its performance is not meeting established targets, it prompts a deeper examination to uncover the underlying root causes. This proactive approach not only addresses immediate deficiencies but also helps organizations avoid similar issues in the future.
Understanding contributing factors. Performance gaps frequently arise from various contributing factors, including inadequate training, resource limitations, or process inefficiencies. By analyzing these gaps through RCA, teams can pinpoint not just the root causes but also the broader issues that contribute to these discrepancies. This comprehensive understanding is crucial for developing effective and sustainable solutions.
Continuous improvement and monitoring. Addressing performance gaps through RCA leads to the implementation of corrective actions that resolve immediate problems while fostering a culture of continuous improvement. Furthermore, monitoring performance metrics after implementing these solutions ensures that they are effective and that gaps do not re-emerge over time.
(Related reading: continuous monitoring & continuous performance management.)
Once you've completed the root cause analysis, the next crucial step is to implement the necessary changes to prevent future issues. Ultimately, RCA isn’t only about fixing what’s broken; it’s also about ensuring continuous improvement and optimizing processes for long-term success. Here are some steps to take after your RCA is complete.
Accurate documentation is essential for ensuring that all stakeholders understand the issue, its root cause, and the implemented solution. This documentation can serve as a reference for future incidents, enabling teams to respond faster if a similar issue arises. Moreover, documenting lessons learned provides valuable insights that can improve decision-making and reduce the risk of repeating the same mistakes.
Often, RCA reveals weaknesses or inefficiencies in existing processes. Once the root cause has been identified, teams should review and adjust operational procedures to reflect new findings.
Process changes can range from minor tweaks to complete overhauls, depending on the severity of the issue. By improving the process, you not only fix the current problem but also reduce the likelihood of encountering similar issues down the line.
Human error is often a contributing factor to problems. After adjusting processes, it’s essential to ensure that all relevant team members receive the necessary training. Training ensures that employees understand new procedures and are equipped to prevent future errors. This step is vital for embedding improvements into the company culture and making sure everyone is aligned on how to avoid past mistakes.
After implementing corrective actions, continuous monitoring ensures that the solution is effective. Monitoring key performance indicators (KPIs) allows teams to spot early warning signs before issues escalate. Metrics should be chosen based on the root cause and its impact on the system. This proactive monitoring will help catch any recurring issues or new problems early on, ensuring sustained improvements over time.
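As one possible shape for such a check, the sketch below flags a KPI only when it stays above a threshold for several readings in a row, which avoids reacting to a single noisy sample. The metric (an error rate in percent), the threshold, and the function name `first_sustained_breach` are assumptions made for the example.

```python
def first_sustained_breach(samples, threshold, consecutive=3):
    """Return the index where `consecutive` readings in a row first exceed
    `threshold`, or None if that never happens."""
    run = 0
    for index, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return index
    return None


# Hypothetical error-rate readings (percent) collected after the fix.
error_rate = [0.4, 0.6, 1.2, 0.5, 1.1, 1.3, 1.6, 0.9]

breach_at = first_sustained_breach(error_rate, threshold=1.0)
if breach_at is None:
    print("No sustained breach; the fix appears to be holding.")
else:
    print(f"Sustained breach detected at sample {breach_at}.")
```

The threshold and the number of consecutive readings should follow from the root cause that was identified, in line with choosing metrics based on its impact on the system.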
RCA isn’t just for when things go wrong — it’s very valuable when things go right. Performing RCA on successful outcomes can help your team understand the underlying factors that contributed to the success, allowing you to replicate it across other areas. Here's why this is important:
To initiate RCA, you first need to recognize a problem. You can surface issues through:
Incorporating RCA into your workflow requires a structured approach, including selecting appropriate tools and methodologies that suit your organization's needs.
(Splunk can help your organization with RCA, with our industry-leading line of monitoring and observability solutions. Explore Splunk products and solutions.)
Root cause analysis is an essential process for uncovering why something went wrong — or why something worked well — in your infrastructure, whether that's the technology, people, or processes. Establishing an effective RCA process takes time and effort, but it'll pay off in more accurate and lasting problem resolution and create the conditions needed for your infrastructure to perform its best.