Root cause analysis, or RCA, is the process of identifying the cause of a problem so measures can be taken to prevent that problem from happening again. RCA assumes it’s more effective to resolve problems by addressing the underlying cause rather than just the symptoms.
For a real-world illustration, imagine you notice that your car has consistently low engine oil. You can respond by just adding more oil whenever the levels dip, which will keep your engine lubricated and prevent wear from friction and heat. But you would just be treating the symptom — spending a lot of time and money in the process to keep your oil levels topped off — because the oil would inevitably run low again. Alternatively, you could take the car to a mechanic who could investigate many possible issues — a leak from a bad gasket, high oil consumption due to worn engine components, etc. — to identify the root cause. In this case, getting to the root cause of the problem fixes your engine so you won’t run low on oil again.
Every industry can use RCA, but it’s especially helpful in IT. RCA provides a systematic analysis process to identify problems within complex modern infrastructures accurately and quickly. It can also help with risk management and significantly reduce costs by helping teams identify the root of a problem before it has a domino effect on the system. RCA is so effective that it is mandated in many industries.
In the following sections, we’ll look at how to conduct a root cause analysis, outline principles and best practices to follow, and tell you how to get started with RCA in your IT environment.
There is no single way to identify a problem’s root cause, and the process will vary across industries and organizations. In the context of software projects, RCA is usually conducted by a dedicated RCA team composed of personnel who are familiar with the problem and led by an RCA manager. This function is also sometimes called “incident response” and root cause analyses are then conducted as part of a post-incident review.
A basic framework includes the following steps:
The three steps to root cause analysis are contained in a process known as the Six Sigma approach to quality management.
Six Sigma is a popular methodology for making business processes more effective and efficient, aiming to improve quality by finding defects, determining their cause, and improving processes to minimize the variability and increase overall consistency.
Six Sigma uses data-driven analysis methods and systematic approaches to meet improvement goals. One of these is a framework called DMAIC, used to improve existing business processes. Each letter stands for a step in the framework: Define, Measure, Analyze, Improve and Control.
In the “analyze” phase, Six Sigma employs five specific types of analyses to promote project goals: source, process, data, resource and communication analysis. Of these, source analysis attempts to find defects using a three-step RCA process:
Six Sigma can be used to improve ITOps and software development processes. Its tools and techniques can help identify the reasons for system failures, high defect rates, missed deadlines or any other problems that impact product quality, system performance and customer satisfaction.
Effective root cause analysis is guided by several core principles, most of which are reflected in the process steps outlined earlier, including the following:
RCA is a holistic approach to problem solving that should strive not just to discover the root cause, but also to provide enough factual context to suggest effective corrective action.
The Fishbone diagram is a cause-and-effect diagram used to visualize the potential reasons behind a problem and help determine its root cause. Created in the 1960s by University of Tokyo professor Kaoru Ishikawa, the model is also known as the Ishikawa diagram, and it is considered one of the seven basic quality tools, per the American Society for Quality.
As its name suggests, the diagram depicts a fish skeleton lying on its side. The head, positioned on the right, represents the problem, while the ribs extending off its spine represent categories of contributing factors. Bones extending from each of the ribs denote possible causes or causal factors within that category.
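If you want to capture a Fishbone diagram in a form you can store alongside an incident record, a small data structure is enough. The sketch below is one possible representation, not a standard format: the `Fishbone` class, its method names, and the example problem and causes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Fishbone:
    """The problem (head) plus categories (ribs) mapped to possible causes (bones)."""

    problem: str
    categories: dict[str, list[str]] = field(default_factory=dict)

    def add_cause(self, category: str, cause: str) -> None:
        # Create the rib on first use, then attach the bone to it.
        self.categories.setdefault(category, []).append(cause)

    def render(self) -> str:
        # Plain-text outline: the head first, then each rib with its bones.
        lines = [f"Problem: {self.problem}"]
        for category, causes in self.categories.items():
            lines.append(f"  {category}")
            lines.extend(f"    - {cause}" for cause in causes)
        return "\n".join(lines)


# Hypothetical example data for illustration only.
diagram = Fishbone("Checkout API latency spike")
diagram.add_cause("Infrastructure", "Database connection pool exhausted")
diagram.add_cause("Infrastructure", "Node running out of memory")
diagram.add_cause("Process", "Deploy skipped load testing")
print(diagram.render())
```

Keeping categories as a mapping mirrors the diagram itself: each rib holds its own list of bones, so a team can add causes during brainstorming without deciding their order up front.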
The Fishbone diagram follows a four-step process:
For a Fishbone diagram to be effective, follow these best practices:
In addition to the Fishbone diagram, there are a variety of other tools you can use to conduct root cause analysis. Each tool has specific benefits that make it more or less suited to a particular situation. Some of the more popular include:
The 5 Whys: One of the most commonly used tools for conducting an RCA is the 5 Whys method. As the name suggests, it borrows the inquisitive approach of young children by encouraging you to ask “Why?” again each time a question is answered, until you reach the root cause of a problem. It’s called “5 Whys” because it typically takes about five rounds of questioning to identify the root of a problem, although it can take more or fewer depending on the issue. This tool is best used for problems with a single root cause.
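The chain of questions can be recorded as simple data, which makes it easy to attach to a post-incident review. The following is a minimal sketch under stated assumptions: the `five_whys` function and the outage scenario are hypothetical, and the last answer is treated only as a candidate root cause for the team to confirm.

```python
def five_whys(problem, answers):
    """Pair each answer with the "why" it responds to.

    Returns (chain, candidate_root_cause), where chain is a list of
    (question, answer) tuples and the candidate is simply the last answer.
    """
    if not answers:
        raise ValueError("at least one answer is needed")
    chain = []
    current = problem
    for answer in answers:
        chain.append((f"Why: {current}?", answer))
        current = answer  # the next "why" is asked about this answer
    return chain, answers[-1]


# Hypothetical incident for illustration only.
problem = "The web service was unavailable"
answers = [
    "The application server ran out of disk space",
    "Log files filled the disk",
    "Log rotation was not running",
    "The rotation job was removed during a migration",
    "The migration checklist did not cover scheduled jobs",
]

chain, root_cause = five_whys(problem, answers)
for question, answer in chain:
    print(question)
    print(f"  Because: {answer}")
print(f"Candidate root cause: {root_cause}")
```

The function accepts any number of answers, since the method can take more or fewer than five rounds.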
To use the 5 Whys technique, follow these steps:
Pareto charts: A Pareto chart is a combined bar and line chart, good for identifying the most significant factors when a problem has multiple causes. Factors are displayed as bars arranged in descending order and a line graph plots cumulative totals of each factor from left to right. In quality control, a Pareto chart is commonly used to identify the most common sources of defects or the most commonly occurring type of defect.
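The numbers behind a Pareto chart are straightforward to compute: count each factor, sort in descending order, and keep a running cumulative percentage. Here is a minimal sketch using only the Python standard library. The `pareto_table` function and the defect categories are hypothetical, and the sketch produces the table of values rather than drawing the chart.

```python
from collections import Counter


def pareto_table(defects):
    """Return rows of (category, count, cumulative percent), largest count first."""
    counts = Counter(defects)
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in counts.most_common():  # descending by count
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


# Hypothetical defect log for illustration only.
defects = (
    ["timeout"] * 12
    + ["config error"] * 5
    + ["disk full"] * 2
    + ["bad deploy"] * 1
)

for category, count, cumulative in pareto_table(defects):
    print(f"{category:<14}{count:>4}{cumulative:>8}%")
```

The counts become the bars and the cumulative percentages become the line. Reading down the cumulative column shows how few categories account for most of the defects.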
Scatter diagram: A scatter diagram, also called a scatter plot, uses pairs of data points and regression analysis to determine relationships between variables. It’s often used to graphically depict and test multiple potential causes uncovered through Fishbone diagrams or the 5 Whys method to see which ones have an impact on the problem.
To make a scatter diagram, choose an independent variable (the potential cause) and a dependent variable (the problem). Then observe the process to gather the measurement data used to generate the diagram. Once you have your data table, plot the independent variable on the x-axis and the dependent variable on the y-axis. If the points form a clear line or curve, the two variables are correlated: positively if the points rise from left to right, negatively if they fall. If the points form no clear pattern, there is no apparent correlation between that cause and the problem you’re trying to solve.
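Alongside the visual check, you can put a number on the relationship. The sketch below computes the Pearson correlation coefficient, which is one common choice and not something the scatter diagram itself prescribes. The `pearson_r` function and the measurements are hypothetical. Note that Pearson's r only measures linear relationships, so a clear curve on the diagram can still produce a modest value.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical measurements: potential cause on x, problem on y.
concurrent_users = [10, 20, 30, 40, 50]
response_ms = [120, 150, 210, 260, 330]

r = pearson_r(concurrent_users, response_ms)
print(f"r = {r:.3f}")
```

A value near 1 suggests a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little or no linear relationship. Even a strong correlation does not by itself prove that the suspected cause produces the problem.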
Some best practices for root cause analysis include:
After completing a root cause analysis, the final step is to implement preventive action. This involves determining which documents should be updated, which processes need to be modified, who needs new training or retraining, and other considerations. Much of this will be determined by the RCA. The goal is preventive action that ensures the resolved problem does not recur.
Root cause analysis is essentially a form of problem solving, so to get started you first have to know that there’s a problem. Fortunately, developers and ITOps teams already have a few ways of surfacing issues in place:
Each of these can alert you to infrastructure issues and provide the data you need to perform a systematic root cause analysis. To take advantage of that data, you’ll need a tool that can provide real-time visibility into your network, capture the data, and help you make sense of it. Monitoring and observability tools of this kind use machine learning to interpret and correlate events from the device logs and reports produced by your infrastructure. Using these insights as part of root cause analysis can help you develop more effective solutions in less time.
Root cause analysis is an essential process for uncovering why something went wrong — and even why something worked well — in your infrastructure. Establishing an effective RCA process takes time and effort, but it will pay off in more accurate and lasting problem resolution and create the conditions needed for your infrastructure to perform its best.
This posting does not necessarily represent Splunk's position, strategies or opinion.