Making Sense of Root Cause Analysis


If we are lucky enough to identify a common process failure related to a specific failure mode, then RCA will have a benefit. This leads us to identifying a common cause for multiple failures, which is the third item in the tabled list. By systematically analyzing multiple failures, patterns of common cause may be identified, leading to a single fix in a requirements, design, or coding process that eliminates multiple faults with one change. A secondary consequence of this item is that RCA of single failures is self-defeating, as patterns will not be apparent until multiple failures are analyzed and common causes identified. If you go back to one of the original papers on Defect Prevention (search for "Defect Prevention"), you'll find that the RCA process involves collecting data from multiple failures and analyzing them as a group.
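To make the point about analyzing failures as a group concrete, here is a minimal sketch in Python. The failure IDs, the cause categories, and the helper name common_causes are all hypothetical, invented for illustration; a real team would substitute its own failure records and classification scheme.

```python
from collections import Counter

# Hypothetical failure records: (failure id, root-cause category assigned
# during analysis). Both the ids and the categories are made up for this sketch.
failures = [
    ("F-101", "ambiguous requirement"),
    ("F-102", "missing input validation"),
    ("F-103", "ambiguous requirement"),
    ("F-104", "interrupted during coding"),
    ("F-105", "ambiguous requirement"),
    ("F-106", "missing input validation"),
]


def common_causes(records, min_count=2):
    """Return (cause, count) pairs seen in at least min_count failures,
    most frequent first."""
    counts = Counter(cause for _, cause in records)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


for cause, n in common_causes(failures):
    print(f"{cause}: {n} failures")
```

Run against a single record, the same function returns an empty list, which is the self-defeating case described above: one failure on its own cannot reveal a pattern.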

The second item, intellectual vs. physical, is one of the reasons the first and fourth items present their difficulties. Metal fatigue, for example, can be attributed to specific causes that, once eliminated, ensure these failures will not be repeated. The human mind, however, is not so accommodating. If we look at some of the reasons errors get into software, such as communications loss, noisy work environment, multi-tasking impact on short-term memory, etc., we need to address the sociological aspects of our profession rather than mechanical or chemical aspects. How many of us work an entire eight-hour day without interruptions? How often do you start a one-hour task in the morning and find that at day's end you have not finished it, and that the next day, as you start over, you've forgotten a critical aspect of the program that was your next task the day before?

Item six has multiple consequences. The first is that, in software development, finding the real root cause usually requires the person who made the mistake to be involved in the analysis. If there is fear of retribution (scapegoating), the incentive to identify the root cause is eliminated. The second issue is the time relationship between discovery of the root cause and the chance to prevent the problem in the next development cycle.

I came across an organization that was doing RCA on production failures with the expectation of significantly improving quality. Their typical release schedules were twelve to eighteen months. This meant problems found in the requirements or early design activities and eliminated from the next release cycle would have a twelve- to eighteen-month delay before showing up as improved quality in the next release—not what they wanted. To be effective, the time delay between the error introduction, discovery, root cause analysis, repeat of the activity that introduces the error, and impact on the next development or production cycle should be as short as possible.

The third issue with item six is "what to do with the result" of the RCA. For major catastrophes, finding the scapegoat is often the real reason behind RCA, as lawyers and victims get in line for compensation. In software, we are looking to prevent future occurrences, which means we need to change something: process, development environment, work environment, etc. Change means some effort or cost will be incurred to make the change. If this cost is not budgeted, how will it happen? All too often this falls into the "now a miracle occurs" part of the plan (or lack of plan) for preventive action. Whether the preventive action is as simple as a checklist update or as complex as changing the development environment or process, allocating some budget for this endeavor is mandatory. Telling your development teams to do something in zero time or at zero cost sends a message that the activity isn't worth much.
