A metrics program is any planned activity in which you use measurement to meet some specific goal. If you do not have a clear technical goal for a metrics program, then you are almost certainly not ready for such a program. Here's how to design a measurement program that leads to decisions and actions.
If your organization makes a conscious effort to record information about its software defects, then you may be pleasantly surprised to know that you already have in place the basics of a metrics program. However, it is likely that you are either not making much use of the information or, even worse, getting misleading results from it. First, let's take a look at the use of fault data in data-collection programs.
Starting Point: Distinguishing Between Faults and Failures
Typically, you might see the kind of data shown in the first two columns of Table 1, based on a real sample of modules from a major system.
In this case, the development organization was quite rigorous in its approach to recording defects. Every defect discovered during independent testing and in operation was traced to a specific software module. The organization wanted to identify the problem modules among the hundreds in the system. The raw defect data (column 2) suggests that modules Q and L are the problem modules. In many cases this is as far as your data will allow you to go. Yet looking a bit deeper reveals a very different story. First, recovering the module size data in thousands of lines of code (KLOC) and taking it into account (columns 3 and 4) immediately explains the problem with module Q: it's big. We might now conclude that A and L are the problem modules because they have the most defects per line of code. However, when the defects are split between those discovered by testers pre-release (column 5) and those that caused customer-reported problems post-release, the picture is completely different. The problem modules post-release are actually C and P.
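To see how each view changes the ranking, here is a minimal sketch in Python. The module records are made-up illustrative numbers, not the actual values from Table 1, but the structure (total defects, size in KLOC, and the pre-/post-release split) mirrors the table:

```python
# Illustrative module records (made-up numbers, not the values from Table 1):
# (module, total defects, size in KLOC, pre-release defects, post-release defects)
modules = [
    ("A", 20,  1.0, 19,  1),
    ("C", 12,  3.0,  2, 10),
    ("L", 28,  1.5, 27,  1),
    ("P", 10,  2.0,  1,  9),
    ("Q", 70, 25.0, 65,  5),
]

def ranked(metric):
    """Return module names ordered worst-first by the given metric."""
    return [m[0] for m in sorted(modules, key=metric, reverse=True)]

# Raw defect counts point at the big modules (Q and L here).
print("by raw defect count:    ", ranked(lambda m: m[1]))

# Normalizing by size (defects per KLOC) tells a different story (A and L).
print("by defect density:      ", ranked(lambda m: m[1] / m[2]))

# Splitting out post-release defects changes the picture again:
# the modules that actually hurt customers are C and P.
print("by post-release defects:", ranked(lambda m: m[4]))
```

The point is not the particular numbers but that each normalization or split can completely reorder the list of "worst" modules.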
Now most people who collect fault data prior to release are really interested in using it to predict the number of post-release failures. With these assumptions, Table 1 highlights just how poor pre-release fault data is at predicting post-release failures at the module level. There is now very good empirical evidence that pre-release faults are a bad predictor of post-release faults. Figure 1 shows the results of a recent study of a major telecommunications software system. The modules with the most faults pre-release generally had very few post-release. Conversely, the genuinely failure-prone modules generally had low numbers of pre-release faults. For this system, some 80% of pre-release faults occurred in modules which had no post-release faults at all. Similar results have been observed for other systems.
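If you record both counts per module, you can check this claim directly on your own data. The sketch below uses a Spearman rank correlation (via scipy, which is assumed to be available); the fault counts are invented for illustration:

```python
# Do modules with many pre-release faults also have many post-release faults?
# A rank correlation near zero (or negative) means pre-release counts are a
# poor predictor of post-release faults at the module level.
from scipy.stats import spearmanr

pre_release  = [19,  2, 27,  1, 65, 30, 12]   # faults found in testing, per module
post_release = [ 0, 10,  0,  9,  5,  0,  1]   # customer-reported faults, per module

rho, p_value = spearmanr(pre_release, post_release)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.2f})")

# Share of pre-release faults sitting in modules with no post-release faults.
share = sum(pre for pre, post in zip(pre_release, post_release) if post == 0) \
        / sum(pre_release)
print(f"Pre-release faults in failure-free modules: {share:.0%}")
```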
A high number of pre-release faults may simply be explained by good testing (rather than poor quality), and a low number of post-release faults may simply be explained by low operational usage. In fact, the relationship between faults and failures is not at all straightforward. There is very strong empirical evidence that most failures experienced by software systems are caused by a tiny proportion of the residual faults. Conversely, most residual faults are benign in the sense that they will very rarely lead to failures in operation. This is shown in Figure 2.
In 1983, Ed Adams of IBM published the results of an extended empirical study into the relationship between faults and failures in nine large systems over many years of operation. He found remarkably consistent results across the nine systems. For example, in each case around 34% of the known residual faults led to failures whose mean time to occurrence was over 5,000 years. In practical terms, such failures were probably only ever observed once, by a single user (out of many thousands of users over several years). Conversely, the big faults, those which cause the frequent failures