Perhaps the most respected book on measurement in computing, Fenton and Pfleeger's Software Metrics: A Rigorous and Practical Approach, defines measurement as follows:
"Measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to describe them according to clearly defined rules."
The problem with this definition is that there are plenty of clearly defined rules available. We have to select the right one. Otherwise we could measure "goodness of testers" by the clearly defined rule "count their bug reports." This (as we'll see in a few moments) is ridiculous.
I prefer the following definition:
Measurement is the assignment of numbers to attributes of objects or events according to a rule derived from a model or theory.
Fenton and Pfleeger do point out the need for a model. They discuss it as an issue of how the attribute is defined. All that I'm doing here is making the issue much more visible.
The invisibility of underlying measurement models has led people to use inadequate and inappropriate "metrics," deluding themselves and wreaking havoc on their staffs. For a good read on this as a general problem, see R.D. Austin's 1996 book, Measuring and Managing Performance in Organizations, or, for that matter, Scott Adams' Dilbert comic strip.
Building a Theory of Measurement
Measurement theory addresses problems that run through many disciplines, including computing. I learned about the theory of measurement primarily from Steve Link and A.B. Kristofferson, when I did my doctoral studies in psychophysics (also known as perceptual measurement). (For a thoughtful history of that field, read Link's 1992 book, The Wave Theory of Difference and Similarity.)
This article is a preliminary report of my attempts to pull together thinking from several disciplines into a more coherent, and I think more practical, approach to software-related measurement. My goal is to help you evaluate measurement schemes that people ask you to use, to help you explain why the bad ones shouldn't be imposed on your group, and to help you develop more useful alternatives.
In summary, I think that the theory underlying a measurement must take into account at least nine factors. This article defines these nine factors and applies them to a few examples.
The first five are intuitive:
The next four factors are more technical but are essential for an understanding of the attribute, the instrument, and their relationship:
Example: Using a Ruler
Let's start with an example of the simplest case, measuring the length of a table with a one-foot ruler.
Example: The Scaling Problem
Suppose that Sandy, Joe, and Susan run in a race. Sandy comes in first, Joe comes in second, and Susan third. The race comes with prize money. Sandy gets $10,000, Joe gets $1000, and Susan gets $100.
The final scale to mention is the absolute scale. If you have one (1) pen, you have one (1) pen. If you cut it in half, you get a mess, not two halves of a working pen.
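To make the scale distinctions concrete, here is a minimal sketch in Python. The prize figures are the ones from the race above; the code itself is my illustration, not anything drawn from a measurement text:

    # Ordinal scale: finishing positions. The numbers encode order only.
    positions = {"Sandy": 1, "Joe": 2, "Susan": 3}

    # Ratio scale: prize money. Differences and ratios are meaningful.
    prizes = {"Sandy": 10_000, "Joe": 1_000, "Susan": 100}

    # Valid on an ordinal scale: comparisons.
    assert positions["Sandy"] < positions["Joe"]

    # NOT valid on an ordinal scale: arithmetic on the ranks.
    # The gap between 1st and 2nd "looks" equal to the gap between
    # 2nd and 3rd (one position each), but the prize money shows the
    # underlying differences are wildly unequal ($9,000 vs. $900).
    rank_gap_1_2 = positions["Joe"] - positions["Sandy"]    # 1
    rank_gap_2_3 = positions["Susan"] - positions["Joe"]    # 1
    prize_gap_1_2 = prizes["Sandy"] - prizes["Joe"]         # 9000
    prize_gap_2_3 = prizes["Joe"] - prizes["Susan"]         # 900

    # Absolute scale: a count of pens. The only admissible
    # transformation is identity: half a pen is not a pen.
    pens = 1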
Measurement in the Real World
The examples of length and of position in the race are toy problems: they are easy to figure out, the theories of relationship are clear-cut, and the side effects are minimal.
When it comes to things that we would really like to measure, life is more difficult. Examples of the kinds of things that testers are routinely asked about include:
How do we measure these? Each one involves complex issues, and typically a lot of judgment (which is subjective). Additionally, several of the most interesting dimensions involve human behavior. That's hardly a surprise: we are working in a field of human endeavor, called computing, whose essential work product is the stuff of mental creation. The essence of "quality" is "qualitative." As Gerald Weinberg wrote in his 1993 book Quality Software Management, "Quality is value to some person."
There is a bias in computing against measurements that involve subjective quantities. Somehow, some people have developed the idea that subjective issues are immeasurable and unscientific. (This article isn't the place to refute that directly; but if you're of that view, go read Link's wave theory book.)
Tom DeMarco provided an example of this bias almost twenty years ago. Although he has taken a different approach in his more recent book, Why Does Software Cost So Much?, his original presentation in 1982's Controlling Software Projects is a well-written, still influential example. DeMarco asked how to measure customer interface complexity, and provided an example of how not to do it: the members of the development team were asked to rate the customer interface complexity of their own projects as normal, greater than normal, or less than normal. He pointed out some biases associated with having developers rate their own code, and then concluded that "any exercise that tries to give a numeric value to an unquantum without doing any real measurement along the way is a bit of a fraud." An "unquantum" is, according to DeMarco, "a relevant factor that is unmeasured." Evidently, rankings by humans don't count as measurements. Instead, he said that measures like the following were "true metrics."
I accept DeMarco's opinion that the developers' rankings of their own work are unusably biased. But to conclude from this that human ranking of complexity is some kind of unquantum, because it isn't expressed in easy-to-count numbers (numbers that are much less directly related to the value we want to measure, namely the complexity of the thing to humans), seems... well, I guess we just disagree on that 1982 conclusion. I think it would be interesting to ask customers who interacted with the system to rate the different areas' customer interface complexity. The fact that there are lots of ways to do this badly doesn't create an excuse for walking away from a fundamental point: if you want to talk about the complexity of a human-machine interface, the human's sense of that complexity is a key measure, perhaps the most direct and important measure, of that complexity.
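If we did ask customers, the analysis would still have to respect the scale of the data. Here is a minimal sketch, with invented ratings and area names, of one defensible way to summarize such rankings: medians rather than means, because the ratings are ordinal.

    from statistics import median

    # Hypothetical 1-5 complexity ratings collected from customers
    # who actually used each area of the interface (5 = most complex).
    ratings = {
        "order entry":   [4, 5, 4, 3, 5],
        "reporting":     [2, 1, 2, 2, 3],
        "account setup": [3, 3, 4, 2, 3],
    }

    # The ratings are ordinal, so summarize with medians, not means:
    # we know 5 is more complex than 4, but not that the 4-to-5 step
    # equals the 1-to-2 step.
    for area, scores in ratings.items():
        print(f"{area}: median complexity {median(scores)}")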
Software-related attributes often involve psychological or subjective components. Our measurements of them are questionable when they fail to take these factors into account.
Let's consider three common attempts to develop software metrics:
Example: Bug Counts and the Theory of Relationship
Should we measure the quality (productivity, efficiency, skill, etc.) of testers by counting how many bugs they find? Leading books on software measurement suggest that we compute "average reported defects/working day" and "tester efficiency" as "number of faults found per KLOC" or "defects found per hour." [See the complete references at the end of this article, specifically Grady et al. 1987, Fenton et al. 1997, and Fenton et al. 1994.] These authors are referring to averages, not to measures of individual performance, and they sometimes warn against using individual results (because doing so might be unfair). However, I repeatedly run into managers (or, at least, into the test managers who work for them) who compute these numbers and take them into account in decisions about raises, promotions, and layoffs. For that matter, are these even valid measures of the efficiency of the group as a whole?
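For concreteness, the computations these books describe amount to simple division, as in this sketch with made-up counts. The triviality of the arithmetic is part of what makes the measure so tempting:

    # Hypothetical raw counts for one test group over one release.
    defects_found = 120
    working_days = 60
    hours_testing = 480
    kloc_tested = 25      # thousands of lines of code covered by the testing

    defects_per_day = defects_found / working_days    # 2.0  "reported defects/working day"
    defects_per_hour = defects_found / hours_testing  # 0.25 "defects found per hour"
    defects_per_kloc = defects_found / kloc_tested    # 4.8  "faults found per KLOC"

    # The arithmetic is trivial. The theory of relationship is not:
    # none of these numbers says anything about the severity of the
    # bugs, the quality of the reports, or the riskiness of the areas
    # that were tested.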
Let's do the analysis and see the problems with this measure:
Problems like these have caused several measurement advocates (specifically Grady and Caswell, and Austin) to warn against measuring attributes of individuals unless, as DeMarco suggested in 1995, the measurement is done for the benefit of the individual (for genuine coaching or for discovering trends) and is otherwise kept private.
With only a weak theory of relationship between bug counts and tester goodness, and serious probable side effects, we should not use this measure (instrument).
Example: Code Coverage
Suppose that you want to know how much testing has been done. How would you measure that?
One approach is to compute "code coverage." The most common definition of coverage involves the percentage of statements tested, or the percentage of statements plus branches tested. Supposedly, a higher percentage means more testing. Some people (vendors included) go further and foolishly say that 100 percent coverage means complete, or sufficient, testing.
There are several other types of coverage beyond statement and branch coverage (some examples are described in my 1995 article, listed in the complete references at the end of this feature). Each of these involves measuring the percentage of a certain type of test that you have run, or a certain level of thoroughness of checking for a specific type of error. We are never using the population of all possible tests of a product as our baseline when we compute code coverage; if we were, coverage would always be 0.00 percent, a rather boring number. But because we are not accounting for all possible tests, we can have a 100 percent covered product that still has undiscovered defects.
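The arithmetic behind a coverage number is a ratio against a chosen baseline. A minimal sketch, assuming the raw counts come from an instrumentation tool (the figures here are invented):

    # Counts as an instrumentation tool might report them.
    statements_total = 4_000    # the chosen baseline: statements, not "all possible tests"
    statements_executed = 3_400

    branches_total = 1_200
    branches_taken = 900

    statement_coverage = 100.0 * statements_executed / statements_total  # 85.0 percent
    branch_coverage = 100.0 * branches_taken / branches_total            # 75.0 percent

    # Note what the denominator is NOT: the population of all possible
    # tests of the product. Against that baseline, coverage would round
    # to 0.00 percent for any real test effort.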
In sum, our measures of the extent of testing, like so many measures that we take in software, are numeric. This might make them look more "scientific," but they are fundamentally judgment-driven. A theory of testing and of testing adequacy is embedded (often hidden) in such measures.
Example: Code Complexity
McCabe's complexity metric is criticized often enough as incomplete (see, for example, Fenton and Pfleeger in Software Metrics). But let's apply our model to this metric.
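As a reminder of what the metric actually computes: McCabe's cyclomatic complexity of a control-flow graph is v(G) = E - N + 2P, edges minus nodes plus twice the number of connected components, which for a single-entry, single-exit routine works out to the number of binary decisions plus one. A minimal sketch (the graph counts below are hand-drawn for illustration):

    def cyclomatic_complexity(edges: int, nodes: int, components: int = 1) -> int:
        """McCabe's v(G) = E - N + 2P for a control-flow graph."""
        return edges - nodes + 2 * components

    # A routine containing one if/else and one while loop might flatten
    # to a control-flow graph of 7 nodes and 8 edges.
    print(cyclomatic_complexity(edges=8, nodes=7))  # 3, i.e., 2 decisions + 1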
In his 1993 book, Making Software Measurement Work: Building an Effective Measurement Model, Bill Hetzel pointed out that
"Practitioner attitudes [toward measurement] tend to range from barely neutral to outright antagonistic. It is rare to find the practitioner who really thinks of measurement as a useful and indispensable tool for good software work. Most feel that they get back very little from the measurement activity. . . The psychological dislike and distrust our practitioners have about measurement is a significant challenge facing us. From my perspective, we've been pretty unsuccessful in serving working engineers and practitioners."
Hetzel suggests an alternative, bottom-up approach to measurement that is worth looking into. He wants to use metrics to stimulate questions and to explore the engineering activity, rather than to focus immediately on setting targets and goals that control engineering.
The approach that I'm suggesting isn't incompatible with Hetzel's, or with Fenton and Pfleeger's, or with many other common approaches to software metrics. But what I'm proposing here is more explicit about some issues that we too often treat too casually.
Measures are not acceptable simply because they are easy to compute and seem relevant, and they are not valuable merely because they have something to do with the latest goal-of-the-week. They work when they actually relate to something we care about, and when the risks of taking them (the probable side effects), within the scope of use of those measures, are insignificant compared to the value of the information we actually obtain from them. To understand that value, we must understand the underlying relationship between the measure and the attribute measured.
This material was first publicly presented at the Pacific Northwest Quality Conference in October 1999. This model was reviewed and extended at the Eighth Los Altos Workshop on Software Testing in December 1999. I thank the LAWST attendees, Chris Agruss, James Bach, Jaya Carl, Rocky Grober, Payson Hall, Elisabeth Hendrickson, Doug Hoffman, Bob Johnson, Mark Johnson, Brian Lawrence, Brian Marick, Hung Quoc Nguyen, Bret Pettichord, Melora Svoboda, and Scott Vernon, for their critical analyses.
Editor's Note: Because this article introduces a new theory into the body of industry literature, we have included its complete list of references.