In May 2006, we wrapped up the largest case study of peer code review ever published, done at Cisco Systems®. The software was MeetingPlace® — Cisco's computer-based audio and video teleconferencing solution. Over 10 months, 50 developers on three continents reviewed every code change before it was checked into version control.
We collected data from 2500 reviews of a total of 3.2 million lines of code. This article summarizes our findings.
How reviews were conducted
The reviews were conducted using Smart Bear Software's Code Collaborator system for tool-assisted peer review. This article is not intended to be a sales pitch for Collaborator, so please see the website for product details.
Cisco wanted to review all changes before they were checked into the version control server, which in their case was Perforce®. They used a Perforce server-side trigger (part of Code Collaborator) to enforce this rule.
Developers were provided with several Code Collaborator tools that allowed them to upload local changes from the command line, from a Windows GUI application, and from within the Perforce GUI applications P4Win and P4V.
Reviews were performed using Code Collaborator's web-based user interface:
The Code Collaborator software displayed before/after, side-by-side views of the source code under inspection, with differences highlighted in color. Everyone could comment by clicking on a line of code and typing. As shown above, conversations and defects were threaded by file and line number.
Defects were logged like comments but tracked separately by the system for later reporting and to create a defect log automatically. Cisco configured the system to collect severity and type data for each defect.
If defects were found, the author had to fix the problems and re-upload the files for verification. Only when all reviewers agreed that no more defects existed (and that previously found defects were fixed) was the review complete and the author allowed to check in the changes.
Code Collaborator collected process metrics automatically. Number of lines of code, amount of person-hours spent in the review, and number of defects found were all recorded by the tool (no stopwatch required). Reports were created internally for the group and used externally by Smart Bear to produce the analysis for the case study.
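The metrics described above can be sketched as a small data structure. This is a hypothetical illustration, not Code Collaborator's actual data model; the field and method names are assumptions chosen for clarity.

```python
from dataclasses import dataclass

@dataclass
class ReviewMetrics:
    """Process metrics captured automatically per review (hypothetical names)."""
    lines_of_code: int    # lines under inspection
    person_hours: float   # total time spent in the review
    defects_found: int    # defects logged during the review

    def defect_density(self) -> float:
        """Defects found per 1000 lines of code (kLOC)."""
        return self.defects_found / self.lines_of_code * 1000

    def inspection_rate(self) -> float:
        """Lines of code inspected per person-hour."""
        return self.lines_of_code / self.person_hours

# Example: a 250-line change reviewed for 1.5 hours, 8 defects logged.
review = ReviewMetrics(lines_of_code=250, person_hours=1.5, defects_found=8)
print(f"{review.defect_density():.0f} defects/kLOC")  # 32 defects/kLOC
print(f"{review.inspection_rate():.0f} LOC/hour")     # 167 LOC/hour
```

Because the tool records these numbers as a side effect of the review itself, the metrics come for free, with no stopwatch or manual bookkeeping.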
Jumping to the end of the story
The reader will no doubt find it disturbing that, after setting up the parameters of the experiment, we suddenly present conclusions without explaining our statistical methods, how we handled experimental control issues, how we identified "defects" that weren't logged as such, and so forth.
The length of this article prevents a proper treatment of the data. The patient reader is referred to Chapter 5 of Best Kept Secrets of Peer Code Review for a detailed account.
Conclusion #1: Don't review too much code at once (<200-400 LOC)
As the chart below indicates, defect density decreased dramatically when the number of lines of code under inspection went above 200:
By "defect density" we mean the number of defects found per amount of code, typically expressed per 1000 lines of code (kLOC) as shown on this graph. For new code you would typically expect at least 50 defects per kLOC, perhaps 20-30 for mature code. Of course, rules of thumb like these are easily invalidated by the language, the type of development, the goals of the software, and so forth. A future article will discuss the interpretation of such metrics in more detail.
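The rule of thumb above amounts to simple arithmetic. As a quick sanity check (the function name and defaults here are our own, not from the study):

```python
def expected_defects(loc: int, defects_per_kloc: float = 50.0) -> float:
    """Rough expectation under the 50-defects-per-kLOC rule of thumb for new code."""
    return loc / 1000 * defects_per_kloc

# A 400-line change of new code, under the 50/kLOC rule of thumb:
print(expected_defects(400))        # 20.0
# The same change in mature code, using the lower end of the 20-30/kLOC range:
print(expected_defects(400, 20.0))  # 8.0
```

So a review of a few hundred lines that turns up only one or two defects may say more about the review than about the code.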
In this case, think of defect density as a measure of "review effectiveness." Here's an example of how this works. Say two reviewers are looking at the same code, and reviewer #1 finds more defects than reviewer #2. We could say that reviewer #1 was "more effective" than reviewer #2, and the number of defects found is a decent measure of exactly how effective. To