Some programs must handle network errors, file system errors, and the like. Testing their error handling manually can be tedious and time-consuming. Relying on accidental errors is unreliable and uncontrollable. Learn about a method for simulating errors that makes the process automated and flexible.
Grasping the complexity of modern software products, layered as they are on multiple operating systems, databases, runtime libraries, third-party products, and various hardware platforms, is akin to an astronomer's attempt to grapple with all the known matter in the universe. The number of states a complex software product can enter, such as an operating system running on millions of PCs, is effectively limitless. The problem only grows in distributed software, where algorithms are spread over multiple computers.
We encountered such complexity while testing Mangosoft's Medley. At half a million lines of code, Medley is a distributed file system that can be spread over twenty-five PCs running Microsoft’s Windows 95, Windows 98, or Windows NT operating systems. While the Medley drive looks like a local PC drive, it is really a distributed, peer-to-peer LAN disk. Files on the distributed LAN drive migrate to the local disk of the user who accesses them the most. The product also has data mirroring built in to eliminate single points of failure that are common in most client/server solutions.
In such a product, the bulk of the code is devoted to error handling. A file's data can reside anywhere on the twenty-five PCs, yet Medley must provide the strongest data-integrity guarantees in the face of different operating systems, device errors, and PCs that crash or shut down at random.
Such guarantees can be tested by causing faults, such as a machine crash or a network outage, and checking that the product behaves as required. But doing this manually is tedious and time-consuming. In a distributed product that relies on network communication, there are only so many times one will pull the network cable between two machines to verify that the resulting communication failure is handled appropriately. Testing disk-full handling means filling a disk completely, which takes time, especially at today's disk sizes. And what is tedious and time-consuming is less likely to be done thoroughly in every release cycle.
If important faults are not automatically inserted, exercising error-handling code is often left to accidental failures that happen to occur in the course of testing. For instance, if disk-full faults are not actively inserted, qualification of a disk subsystem must rely on disks accidentally filling up during testing. Another example is buggy mainline code that crashes a system, forcing all the remaining systems participating in the Medley drive to acknowledge the loss of the crashed system and fail over to a drive state built from the remaining systems. Medley devotes a great deal of code to such failover operations. But as the mainline code stabilizes and fewer failures occur, this source of accidental faults dries up. It is easy to see that such faults are unreliable and uncontrollable.
Yet the reality today is that many projects rely solely on these haphazard approaches to testing the bulk of their code!
The risk of this traditional approach is that with fewer "accidental" errors to exercise the error-handling code paths, problem reports tend to trail off. This leads to the illusion of quality: the fatally attractive notion that because the mainline code paths exhibit a certain level of stability and quality, the error-handling code paths must have that same level of quality and maturity. Unless the tester can make plain the level of quality across the whole product, there can be tremendous time-to-market pressure to ship the software once the problem reports trail off.
What was our solution to this testing problem? Targeted software fault insertion.
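As a rough illustration of the idea (a minimal sketch, not Medley's actual mechanism), a fault-insertion hook can be as small as a wrapper around an I/O call that consults a test-controlled trigger: when the trigger is armed, the wrapper skips the real operation and reports the error that operation would have produced, such as a disk-full condition. In the C sketch below, every name (checked_write, fault_armed, the FAULT_DISK_FULL environment variable) is hypothetical.

/*
 * Minimal sketch of a targeted fault-insertion hook (illustrative only;
 * all names are hypothetical, not Medley's actual mechanism).
 *
 * Production code calls checked_write() instead of fwrite(). Normally the
 * call passes straight through. When the test harness arms the disk-full
 * fault (here, via the FAULT_DISK_FULL environment variable), the wrapper
 * skips the real write and reports ENOSPC, so the disk-full error path is
 * exercised without filling a real disk.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the named fault has been armed by the test harness. */
static int fault_armed(const char *fault_name)
{
    const char *v = getenv(fault_name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Drop-in replacement for fwrite() with a disk-full insertion point. */
static size_t checked_write(const void *buf, size_t size, size_t count, FILE *fp)
{
    if (fault_armed("FAULT_DISK_FULL")) {
        errno = ENOSPC;     /* simulate "no space left on device" */
        return 0;           /* same failure indication as a real short write */
    }
    return fwrite(buf, size, count, fp);
}

int main(void)
{
    FILE *fp = fopen("demo.dat", "wb");
    if (fp == NULL)
        return 1;

    const char data[] = "some payload";
    if (checked_write(data, 1, sizeof data, fp) != sizeof data) {
        /* This branch is the error-handling path under test. */
        perror("write failed");
        fclose(fp);
        return 1;
    }

    fclose(fp);
    return 0;
}

Run as-is, the program simply writes its file; run with FAULT_DISK_FULL=1 in the environment, the disk-full error path fires on demand and repeatably, with no need to fill a real disk.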