Targeted Fault Insertion

A method for testing error handling code paths
Better Software Magazine
Volume-Issue: 2002-03
Summary:

Some programs must handle network errors, file system errors, and the like. Testing their error handling manually can be tedious and time consuming. Relying on accidental errors is unreliable and uncontrollable. Learn about a method for simulating errors that makes the process automated and flexible.

Grasping the complexity of modern software products—layered on multiple operating systems, databases, runtime libraries, third-party products, and various hardware platforms—is akin to the job of astronomers who attempt to grapple with all the known matter in the universe. The limitless nature of the universe is comparable to the states entered by complex software products like operating systems that run on millions of PCs. The problems only increase in distributed software, where algorithms are spread over multiple computers.

We encountered such complexity while testing Mangosoft's Medley. At half a million lines of code, Medley is a distributed file system that can be spread over twenty-five PCs running Microsoft’s Windows 95, Windows 98, or Windows NT operating systems. While the Medley drive looks like a local PC drive, it is really a distributed, peer-to-peer LAN disk. Files on the distributed LAN drive migrate to the local disk of the user who accesses them the most. The product also has data mirroring built in to eliminate single points of failure that are common in most client/server solutions.

In such a product, the bulk of the code is devoted to error handling. A file's data can reside anywhere on the twenty-five PCs, but Medley must provide the highest data integrity guarantees in the face of different operating systems, device errors, and PCs that randomly crash or shut down.

Testing such guarantees can be done by causing faults, such as a machine crash or network outage, and checking that the product behaves as required. But doing this manually is tedious and time consuming. In the case of a distributed product that relies on network communication, there are only so many times one will pull out the network cable between two machines to verify that the resulting communication failure is handled appropriately. To test disk-full handling, it takes time to completely fill a disk, especially with today’s large disk sizes. And what is tedious and time consuming is less likely to be done thoroughly in every release cycle.
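To make the contrast concrete, here is a minimal Python sketch of simulating a full disk instead of actually filling one. It is an illustration only, not Medley's test code: `save_record` and `run_with_disk_full` are hypothetical names, and the sketch assumes the code under test opens files through Python's built-in `open`.

```python
import errno
from unittest import mock


def save_record(path, data):
    """Hypothetical code under test: write data and report a full disk cleanly."""
    try:
        with open(path, "wb") as f:
            f.write(data)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # This is the error handling path the test wants to reach.
            return "disk full"
        raise
    return "saved"


def run_with_disk_full(func, *args):
    """Run func while every open() call fails as if the disk were full."""
    fault = OSError(errno.ENOSPC, "No space left on device")
    with mock.patch("builtins.open", side_effect=fault):
        return func(*args)
```

Calling `run_with_disk_full(save_record, path, data)` reaches the disk-full branch immediately and repeatably, without writing a single byte to a real disk.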

Yet if important faults are not automatically inserted, then exercising error handling code is often left to accidental failures that randomly occur in the course of testing. For instance, if disk-full faults are not actively inserted, qualification of a disk subsystem must rely on disks accidentally filling up in the course of testing. Another example is buggy mainline code that causes a system to crash. This forces all the remaining systems participating in the Medley drive to acknowledge the loss of the crashed system and to fail over to a drive state comprising information on the remaining systems. Medley devotes a lot of code to such failover operations. But as the mainline code stabilizes and fewer failures occur, this source of accidental faults dries up. It’s easy to see that such faults are unreliable and uncontrollable.

Yet the reality today is that many projects rely solely on these haphazard approaches to testing the bulk of their code!

The risk of this traditional approach is that with fewer "accidental" errors to exercise the error handling code paths, problem reports tend to trail off. This leads to the illusion of quality, the fatally attractive notion that since mainline code paths are exhibiting a level of stability and quality, error handling code paths likewise have this same level of quality and maturity. Unless the tester can make plain the level of quality over the whole product, there can be tremendous time-to-market pressure to ship software once the problem reports trail off.

What was our solution to this testing problem? Targeted software fault insertion.
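This section does not describe the mechanism itself, so the following Python sketch is only a generic illustration of what "targeted" can mean: the test names a specific fault point and chooses which call to it should fail. Every name here (`FaultInjector`, `faults`, `send_block`, `replicate`, the `"net.send"` label) is a hypothetical stand-in, not part of Medley.

```python
class FaultInjector:
    """Registry of named fault points that a test can arm."""

    def __init__(self):
        # name -> [calls remaining until the fault fires, exception to raise]
        self._armed = {}

    def arm(self, name, exc, on_call=1):
        """Make the on_call-th check of this fault point raise exc, once."""
        self._armed[name] = [on_call, exc]

    def check(self, name):
        """Called by product code at a fault point; raises if armed and due."""
        entry = self._armed.get(name)
        if entry is None:
            return
        entry[0] -= 1
        if entry[0] == 0:
            del self._armed[name]
            raise entry[1]


faults = FaultInjector()


def send_block(peer, payload):
    """Stand-in for a real network send, with a fault point in front of it."""
    faults.check("net.send")
    return len(payload)


def replicate(peers, payload):
    """Send payload to every peer and return the peers that could not be reached."""
    failed = []
    for peer in peers:
        try:
            send_block(peer, payload)
        except ConnectionError:
            failed.append(peer)
    return failed
```

A test could then call `faults.arm("net.send", ConnectionError("injected"), on_call=2)` and check that `replicate(["a", "b", "c"], b"data")` reports only the second peer as failed, with no network cable pulled.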

About the author

Paul Houlihan

Paul Houlihan has been studying for several years how to remove defects cost-effectively from code, especially for distributed algorithms. For the past 5 years, Paul has worked at Mangosoft Incorporated as a principal engineer in the QA group. Mangosoft specializes in peer-to-peer software running in kernel as well as user mode. Mangosoft has two premier products. The first is CacheLink, a LAN-based web accelerator that caches web pages and makes them available to other users on the LAN. The other is Mangomind, a shared Internet drive that appears as a local drive on your Windows desktop. With extensive caching for good performance, Mangomind makes collaboration over the Internet as simple as accessing a local drive. While it has been a challenge to automate the testing of these diverse and complicated products, very high quality is essential for distributed products; indeed, it is a key differentiator. Paul has contributed to the extensive test automation infrastructure that in one year alone ran 56,000 tests in Mangosoft's automated unattended 24-by-7 lab, filing 3686 problem reports. Prior to that, Paul was a principal engineer in the OpenVMS cluster group. Layered on the OpenVMS operating system, VMSClusters are a distributed work environment spread over up to 96 computers, spanning multiple interconnects and computer and disk architectures.
