"But it worked okay on my machine!" How often have you heard that refrain? And how do you go about solving the mystery of why software works on one machine, but not on others? Let me tell you a story about an incident I experienced. It may help you crack, or better yet, prevent your own hard-to-solve case.
Immediately following an upgrade, a company I recently worked for started getting reports of strange behavior from a deployment test site in Asia. The site's system had suddenly started crashing mysteriously, a problem that could not be reproduced at any other customer deployment test sites. The crashes happened frequently (sometimes daily, sometimes every few days), and the issue was putting the entire project at risk. The company began investigating. It didn't take too long to find the culprit.
Those Dirty Deployment Machines
The deployment machine at the Asian site consisted of a pure Java server application running on a Windows machine with an installed web server. It also had a client application, several instances of which could be connected to the server at any given time. These clients polled the server at frequent intervals for updated information. The release that caused the problem created the connection slightly differently than before. (Aha! A clue.)
The server was also running a number of background applications, including an Intrusion Detection System (IDS). Apparently the IDS was interpreting the polling requests to the server application as a possible "SYN attack" (a denial-of-service attack in which a flood of unacknowledged connection requests forces the server to over-allocate buffer space) and shutting down the server machine. Furthermore, the IDS on this machine, for some reason unknown to the Sys Admin staff, had been configured differently from the other deployment test site machines: it was set to shut the machine down when an intrusion attempt was detected, rather than just alert the network management center. The company had found its problem.
The growing number of background applications (IDS, virus checkers, firewalls, monitoring software, IM clients, PDA sync apps, device drivers, etc.) on so-called "clean" deployment machines presents a risk to the successful installation and commissioning of projects, especially since the customers themselves are sometimes only barely aware of these applications' existence and configuration.
Untangling the Web
Often, interactions are not simple clashes between just two applications. A major tool vendor recently had a problem with a new product: selecting a certain configuration option caused their application to hang. After some investigation, it was found to be caused by a three-way interaction between their software, a video card on the test machine, and the version of the Java Virtual Machine (JVM) running on the test machine. Upgrading the JVM fixed the bug, but the problem illustrates how involved these interactions can sometimes get. The more complex they are, the trickier they are to resolve (especially if the customer is looking over your shoulder), and the more urgent the need for a methodical approach to the whole issue of application and driver interactions on deployment platforms.
A pharmaceutical company must determine what interactions may exist between their new product and existing drugs in the marketplace. Similarly, the QA Manager should determine what applications might exist in the deployment environment, how they are configured, and which tests could detect how those applications might interfere with his own product's operation. (At least if he wants to get a bonus when the customer signs off on time.)
Interaction-Proofing Your Test Sites
Virus checkers are dormant much of the time, but if your application schedules CPU-intensive tasks at regular intervals, try to make sure those intervals do not overlap with the virus checker's scan schedule. Similarly, inadvertently giving one of your system files the same (or a similar) name as a known virus could cause installation problems. Such problems used to be avoided by instructing the user to "disable all virus checkers before installing"; in the corporate environment, however, there is decreasing tolerance for this approach.
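As a quick sanity check during test planning, you can compute whether your scheduled task windows collide with the scanner's scan window. A minimal sketch, assuming both run as once-daily jobs within the same day (the schedule times here are invented for illustration):

```python
from datetime import datetime, time, timedelta

def overlaps(start_a, dur_a, start_b, dur_b):
    """Return True if two same-day windows (start time, duration) overlap."""
    day = datetime(2000, 1, 1)  # arbitrary anchor date; only times matter
    a0 = day.replace(hour=start_a.hour, minute=start_a.minute)
    b0 = day.replace(hour=start_b.hour, minute=start_b.minute)
    a1 = a0 + dur_a
    b1 = b0 + dur_b
    return a0 < b1 and b0 < a1

# Hypothetical schedules: a nightly CPU-intensive job vs. the virus scan.
report_job = (time(2, 0), timedelta(hours=1))
virus_scan = (time(2, 30), timedelta(hours=2))

print(overlaps(*report_job, *virus_scan))  # True: reschedule one of them
```

This deliberately ignores windows that wrap past midnight; a real check would also need the scanner's actual schedule, which you would have to get from the customer.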
For IDS, it's a bit more difficult, as a lot depends on individual configurations. However, these systems do have systematic patterns and methods they use for detecting intrusive behavior, as well as deterministic methods for dealing with the "intruder." Try to determine if there are any scenarios in which your application may appear to the IDS to be acting like an "intruder." If possible, change your application’s behavior to make it appear less threatening. If adequate changes are not possible, ensure that the customer (and the deployment engineer) is made aware of these potential problems. You could even suggest that the IDS be reconfigured to be less "sensitive," but be warned—suggesting this course could put you head-to-head with your customer's IT security people, a battle you will be unlikely to win. Always expect that your system will have to make the concessions.
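One common way a polling client ends up looking like a SYN flood is by opening a fresh TCP connection for every poll: to an IDS, a steady stream of SYNs from many clients can resemble an attack. Reusing one persistent (keep-alive) connection per client sharply reduces the connection-request rate the IDS sees. A sketch of this approach, with a hypothetical host name and polling endpoint:

```python
import http.client

HOST = "server.example.com"  # hypothetical deployment server
POLL_PATH = "/status"        # hypothetical polling endpoint

def make_connection():
    # One long-lived connection per client. Repeated fresh TCP handshakes
    # (one SYN per poll, per client) are what an IDS can mistake for a flood.
    return http.client.HTTPConnection(HOST, timeout=10)

def poll_once(conn):
    """Poll over an existing keep-alive connection, reconnecting on error."""
    try:
        conn.request("GET", POLL_PATH)
        resp = conn.getresponse()
        return conn, resp.read()  # drain the body so the socket stays reusable
    except (http.client.HTTPException, OSError):
        conn.close()
        return make_connection(), None  # caller retries on the new connection
```

The caller would invoke `poll_once` in a loop with a sleep between polls, keeping the returned connection object across iterations rather than building a new one each time.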
Some software has "squatter's rights." For example, IM clients will typically use a port that they have absolutely no right to use. They could use others, but because IM clients are now so ubiquitous, having your software use the same port is asking for trouble; the IM client is always going to win. Assume the IM client will always be on whatever port it has claimed, and stay away from it. Don't expect your customer to be too forthcoming with a list of the ports they do or don't use, either. Select high-numbered ports for your application, and make allowances for the fact that you may have to change them at installation.
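Making the port configurable is cheap insurance. A sketch of an installer-time check that tries a preferred high-numbered port and walks upward until it finds a free one (the port numbers are illustrative, not a recommendation):

```python
import socket

def bind_first_free(preferred=48620, attempts=20, host="0.0.0.0"):
    """Bind to the preferred port, or walk upward until a free one is found."""
    for port in range(preferred, preferred + attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(5)
            return sock, port  # record this port in the install configuration
        except OSError:
            sock.close()  # port in use (perhaps by an IM client); try the next
    raise RuntimeError(
        f"no free port in {preferred}-{preferred + attempts - 1}")
```

Whatever port the installer settles on should be written to the application's configuration, so the deployment engineer can see (and override) the choice.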
Network management applications rarely cause interaction problems, but it's good to know if they are there. You may inadvertently discover that your application needs to be managed by them.
If your budget permits, you could set up two test environments. The first should be a completely "clean" configuration that you can use to baseline your software and ensure objectively that the features and functionality work when there is no possibility of interference. The second should mimic the customer's configuration. If you have to support more configurations than you have available test machines, use Ghost or a similar imaging utility to create a library of customer configurations. Regression test each new build on each configuration, and keep track of test case results in each case. You’ll probably find a few surprises.
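Keeping the per-configuration results in a simple matrix makes the surprises easy to spot: anything that passes on the clean baseline but fails on a customer image is an interaction suspect. A minimal sketch of such a tracking structure (the configuration and test names are invented):

```python
# Results of one regression run, keyed by (configuration, test case).
results = {
    ("clean-baseline",   "login"):   "pass",
    ("clean-baseline",   "polling"): "pass",
    ("site-asia-ids",    "login"):   "pass",
    ("site-asia-ids",    "polling"): "fail",  # interaction suspect
    ("site-eu-firewall", "login"):   "pass",
    ("site-eu-firewall", "polling"): "pass",
}

def interaction_suspects(results):
    """Tests that pass on the clean baseline but fail on a customer image."""
    baseline_pass = {t for (cfg, t), r in results.items()
                     if cfg == "clean-baseline" and r == "pass"}
    return sorted({(cfg, t) for (cfg, t), r in results.items()
                   if r == "fail" and t in baseline_pass})

print(interaction_suspects(results))  # [('site-asia-ids', 'polling')]
```

Because the clean baseline establishes that the feature itself works, a failure confined to one customer image points you straight at that image's background applications.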
Who Dunnit, I Mean, Who Does It?
So who goes out and gathers all this information? If you're like many QA Managers, you don't get invited to customer meetings and can't directly trawl for information yourself. Instead, enlist the help of people in technical support for projects involving existing customers. It's in their interest to help you, as it reduces the likelihood that they'll have to deal with these problems somewhere down the line.
In the case of an initial installation for a new customer, ask the field sales engineers to help you out. Supply them with a checklist of configuration items that they can look for over the course of several visits, with the customer's consent, of course. Also try to touch base with any technically minded marketing people; it's possible they may have an insight into the customer's platform configuration.
Gathering and utilizing deployment platform information neatly augments some of the "lean" software development practices, especially the concept of "doing it right the first time" by incorporating feedback—from the customer in this case. Once you've gone through this exercise a few times, expect your deployment and support costs to drop significantly. And don't forget to collect your bonus.