Recently, we worked on a high-risk, high-visibility system where performance testing ("Let's just make sure it handles the load") was the last item on the agenda. As luck would have it, the system didn't handle the load, and very long days and nights ensued. Delivery was late, several serious disasters were narrowly averted, and large costs were incurred.
It doesn't have to be this way. Nasty end-of-project performance surprises are avoidable. If you've suffered through one or two of these projects and are looking to avoid them in the future, this case study will provide a roadmap.
The Case Study
This case study involves the testing of Internet appliances—simple devices that allow a user to send and receive email, view simple email attachments, and surf the Web. The appliance is not a complex, general-purpose computer, but rather a simple, single-purpose box.
On this project, in addition to testing the clients—the appliances themselves—we also needed to test the servers. A key quality risk for the servers was that they might not be able to handle the 25,000 to 40,000 appliances they would have to support.
The project was planned to last about six months. Rather than waiting to see what happened during system test, we created a comprehensive performance test strategy consisting of four steps, described in the sections that follow.
Most organizations skip steps 1 through 3 and begin step 4 just before system installation. That's often the beginning of a sad story. In this case, we went through each step. Since we were responsible for system testing, we'll focus on the process from that perspective.
1: Static Performance Testing
Because performance was a key issue, the server system architect's design work on the server farm included a model of system behavior under various load levels. This model was a spreadsheet estimating resource utilization level based on a number of Internet appliances running in the field. The estimated resource utilization included CPU, memory, network bandwidth, etc.
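The sketch below illustrates the kind of spreadsheet model described above, recast as Python: given a fleet size, estimate resource utilization per server farm. All of the per-device coefficients (the active fraction, CPU per thousand active sessions, and so on) are hypothetical placeholders for illustration, not the project's actual figures.

```python
def estimate_utilization(devices,
                         active_fraction=0.02,       # hypothetical: 2% of fleet online at once
                         cpu_pct_per_1k_active=3.0,  # hypothetical CPU cost coefficient
                         mem_mb_per_active=0.5,      # hypothetical memory per active session
                         kbps_per_active=8.0):       # hypothetical bandwidth per active session
    """Estimate server resource utilization for a fleet of appliances,
    in the spirit of the spreadsheet model described in the text."""
    active = devices * active_fraction
    return {
        "active_sessions": active,
        "cpu_pct": active / 1000 * cpu_pct_per_1k_active,
        "mem_mb": active * mem_mb_per_active,
        "bandwidth_mbps": active * kbps_per_active / 1000,
    }

# Evaluate the model at the low and high ends of the expected fleet size.
low = estimate_utilization(25_000)
high = estimate_utilization(40_000)
```

A model like this is cheap to peer review: each coefficient is an explicit, challengeable assumption, which is exactly what made the review described above productive.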
After creation, this model underwent peer review by members of the test team and other developers and managers. We made adjustments to the model until we were relatively confident we had a solid starting design.
2: Static Performance Analysis
Based on the spreadsheet model, the server system architect worked with consultants to create a dynamic simulation. These consultants had diverse skills in network and internetwork architecture, programming, and system administration. The simulation was fine-tuned in an iterative process until the architects had sufficient confidence in their predictions. Purchasing hardware and configuring the server farm were also iterative processes. Once we were confident in the model's prediction of the need for a given element, we would purchase it.
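The details of that simulation are not given here, but the general idea can be sketched: drive an assumed arrival curve against a fixed processing capacity and watch the request backlog. Everything in this toy version—the diurnal arrival curve, the rates—is an illustrative assumption, not the project's model.

```python
import math

def simulate_backlog(hours, base_rps, peak_rps, capacity_rps):
    """Toy capacity simulation: arrivals swing between base_rps and
    peak_rps over a 24-hour cycle; capacity_rps is what the farm can
    process. Returns the worst request backlog seen."""
    backlog, worst = 0.0, 0.0
    for h in range(hours):
        # Deterministic diurnal curve: sinusoidal swing between base and peak.
        arrivals = base_rps + (peak_rps - base_rps) * (1 + math.sin(2 * math.pi * h / 24)) / 2
        # Unserved requests carry over; the backlog can never go negative.
        backlog = max(0.0, backlog + (arrivals - capacity_rps) * 3600)
        worst = max(worst, backlog)
    return worst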
3: Unit Performance Testing
The system consisted of ten servers providing five key capabilities: e-commerce hosting, email repository, IMAP mail, various user-related databases, and software and operating system updates. Load balancing split each key capability's work across two servers.
Each of these five capabilities underwent unit testing. A unit was defined as a distinct service provided as part of the capability. To carry out this unit testing, the development staff used some internally developed tools to which we added some enhancements. These tools exercised critical and complex components such as update and mail.
4: System Performance Testing
The approach on this project was unusual, as static performance testing, static performance analysis, and unit performance testing are not commonly done. Typically, teams start with this step; much of our success during system testing, though, was enabled and supported by that up-front work.
In parallel with the system development and unit performance testing, we planned and prepared for system testing of the servers, including performance testing. As a specialized testing type within the system test phase, performance testing had its own test plan, in which we specified the test's main objectives.
In the performance test plan, we described the particular system features we were to performance test—update, mail, Web, and database.
Update: The update servers updated the clients. Using the developer's unit test harness, we built a load-generating application that could simulate thousands of clients, each using a network socket to talk to an update server. The simulated clients would request updates, creating load on the update servers. As an update server received new events or data needed by a client, it would place them in a queue to be delivered at the next update. The update servers could change flow control dynamically based on bottlenecks (hung packets or too many devices requesting updates at once), which also needed to be tested.
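The mechanism described—many simulated clients, each holding a socket to an update server—can be sketched as follows. The real tooling was TCL against the production update servers; this Python sketch invents a trivial stub server and protocol strings (`CHECK-UPDATE`, `UPDATE-OK`) purely to make the example self-contained.

```python
import socket
import socketserver
import threading

class UpdateHandler(socketserver.StreamRequestHandler):
    """Stub standing in for an update server: acknowledges each request.
    (The project tested real update servers; this stub exists only so the
    sketch runs on its own.)"""
    def handle(self):
        request = self.rfile.readline().strip()
        self.wfile.write(b"UPDATE-OK " + request + b"\n")

def simulated_client(host, port, client_id, results):
    """One simulated appliance: open a socket, request an update, and
    record whether the server acknowledged it."""
    try:
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(f"CHECK-UPDATE {client_id}\n".encode())
            reply = sock.makefile().readline()
            results[client_id] = reply.startswith("UPDATE-OK")
    except OSError:
        results[client_id] = False

def run_load(n_clients, host, port):
    """Spawn n_clients simulated appliances concurrently and collect results."""
    results = {}
    threads = [threading.Thread(target=simulated_client,
                                args=(host, port, i, results))
               for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Spawning each simulated client as its own thread (or, at larger scale, as its own process on a Linux box) is what lets a single load-generation host stand in for thousands of appliances.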
Mail: The clients sent and received email via the mail servers, which followed the IMAP standard. We needed to simulate client email traffic, including attachments of various types. Emails varied in size due to message text and attachments. Some emails would go to multiple recipients, some to only one.
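Generating that kind of mixed email traffic can be sketched as below. The project's actual generator was built from TCL scripts, sendmail, and an IMAP client; this Python sketch only shows how messages of controlled, varied sizes with optional attachments might be constructed before being handed to an SMTP client. The addresses and size mix are hypothetical.

```python
import random
from email.message import EmailMessage

def make_load_message(sender, recipients, body_kb, attachment_kb=0):
    """Build one test email of a controlled size, optionally carrying a
    binary attachment, ready to hand to an SMTP client."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"perf-test message {random.randrange(1_000_000)}"
    msg.set_content("x" * (body_kb * 1024))          # filler body of body_kb kilobytes
    if attachment_kb:
        msg.add_attachment(bytes(attachment_kb * 1024),
                           maintype="application", subtype="octet-stream",
                           filename="payload.bin")
    return msg

def make_mixed_traffic(count):
    """A hypothetical profile: every fourth mail is large, carries an
    attachment, and goes to multiple recipients; the rest are small."""
    traffic = []
    for i in range(count):
        if i % 4 == 0:
            traffic.append(make_load_message(
                "load@example.com",
                ["user1@example.com", "user2@example.com"],
                body_kb=2, attachment_kb=64))
        else:
            traffic.append(make_load_message(
                "load@example.com", ["user1@example.com"], body_kb=1))
    return traffic
```

The point of parameterizing size, attachments, and recipient count is that the traffic mix can then be tuned to match the agreed usage profile rather than whatever the tool happens to emit.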
Web: When an appliance was sent to a customer, the server farm had to be told to whom it belonged, what servers would receive connections from the client, etc. This process was handled by provisioning software on the Web servers. The Web servers also provided content filtering and Internet access. We simulated realistic levels of activity in these areas.
Database: Information about each appliance was stored in database servers. This involved various database activities—inserts, updates, and deletes—which we simulated by using SQL scripts to carry out these activities directly on the database servers.
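A minimal sketch of that approach is below, using an in-memory SQLite database as a stand-in for the project's database servers. The `appliance` schema and the update/delete patterns are hypothetical illustrations of the insert/update/delete mix described above.

```python
import sqlite3

def seed_appliances(conn, n):
    """Insert n hypothetical appliance records directly, bypassing the UI."""
    conn.execute("CREATE TABLE IF NOT EXISTS appliance "
                 "(id INTEGER PRIMARY KEY, owner TEXT, status TEXT)")
    conn.executemany(
        "INSERT INTO appliance (id, owner, status) VALUES (?, ?, ?)",
        [(i, f"user{i}@example.com", "active") for i in range(n)])
    conn.commit()

def exercise(conn):
    """Run a mix of updates and deletes against the seeded data, the way
    the SQL load scripts exercised the real database servers."""
    conn.execute("UPDATE appliance SET status='suspended' WHERE id % 10 = 0")
    conn.execute("DELETE FROM appliance WHERE id % 100 = 99")
    conn.commit()
```

Driving the database directly like this is what made periodic data resets practical; entering the same data through the user interface would have meant navigating at least twenty different screens per record.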
We created a usage scenario, or profile, and identified our assumptions. We started discussing this plan with the developers in the early draft stage. Once we had tuned the profiles based on the developers' feedback, we met with management and tuned some more. We corrected assumptions and tracked the additions and changes made to the profile.
Part of our plan included specifying and validating the test environment. The environment used for performance testing is shown in figure 1. Note that we show a pair of each server type—two physical servers running identical applications. Load balancing devices (not shown in the figure) balance the load between servers.
Since this testing occurred prior to the first product release, we were able to use the production server farm as the test environment. This relieved us from what is often a major challenge of performance testing—testing in a "production-like" environment rather than the actual one. As shown in figure 1, we had both client testing appliances and beta testing appliances accessing the servers. In some cases, we would run other functional and nonfunctional tests while the performance tests were executing so that we could get a subjective impression from the testers on what the users' experience would be when working with the clients while the servers were under load.
Preparing System Performance Tests
To run our performance tests, we needed load generators and probes. For the probes to monitor server performance, we used top and vmstat. However, commercial load generation tools were not available for most of the tasks. To avoid the enormous expense and delay of creating these test tools from scratch, we tried to reuse or adapt other tools wherever possible.
Update: We discovered the developers had created a unit test tool for update testing. As luck would have it, the tool was created using TCL, a scripting language we had used on a previous project. We obtained the code, but it needed to be modified for our purposes. The lead developer was intrigued by our approach and plan. Instead of merely handing off the tool to us, he spent a number of days modifying it for us. His modifications made it easier to maintain and added many of the features we needed such as logging, standardized variables, some command line options, and an easier mechanism for multithreading. He even explained the added code to our team and reviewed our documentation. We spent another couple of days adding functionality and documenting further.
We were able to spawn many instances of the tool as simulated clients on a Linux server. These simulated clients would connect to the update servers under test and force updates to occur. We tested the tool by running different load scenarios with various run durations.
Mail: Things went less smoothly in our quest for load generators for the mail server. In theory, a load testing tool existed and was to be provided by the vendor who delivered the base IMAP mail server software. In practice, that didn’t happen, and, as Murphy's Law dictates, the fact that it wouldn’t happen only became clear at the last possible minute.
We cobbled together an email load generator package. It consisted of some TCL scripts that one developer had used for unit testing of the mail server, along with an IMAP client called clm that runs on a QNX server and can retrieve mail from the IMAP mail server. We developed a load script to drive the UNIX sendmail client using TCL.
Web: We recorded scripts for the WebLOAD tool based on feedback from the business users. These scripts represented the scenarios most likely to occur in production. To be more consistent with our TCL scripting, we used the tool's command line option, which allowed a master script to kick off the entire suite of tests. We discovered that our tests were not adversely affecting the Web servers, and management informed us that the risk of slow Web page rendering was at the bottom of our priority list. We continued to run these scripts but did not provide detailed analysis of the results.
Database: For the database servers, we wrote SQL scripts that were designed to enter data directly into the database. To enter data through the user interface would have required at least twenty different screens. Since we had to periodically reset the data, it made sense to use scripts interfacing directly with the database. We discussed these details with the developers and the business team. All agreed that using TCL to execute SQL would meet our database objectives and could realistically simulate the real world, even though such scripts might seem quite artificial.
These scripts replicated the production database activity. We also created some utility scripts to accomplish specific tasks, such as periodically resetting the data.
We used TCL to launch these scripts from a master script.
Putting It Together With The Master Script
The profile called for testing each area of functionality—first individually and then in conjunction with other areas. We standardized the interfaces for all of our scripts and created a master script in TCL. The master script allowed for ramping up the number of transactions per second as well as setting the sustained level. We sustained the test runs for at least twenty-four hours. If 0.20 transactions per second was the target, we did not go from zero to 0.20 immediately but ramped up in a series of steps. If more than one scenario was running, we would ramp up more slowly.
The number of transactions per second to be tested had been derived from an estimated number of concurrent users. The number of concurrent users was in turn derived from an estimated number of clients to be supported in the field. Only a certain percentage of the clients would actually be communicating with the servers at any given moment.
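The derivation chain just described—fielded clients, to concurrent users, to transactions per second—and the stepped ramp can be sketched as follows. All of the numbers in the example are hypothetical; the project's actual fractions and rates are not given in the text.

```python
def target_tps(fielded_clients, concurrent_fraction, tx_per_user_per_minute):
    """Derive a transactions-per-second target from the number of clients
    in the field, via the estimated number of concurrent users."""
    concurrent_users = fielded_clients * concurrent_fraction
    return concurrent_users * tx_per_user_per_minute / 60.0

def ramp_schedule(target, steps):
    """A stepped ramp from zero to the target rate, rather than an
    immediate jump, as the master script did."""
    return [round(target * (i + 1) / steps, 3) for i in range(steps)]

# Hypothetical example: 40,000 clients, 5% concurrently active,
# 6 transactions per user per minute.
rate = target_tps(40_000, 0.05, 6)
```

Ramping in steps, and ramping more slowly when several scenarios ran together, made it possible to see at what load level a problem first appeared instead of merely observing that the final level failed.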
The master test script started top and vmstat on each server, and their results were written continuously to output log files. At the end of each test run, top and vmstat were stopped. A script then collected all the log files, deposited them into a single directory, and zipped them.
We also had TCL scripts to collect the top and vmstat results from each server under test, and we used Perl to parse the log files and generate meaningful reports.
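The parsing step can be sketched as below. The project used Perl; this Python stand-in shows the same idea against typical Linux vmstat output, where the "id" column is the idle-CPU percentage. The sample log is synthetic.

```python
def parse_vmstat(text):
    """Parse vmstat-style output into one dict per sample line."""
    header, rows = None, []
    for line in text.splitlines():
        fields = line.split()
        if "us" in fields and "id" in fields:        # the column-name header line
            header = fields
        elif header and fields and all(f.lstrip("-").isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    return rows

def cpu_busy_summary(rows):
    """Average CPU busy percentage: 100 minus the mean idle ('id') column."""
    idle = [r["id"] for r in rows]
    return 100 - sum(idle) / len(idle)

# Synthetic two-sample vmstat log for illustration.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0      0  12345    678   9012    0    0     5    10  100  200 20  5 70  5
 2  0      0  12000    678   9000    0    0     6    12  110  210 40 10 45  5
"""
```

Reducing raw probe logs to a few summary numbers per run is what made comparisons across twenty-four-hour test runs manageable.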
Bulletproofing The Tests
We were careful to ensure that the tests were valid and properly executed. We created extensive procedures for setting up and running the various scripts. These procedures saved us a tremendous amount of time and provided quick reference when things fell apart during testing. They also allowed any test team member to start and stop the tests, which was critical because system tests ran virtually 24/7. These procedures helped us get support and buy-in from the developers, for, once they reviewed the procedures, they could be confident that we would not subject them to distracting "bug reports" that were actually false positives.
Performing the Tests and Perfecting System Performance
As system testing started, we went into a tight loop of running the system performance tests, identifying performance problems, getting those problems fixed, and testing the fixes. We found some interesting bugs.
Some of these bugs could be considered "test escapes" from unit testing, but many were the kind that would be difficult to find except in a properly configured, production-like environment subjected to realistic, properly run, production-like test cases.
We encountered problems in two major functional areas: update and mail.
Update: When testing performance, it's reasonable to expect some gradual degradation of response as the workload on a server increases. However, in the case of update, we found a bug that showed up as strange shapes on two charts. Ultimately, the root cause was that update processes sometimes hung. We had found a nasty reliability bug.
The throughput varied from 8 Kbps to 48 Kbps and averaged 20 Kbps. The first piece of the update process had CPU utilization at 33 percent at the 25,000-user load level. The sizes of the updates were consistent but the throughput was widely distributed and in some cases quite slow, as shown in figure 2. The tail in this figure was associated with the bug, where unnaturally large amounts of data would be sent when processes hung.
The second piece of the update process had CPU utilization at 83 percent at the 25,000-user load level. The average session was about 122 seconds. Some sessions lasted quite a bit longer. As shown in figure 3, these longer sessions showed a bimodal distribution of throughput—some quite slow, some about average, and some quite fast.
In our presentation of the test results to management, we asked why the first part of the update was so much faster than the second part and why the update data size was not a factor in throughput in part one but was a factor in part two. We also asked the organization to double check our assumptions about data sizes for each part.
Mail: The mail servers had higher CPU utilization than expected. We recorded 75 percent CPU utilization at 40,000 users. This meant that the organization would need more than thirty servers (not including failover servers) to meet the twelve-month projection of constituents.
Did We Do Performance Testing Right?
Not everything went perfectly on this project. We suffered from some "disbelief in the face of bad news" behavior from development and management, in spite of the fact that the details of our tests were well known by developers and, in many cases, the components of the tests had been provided by the developers themselves.
In a truly perfect world, we would have had more time to run performance tests prior to product release. Once the product went live with known performance bugs, we had to repeat the experiments in a special test environment. In some cases, we could not re-create our results in that environment because customers were continuously using the servers (which should be a cautionary tale for all those who try to run performance tests in scaled-down environments).
Taken as a whole, performance testing was done right on this project, thanks to the performance testing best practices we applied.
If most projects with significant performance-related quality risks followed the approach used on this project, many near and outright disasters related to late-discovered performance problems could be avoided—and everyone could count on getting more sleep.
A Sad Project Story
What Would Have Happened Without Steps 1-3?
Suppose we had done none of the up-front work to create a quality design and test the individual units before starting step 4, the system performance testing.
We worked on a project prior to this Internet appliance project, one of similar size and complexity, where the architects and developers skipped steps 1 and 2 and did a poor job of step 3, the unit testing. The unit tests they did perform were run in non-representative environments and with invalid usage and load profiles. When time pressures loomed, they decided to curtail unit testing and simply deliver the partially tested software to us for system testing.
The performance tests revealed a series of fundamental design problems. The software consumed far more memory, CPU resource, and disk space than were available on the servers. These problems were found in an “onion-peeling” fashion, layer by layer, while precious project time slipped by. Since it was too late in the project to re-design the faulty software—or so project management believed—the attempted solutions all involved throwing more hardware at the servers. No sooner was one bottleneck resolved, though, than another one was discovered. Eventually the project failed, and the fact that the high-priced servers could not scale to the desired load levels was a major contributor to that failure.