Virtual Hudson Build System: The Rest of the Story

Part 2
Summary: The second half of this Hudson-adoption case study sees the team working through some challenges and setbacks. Do they meet their goals? Find out how this virtualization journey ends.

In part 1 of this article (Virtual Hudson Continuous Build Environments: Out with the Old) I described the trials and tribulations of our Hudson build environment at my workplace. This environment started out as a simple system that could build and test our code in a few minutes. Over the years, the build time increased until we had to wait far too long for feedback from the system, and I wanted to solve this problem by trying a pool of virtualized build servers.

We have been using server virtualization around the office for about three years now. We’ve even had some virtualized servers in our production environment. This technology is great and works as advertised.

We decided to buy a single eight-core machine and split it into eight virtual build slaves. On paper, this seemed like a perfect solution to our problem, so it was surprising that we just couldn’t get the money approved for it. Eight-core servers (two CPUs with four cores each) are standard and not that expensive right now (about $3,000), especially considering the cost of having highly paid engineers wait for a build. Still, the upgrade always seemed to end up on the back burner, at least until disaster struck again.

Here We Go Again
At that point, our main compile build was generating 738 MB of data. This build ran in isolation on the master server, as moving that much data across the wire back to the master from a slave would have added to the build time, which was already at fifteen minutes.
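As a rough sanity check on that decision, here is a back-of-the-envelope sketch in Python. The 738 MB artifact size comes from the build above; the link speeds and the 70 percent efficiency factor are assumptions for illustration, not measurements from our network.

# Estimate the time to copy the compile artifacts from a slave back to
# the master after each build. The 738 MB figure is from our build;
# the link speeds and efficiency factor are assumed for illustration.
artifact_mb = 738

def transfer_minutes(link_mbit_per_s, efficiency=0.7):
    # Assume the link sustains only a fraction of its rated speed.
    seconds = (artifact_mb * 8) / (link_mbit_per_s * efficiency)
    return seconds / 60

for link in (100, 1000):
    print(f"{link:>4} Mbit/s link: about {transfer_minutes(link):.1f} extra minutes per build")

On anything slower than gigabit Ethernet, the copy alone adds a noticeable chunk to a fifteen-minute build, which is why keeping this job on the master made sense.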

On August 2, the master started to crash. Lisa Crispin, our tester, sent an email to the team at 8 p.m. that said, “Hudson just started freaking out.” Our main Linux guy responded, “The server is seriously ill,” and included the following log information:

Aug 2 19:57:19 <syslog.err> hudson syslogd: /var/log/messages: Read-only file system  
Aug 2 19:57:19 <kern.warn> hudson kernel: megaraid: aborting-216341995 cmd=2a <c=2 t=0 l=0>  
Aug 2 19:57:19 <kern.warn> hudson kernel: megaraid abort: 216341995:19[255:128], fw owner  
Aug 2 19:57:21 <kern.warn> hudson kernel: megaraid mbox: critical hardware error!  
Aug 2 19:57:21 <kern.notice> hudson kernel: megaraid: hw error, cannot reset  
Aug 2 19:57:21 <kern.notice> hudson kernel: megaraid: hw error, cannot reset  
Aug 2 19:57:21 <kern.err> hudson kernel: sd 0:2:0:0: timing out command, waited 360s  
Aug 2 19:57:24 <kern.emerg> hudson kernel: journal commit I/O error  
Aug 2 19:57:24 <kern.emerg> hudson kernel: journal commit I/O error  
Aug 2 19:57:24 <kern.err> hudson kernel: sd 0:2:0:0: rejecting I/O to offline device  
Aug 2 19:57:24 <kern.crit> hudson kernel: EXT3-fs error (device dm-0): ext3_find_entry: reading directory #15958056 offset 0  

I read the emails and knew we had just lost the disks. The array was hardware RAID 5, but that was no help here. In the morning, our Unix guru tried to restart the box, but it did not work; the controller (a Dell PERC 4) just started reinitializing the drives. We had officially lost our entire configuration.

We had an old Dell PE 850 powered off in the rack, and I decided to rebuild on that while the rest of the team was sharpening the pitchforks. It took about a day just to get the compile build back working again. This was a slower machine, so the build time went up to seventeen minutes, but at least the team put the pitchforks away.

Time to Implement Something New
It took a long time to rebuild everything and, at the same time, we had some major software architecture changes that made it hard to determine whether a build was failing because of a new Hudson configuration issue or because of our code changes.

The good news was that this failure prompted management to approve not only our original request but also a new Hudson master to replace the failed box. After some debates and a lot of planning, we decided to make everything virtual—even the master—in order to guard against another hardware failure that we knew would happen at some time in the future. If the system crashed again, any virtual machines (VMs) on the crashed box could migrate to the working box. If we did this correctly, we would no longer have any downtime due to hardware failures.

The Dawn of a New Generation
Before I could commit 100 percent to the virtualization path, I needed the performance data to back up the decision. Recall from part 1 that our precrash Hudson server could do the compile in fifteen minutes; the old, postcrash server took seventeen. But I needed to know how much overhead virtualization would add. The following table shows the results of my performance testing:

 

Server                                                          Time
Hudson precrash                                                 15 minutes
Hudson postcrash                                                17 minutes
New eight-core server (not virtualized)                         10 minutes
New eight-core server (virtualized)                             12 minutes
New eight-core server (virtualized, VM storage on iSCSI SAN)    13 minutes

The holy grail of virtualization, at least in my mind, is being able to move a VM from server to server without stopping it. To do that, you need some sort of shared storage between the virtualized hosts. The last entry in the table above is a virtualized host running its VMs from an iSCSI SAN. Considering the flexibility we gain with that, thirteen minutes is an excellent result, and the overhead of virtualization is well worth it. We will be able to decrease our build time further by parallelizing the builds even more, and adding capacity is simple, too: we just add more virtual hosts.
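To put some numbers on that trade-off, here is a quick sketch in Python that uses only the build times from the table above and works out each configuration’s overhead relative to the bare-metal run:

# Overhead of each configuration relative to the bare-metal build,
# using the times from the table above (in minutes).
times = {
    "Hudson precrash":                      15,
    "Hudson postcrash":                     17,
    "New eight-core server, bare metal":    10,
    "New eight-core server, virtualized":   12,
    "Virtualized, VM storage on iSCSI SAN": 13,
}

baseline = times["New eight-core server, bare metal"]
for name, minutes in times.items():
    overhead = (minutes - baseline) / baseline * 100
    print(f"{name:<38} {minutes:>2} min  ({overhead:+.0f}% vs. bare metal)")

Virtualization costs us roughly 20 percent of raw build speed, and putting the VM images on the iSCSI SAN costs another 10, yet even the fully virtualized configuration is still two minutes faster than the precrash master.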

Conclusion
We didn’t make our seven-minute build time goal, and I’m not sure we will ever see that short a time again. We probably could if we hadn’t virtualized any of the build servers, but that is a price we are willing to pay for a more reliable build system. Overall, though, our builds will come back faster, because the build queue should no longer get as deep.

This solution is very effective at getting every single ounce of capacity out of a server (the bosses will like that). Even though we didn’t spend a lot of money on this system and it doesn’t have the fastest servers on the block, it is what we have for now and it works well.

Read Virtual Hudson Build Environments, Part 1.
