Virtual Hudson Build System: The Rest of The Story

Part 2

The second half of this Hudson-adoption case study sees the team working through some challenges and setbacks. Do they meet their goals? Find out how this virtualization journey ends.

In part 1 of this article (Virtual Hudson Continuous Build Environments: Out with the Old) I described the trials and tribulations of our Hudson build environment at my workplace. This environment started out as a simple system that could build and test our code in a few minutes. Over the years, the build time increased until we had to wait far too long for feedback from the system, and I wanted to solve this problem by trying a pool of virtualized build servers.

We have been using server virtualization around the office for about three years now. We’ve even had some virtualized servers in our production environment. This technology is great and works as advertised.

We decided to buy a single eight-core machine and split it into eight virtual build slaves. On paper, this seemed like a perfect solution to our problem, so it was surprising that we just couldn’t get the money approved for it. Eight-core servers (two CPUs with four cores each) are standard and not that expensive right now (about $3,000), especially considering the cost of having highly paid engineers wait for a build. However, the upgrade always seemed to end up on the back burner until the issue happened again.
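To put the cost argument in rough numbers, here is a minimal break-even sketch. Only the $3,000 server price comes from our situation; the team size, hourly cost, and minutes saved per day are hypothetical placeholders that you would replace with your own figures.

```python
# Rough break-even estimate for a build server purchase.
# Only SERVER_COST_USD comes from the article; every other input
# below is a hypothetical placeholder, not a measured value.

SERVER_COST_USD = 3000.0


def break_even_days(server_cost, engineers, hourly_cost, minutes_saved_per_day):
    """Return the number of working days until saved waiting time pays for the server.

    minutes_saved_per_day is the waiting time saved per engineer per working day.
    """
    daily_saving = engineers * hourly_cost * (minutes_saved_per_day / 60.0)
    if daily_saving <= 0:
        raise ValueError("the inputs must produce a positive daily saving")
    return server_cost / daily_saving


if __name__ == "__main__":
    # Hypothetical team: 6 engineers at $75/hour, each saving 30 minutes a day.
    days = break_even_days(SERVER_COST_USD, engineers=6, hourly_cost=75.0,
                           minutes_saved_per_day=30.0)
    print(f"Break-even after about {days:.1f} working days")
```

With those made-up inputs the server pays for itself in under three working weeks, which is the kind of figure that can help when asking for budget approval.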

Here We Go Again
At that point, our main compile build was generating 738 MB of data. This build ran in isolation on the master server, as moving that much data across the wire back to the master from a slave would have added to the build time, which was already at fifteen minutes.
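As a back-of-the-envelope check on that transfer concern, the sketch below estimates how long 738 MB would take to copy from a slave to the master. The link speeds and the 80 percent efficiency factor are assumptions for illustration only; I am not stating what our network actually delivered.

```python
# Estimate the time needed to copy build output from a slave to the master.
# BUILD_OUTPUT_MB and BUILD_MINUTES come from the article; the link speeds
# and the efficiency factor are assumed values for illustration.

BUILD_OUTPUT_MB = 738.0
BUILD_MINUTES = 15.0


def transfer_seconds(size_mb, link_mbit_per_s, efficiency=0.8):
    """Return the estimated seconds to move size_mb megabytes over the link.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    """
    megabits = size_mb * 8.0
    return megabits / (link_mbit_per_s * efficiency)


if __name__ == "__main__":
    for label, speed in [("100 Mbit/s", 100.0), ("1 Gbit/s", 1000.0)]:
        seconds = transfer_seconds(BUILD_OUTPUT_MB, speed)
        overhead_pct = seconds / (BUILD_MINUTES * 60.0) * 100.0
        print(f"{label}: about {seconds:.0f} s, "
              f"{overhead_pct:.1f}% on top of a {BUILD_MINUTES:.0f}-minute build")
```

Under these assumptions, a 100 Mbit/s link adds about 74 seconds per build, while a gigabit link adds about 7 seconds, so how much the copy hurts depends heavily on the network between master and slave.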

On August 2, the master started to crash. Lisa Crispin, our tester, sent an email to the team at 8 p.m. that said, “Hudson just start freaking out.” Our main Linux guy responded, “The server is seriously ill,” and included the following log information:

Aug 2 19:57:19 <syslog.err> hudson syslogd: /var/log/messages: Read-only file system  
Aug 2 19:57:19 <kern.warn> hudson kernel: megaraid: aborting-216341995 cmd=2a <c=2 t=0 l=0>  
Aug 2 19:57:19 <kern.warn> hudson kernel: megaraid abort: 216341995:19[255:128], fw owner  
Aug 2 19:57:21 <kern.warn> hudson kernel: megaraid mbox: critical hardware error!  
Aug 2 19:57:21 <kern.notice> hudson kernel: megaraid: hw error, cannot reset  
Aug 2 19:57:21 <kern.notice> hudson kernel: megaraid: hw error, cannot reset  
Aug 2 19:57:21 <kern.err> hudson kernel: sd 0:2:0:0: timing out command, waited 360s  
Aug 2 19:57:24 <kern.emerg> hudson kernel: journal commit I/O error  
Aug 2 19:57:24 <kern.emerg> hudson kernel: journal commit I/O error  
Aug 2 19:57:24 <kern.err> hudson kernel: sd 0:2:0:0: rejecting I/O to offline device  
Aug 2 19:57:24 <kern.crit> hudson kernel: EXT3-fs error (device dm-0): ext3_find_entry: reading directory #15958056 offset 0  

I read the emails and knew we had just lost the disks. The server had hardware RAID 5, but that did not save us. In the morning, our Unix guru tried to restart the box, but it did not work: the controller (a Dell PERC 4) simply started to reinitialize the drives. We had officially lost our entire configuration.

We had an old Dell PE 850 powered off in the rack, and I decided to rebuild on that while the rest of the team sharpened their pitchforks. It took about a day just to get the compile build working again. This was a slower machine, so the build time went up to seventeen minutes, but at least the team put the pitchforks away.

AgileConnection is a TechWell community.