Sometimes, the best answer is to rephrase the question. This was the approach that one of our biggest customers took when undertaking a new effort to improve their Release Engineering process. They first asked themselves: How can we make the process of making products faster, more reliable, and more efficient? It’s worth pausing to understand what the process is today before thinking about improving it. Whether managed by a dedicated team or not, Release Engineering is the part of a software development organization that’s responsible for actually converting the millions of lines of carefully crafted source code into a useful software product or service for the end user. More interestingly, it must also be able to show definitively what went into a release should (read: when) the need to modify it arise.
At its simplest, this means that when you run the compiler, you should take good notes. In the real world, however, RelEng intersects with virtually every aspect of development: it has large touch points with source configuration management, testing, documentation, deployment, support, and product management. The ideal release process is transparent, flexible, efficient, all-encompassing, traceable, scalable, robust, and fast. RelEng teams often maintain dozens of active branches, each on a diverse set of platforms, each requiring tens of thousands of build and test steps to complete. Googling for “Release Engineering Best Practices” and a few other permutations yields no shortage of blogs, articles, and posts chock-full of checklists with plenty of good advice: number every build, always tag your sources, save the binaries, keep track of test outputs, etc. Like any other subject that’s paradoxically both uncharted and ubiquitous, Release Engineering is ripe for internet commentary, and practitioners and pundits are happy to advise.
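The “take good notes” checklist above—number every build, tag your sources, save the binaries, keep the test outputs—can be made concrete as a single auditable record per build. Here is a minimal sketch; the field names and values are illustrative assumptions, not a prescribed schema:

```python
import hashlib
import json
import time

def make_build_record(build_number, source_tag, artifact_bytes, test_log):
    """Capture the checklist in one record: build number, source tag,
    artifact checksum, test output, and when it was all recorded."""
    return {
        "build_number": build_number,
        "source_tag": source_tag,  # the VCS tag pinned at build time
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "test_log": test_log,
        "recorded_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }

# Hypothetical build: the record is what lets you answer, months later,
# exactly what went into the release.
record = make_build_record(
    build_number=1042,
    source_tag="release-2.3.1",
    artifact_bytes=b"\x7fELF...binary...",
    test_log="128 passed, 0 failed",
)
print(json.dumps(record, indent=2))
```

Stored alongside the saved binary, a record like this is the “definitive proof” the paragraph above asks for: the checksum ties the artifact to the exact sources and test results that produced it.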
What’s conspicuously absent, however, is any suggestion of how to contain the complexity and manage the workload. It’s easy to see why. So much of RelEng is concerned with validation, proof, and consistency—it’s responsible for producing the golden master, the only build that actually matters—that no price tag in terabytes, clock cycles, or man-hours is too high, so long as the data is collected in the name of tracking every possible input and its impact on every possible output.
But what if, instead of asking what we can do, we ask what we can do without? This requires a kind of process-engineering leap of faith: a fundamental belief that speed is more valuable than quality, detail, or precision, simply because without speed you won’t be capable of delivering the quality, detail, or precision you’re looking for. The idea is not that we would have positively identified issue X sooner—as the engineers will happily tell you, there was no way to catch that without more resources, more tests, and more time, all yoked to more process—but rather that a substantial chunk of the issues we’d been working on up until X would have been discovered faster. If the ultimate goal of RelEng is to produce the flawless golden master as efficiently as possible, then it follows that we should be racing through the plastic prototypes (read: your nightly builds) that precede it as fast as we can.
What’s one surprising way we can optimize a Release Engineering process? Do the majority of the runs in one, unchanging compute environment. This means eliminating all platforms, architectures, tools, and environments except the most common one. It means carefully defining a box, which I’ll call the One True Machine (OTM), knowing precisely what’s on it, and making instances of that machine as easily accessible to everyone as a blank sheet of paper.
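“Knowing precisely what’s on it” is the operative phrase: the OTM is only useful if its contents are pinned and checkable. A minimal sketch of that idea—a manifest of pinned components and a drift check against it—follows; the component names and versions are invented for illustration:

```python
# The One True Machine, expressed as a pinned manifest. Every entry is an
# assumption for illustration; a real OTM would pin its actual inventory.
OTM_MANIFEST = {
    "os": "linux-5.15",
    "compiler": "gcc-12.2",
    "libc": "glibc-2.36",
    "python": "3.11.4",
}

def verify_instance(reported):
    """Return the components where an instance drifts from the OTM,
    mapping each drifted component to (pinned, actual)."""
    drift = {}
    for component, pinned in OTM_MANIFEST.items():
        actual = reported.get(component)
        if actual != pinned:
            drift[component] = (pinned, actual)
    return drift

# A clean clone matches exactly; a hand-modified box exposes its drift.
assert verify_instance(dict(OTM_MANIFEST)) == {}
print(verify_instance({**OTM_MANIFEST, "compiler": "clang-15"}))
```

The point of the check is that it makes “is this really an OTM instance?” a yes/no question rather than a debugging session.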
In the old days that may have meant disk-imaging systems, but today the name of the game is virtualization. The vision is simply this: at the stroke of one mouse click, a crisp new system spins up that’s ready to build the product from sources. Want to test your new product? Make another fresh system, and you can install built product and test. Want to install another version? Take another sheet from the OTM pad and have at it. Call it the power of the private clone cloud: not only is it virtual and elastic, every system is identical; the whole environment is only available in one flavor.
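The “sheet from the OTM pad” lifecycle can be sketched in a few lines: every job gets a fresh, identical instance, and nothing survives the teardown. The class and its methods below are placeholders standing in for whatever virtualization API you actually use, not a real cloud interface:

```python
import contextlib

class OTMInstance:
    """Stand-in for a freshly cloned OTM virtual machine."""

    def __init__(self, image="otm-golden-image"):
        self.image = image
        self.alive = True
        self.log = [f"cloned {image}"]

    def run(self, step):
        assert self.alive, "instance already destroyed"
        self.log.append(f"ran: {step}")

    def destroy(self):
        self.alive = False
        self.log.append("destroyed")

@contextlib.contextmanager
def fresh_instance():
    """One sheet from the OTM pad: clone, use, and always tear down."""
    inst = OTMInstance()
    try:
        yield inst
    finally:
        inst.destroy()  # no pet machines: the clone never outlives its job

with fresh_instance() as box:
    box.run("build product from sources")
    box.run("install and smoke-test")
print(box.log)
```

The context manager is the discipline in code form: a build can never inherit state from a previous one, because the previous instance no longer exists.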
The discipline enforced by this restriction is incredibly liberating: whole categories of problems (botched PATHs, conflicting libraries, clashing toolchains) are suddenly impossible. More powerfully: explicitly publishing and promoting the default build/test/release environment makes it a new