In his Behaviorally Speaking series, Bob Aiello discusses hands-on software configuration management best practices within the context of organizational and group behavior.
Software impacts our world in many important ways. Almost everything that we touch, from the beginning to the end of our day, relies upon software. For example, airline flight controls and nuclear power plants both rely upon complex software code that must be updated from time to time, tested, and supported.
The New York City Council is currently holding hearings on a recent incident involving the 911 emergency dispatch system, and this is not the first time that such emergency dispatch systems have come under scrutiny. The software that enables the anti-missile defense system known as the Iron Dome in Israel has been credited with saving lives, and it underwent an extensive testing and validation effort. But the number of software glitches impacting trading systems and other complex financial systems could cause us to question whether our software configuration management capability is really where it should be.
Many years ago, I was interviewed by a very smart technology manager for a position supporting a major New York-based stock exchange. I went into the interview feeling pretty confident that I had the requisite skills; I had also been recommended by a manager for whom I had previously worked at another company. During the interview, I was surprised when I was asked a very pointed question about my capabilities. The manager asked me to imagine that I was supporting the software for a life-support system that my loved one depended upon. He then asked me if I was confident that I would never make a mistake that could potentially impact the person (presumably my child, parent, or spouse) who was dependent upon the life-support system. I was pretty shocked that this question was posed during a job interview. I managed to stay positive, and I told the manager that my methods worked and that, yes, I would trust them on a life-support system that could potentially impact someone I cared about. But the question stayed with me for years to come. The truth is that someone has to upgrade the software used by life-support systems, and I am not fully confident that our industry has completely reliable methods to handle this work.
From a configuration management perspective, the first step in software safety must be to establish a trusted base that spans everything from the system software to the applications that are integrated with the hardware devices. The trusted base must start from the lowest levels of the system, including the firmware, operating system, and even the hardware itself. Applications must be built, packaged, and deployed deterministically to the trusted base in a manner that ensures that we know exactly what code is to be deployed and that we can verify that the correct code was indeed deployed to the target environment. Equally important is verifying that no unauthorized changes have occurred and that the trusted base is verifiable and fully tested. If you had a pacemaker that required software updates, obviously it would be essential that you could rely upon there being a trusted base that enables the pacemaker to function reliably and correctly.
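The verification step described above can be illustrated with a small sketch. It assumes that a release is simply a directory of files and that a baseline is a mapping of relative file paths to SHA-256 digests. The function names (`build_manifest`, `verify_deployment`) are hypothetical and chosen only for illustration. A real trusted base would also need to cover firmware and the operating system, and the manifest itself would have to be protected (for example, by signing it), which this sketch does not attempt.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest.

    Hypothetical helper: run this against the packaged release to
    record exactly what code is supposed to be deployed.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_deployment(manifest: dict[str, str], target: Path) -> dict[str, list[str]]:
    """Compare a deployed directory tree against the baseline manifest.

    Returns three sorted lists: files that are missing from the target,
    files present in the target but not in the baseline, and files
    whose content differs from the baseline.
    """
    deployed = build_manifest(target)
    return {
        "missing": sorted(set(manifest) - set(deployed)),
        "unexpected": sorted(set(deployed) - set(manifest)),
        "modified": sorted(
            name
            for name in manifest.keys() & deployed.keys()
            if manifest[name] != deployed[name]
        ),
    }
```

In this sketch, a deployment would be considered verified only when all three lists come back empty; any entry under "unexpected" or "modified" is a candidate unauthorized change that needs to be investigated.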
Recent outages at major stock exchanges and trading firms suggest that many complex financial systems do not have an established trusted computing base, and that gap has resulted in very steep losses for some firms and affected thousands of people. The good news is that we actually do know how to build, package, and deploy software reliably. We also know how to verify that the right code was deployed and that there are no unauthorized changes. These best practices are precisely what we discuss in application build, package, and deployment, including DevOps, although many firms struggle to implement them successfully. The key to success is to start from the beginning.
In my consulting work, I often find that companies actually do know what has to be done to reliably build, package, and deploy software successfully. The problem is that they often begin doing the right thing much too late in the application lifecycle. W. Edwards Deming teaches us that quality must be built in from the beginning. The same is especially true when considering software safety.