In his Behaviorally Speaking series, Bob Aiello discusses hands-on software configuration management best practices within the context of organizational and group behavior.
Software impacts our world in many important ways. Almost everything that we touch, from the beginning to the end of our day, relies upon software. For example, airline flight controls and nuclear power plants both rely upon complex software code that must be updated from time to time, tested, and supported.
The New York City Council is currently holding hearings on a recent incident involving the 911 emergency dispatch system, and this is not the first time that such emergency dispatch systems have come under scrutiny. The software that enables the anti-missile defense system known as the Iron Dome in Israel underwent an extensive testing and validation effort and has been credited with saving lives. But the number of software glitches impacting trading systems and other complex financial systems should make us question whether our software configuration management capability is really where it should be.
Many years ago, I was interviewed by a very smart technology manager for a position supporting a major New York-based stock exchange. I went into the interview feeling pretty confident that I had the requisite skills; I had also been recommended by a manager for whom I had previously worked at another company. During the interview, I was surprised to be asked a very pointed question about my capabilities. The manager asked me to imagine that I was supporting the software for a life-support system that my loved one depended upon. He then asked me if I was confident that I would never make a mistake that could potentially impact the person (presumably my child, parent, or spouse) who was dependent upon the life-support system. I was pretty shocked to hear this question during a job interview. I managed to stay positive, and I told the manager that my methods worked and that, yes, I would trust them on a life-support system that could potentially impact someone I cared about. But the question stayed with me for years to come. The truth is that someone has to upgrade the software used by life-support systems, and I am not fully confident that our industry has completely reliable methods to handle this work.
From a configuration management perspective, the first step in software safety must be to establish the trusted base, from the system’s software to the applications that are integrated with the hardware devices. The trusted base must start from the lowest levels of the system, including the firmware, operating system, and even the hardware itself. Applications must be built, packaged, and deployed deterministically to the trusted base in a manner that ensures that we know exactly what code is to be deployed and that we can verify that the correct code was indeed deployed to the target environment. Equally important is verifying that no unauthorized changes have occurred and that the trusted base is verifiable and fully tested. If you had a pacemaker that required software updates, it would obviously be essential that you could rely upon a trusted base that enables the pacemaker to function reliably and correctly.
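One way to make "verify that the correct code was deployed" concrete is to record a cryptographic digest for every file at build time and compare the deployed tree against that record. The following is only a minimal sketch under stated assumptions: the function names (`sha256_of`, `build_manifest`, `verify_deployment`) and the in-memory manifest format are hypothetical and chosen for illustration, and a real system would also need to protect the manifest itself (for example, by signing it).

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_deployment(manifest: Dict[str, str], target: Path) -> Dict[str, List[str]]:
    """Compare a deployed tree against the manifest recorded at build time."""
    deployed = build_manifest(target)
    return {
        # Files the build produced that never reached the target.
        "missing": sorted(set(manifest) - set(deployed)),
        # Files on the target that the build never produced.
        "unexpected": sorted(set(deployed) - set(manifest)),
        # Files present in both places whose contents differ.
        "modified": sorted(
            name
            for name in set(manifest) & set(deployed)
            if manifest[name] != deployed[name]
        ),
    }
```

In this sketch, a deployment would be considered verified only when all three lists come back empty; any entry under "unexpected" or "modified" is a candidate unauthorized change that needs to be explained.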
Recent outages at major stock exchanges and trading firms have shown that many complex financial systems do not have an established trusted computing base, and that has directly resulted in very steep losses for some firms and impacted thousands of people. The good news is that we actually do know how to build, package, and deploy software reliably. We also know how to verify that the right code was deployed and that there are no unauthorized changes. These best practices are precisely what we discuss in application build, package, and deployment (including DevOps), although many firms struggle to implement them successfully. The key to success is to start from the beginning.
In my consulting work, I often find that companies actually do know what has to be done to reliably build, package, and deploy software. The problem is that they often begin doing the right thing much too late in the application lifecycle. W. Edwards Deming teaches us that quality must be built in from the beginning. The same is especially true when considering software safety.
Successful build and release engineers understand that smoke testing after a deployment is essential for a successful build and release process. When the software matters, you need to verify and validate the code from the very beginning to the end of the lifecycle. This means that your build stream should include unit testing, functional and non-functional testing (e.g., performance testing), and, of course, comprehensive regression testing. Good configuration management practices allow you to build a version of the code that can be instrumented for comprehensive code analysis and exhaustive automated testing. The truth is that these best practices are most successful when they are supported from the very beginning of the lifecycle and are a fundamental part of the culture of the organization. Don't forget that the build and deploy pipeline itself must also be verifiable and trusted.
When I create an automated build and deployment system, I start from the ground up, verifying the operating system itself and all of the system dependencies. I only trust the trusted base if I am able to verify it on a continuous basis, and for me this becomes part of environment management (and monitoring). For example, the Center for Internet Security (CIS) provides an excellent consensus standard that explains in great detail exactly how to create a secure Linux operating system. You will find that the consensus standard also provides example code for verifying that the security baseline is configured as it should be. Successful security engineering involves both configuring the operating system correctly and verifying on an ongoing basis that it stays configured in a secure way. This is a core aspect of environment monitoring, and is essential for ensuring the trusted base.
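The idea of verifying on an ongoing basis that a system stays configured as intended can be sketched as a simple drift check that compares observed settings against an expected baseline. This is only an illustration: the baseline entries, the simplified "Key value" text format, and the function names (`parse_settings`, `find_drift`) are assumptions made for the example and are not taken from the CIS consensus standard or from any real tool.

```python
from typing import Dict, List

# Hypothetical baseline: illustrative settings only, not quoted from any
# published benchmark.
EXPECTED_BASELINE: Dict[str, str] = {
    "PermitRootLogin": "no",
    "PasswordAuthentication": "no",
}


def parse_settings(text: str) -> Dict[str, str]:
    """Parse simplified 'Key value' lines, skipping blanks and '#' comments."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def find_drift(baseline: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return one finding per baseline setting that is absent or different."""
    findings: List[str] = []
    for key, expected in sorted(baseline.items()):
        found = actual.get(key)
        if found is None:
            findings.append(f"{key}: not set (expected {expected!r})")
        elif found != expected:
            findings.append(f"{key}: found {found!r}, expected {expected!r}")
    return findings
```

Run on a schedule and wired into monitoring, a check along these lines would raise an alert whenever `find_drift` returns a non-empty list, which is the point at which the trusted base can no longer be assumed.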
Software safety requires that systems be built and configured in a secure and reliable way. Changes need to be tracked and verified, which is essentially the purpose of the physical configuration audit. I hope you will contact me to share your views on software safety best practices and get involved with the community-based efforts to update software safety standards!
As I write this article, I am preparing for a full-day class at the upcoming Nuclear Information Technology Strategic Leadership (NITSL) conference. NITSL is a nuclear industry group of all nuclear generation utilities that exchange information related to information technology management and quality issues. I am also working to recruit technology professionals to help update two of the IEEE industry standards related to software safety (please contact me if you might be interested in serving on one of the standards working groups). I would also like to share some of my thoughts on what we need to do in order to establish suitable procedures to support software safety.