In his Behaviorally Speaking series, Bob Aiello discusses hands-on software configuration management best practices within the context of organizational and group behavior.
Environment management is an essential function in any complex, mission-critical system. Unfortunately, environment management is often overlooked and, even when addressed, usually only handled in the simplest way. Keeping an eye on your environment is actually one of the most important functions for IT operations. If you spend the time understanding what really needs to be monitored and establish effective ways of communicating events, then your systems will be much more reliable—and you will likely get a lot more sleep without so many of those painful calls in the middle of the night. Here's how to get started with environment management.
The ITIL v3 framework provides pretty good guidance on how to implement an effective environment management function. The first step is to identify which events should be monitored and establish an automated framework for communicating the information to the stakeholders who are responsible for addressing problems when they occur. The most obvious environment dependencies are basic resources such as available memory, disk space, and processor capacity. If you are running low on memory, disk space, or any other physical resource, then obviously your IT services may be adversely impacted. Most organizations understand that employees need to monitor key processes and identify and respond to abnormal process termination. Nagios is one of the popular tools to monitor processes and communicate events that may be related to processes being terminated unexpectedly.
There are many other environmental dependencies, such as ports being opened, that also need to be monitored on a constant basis. I have seen production outages caused by a security group closing a port because there was no record that the port was needed for a particular application. These are fairly obvious dependencies, and most IT shops are well aware of these requirements. But what about the more subtle environment dependencies that need to be addressed?
I have seen situations where databases stopped working because the user account used by the application to access the database locked up. Upon investigation, we found that the UAT user account was the same account used in production. In most ways, you want UAT and production to match, but in this case locking up the user account in UAT took down production. You certainly don't want to use the same account for both UAT and production, and it may be a good idea to set up a job that checks to ensure that the database account is always working.
Market data feeds are another example of an environment dependency that may impact your system. This one can be tricky because you may not have control over a third-party vendor who supplies you with data. This is all the more reason why you want to monitor your data feeds and notify the appropriate support people if there is a problem. Cloud-based services may also provide some challenges because you may not always be in control of the environment and might have to rely on a third party for support. Establishing a service-level agreement (SLA) is fundamental when you are dependent on another organization for services. You may also find yourself trying to figure out how your cloud-based resources actually work and what you need to do when your service provider makes changes that may be unexpected and not completely understood. I had this experience myself when trying to puzzle my way through all of the options for the Amazon Cloud. In fact, it took me a few tries to figure out how to turn off all of the billable options such as storage and fixed IPs when the project was over. I am not intending to criticize Amazon per se but even its own help desk had trouble locating what I needed to remove so that I would stop getting charged for resources that I wasn't using.
To be successful with environment management, you need to establish a knowledge base to gather the essential technical information that may be understood by a few people on the team. Documenting and communicating this information is an important task and often requires effective collaboration among your development, data security, and operations teams.