Payson Hall wasn’t always a project management nerd. Long ago, he was an operating systems nerd. Here, he shares lessons learned from optimizing data system performance and gives insights into how they might relate to human team management and leadership.
Early in my career, I was part of a team responsible for optimizing computer system performance for an insurance company mainframe. We had instrumentation about several aspects of system behavior, but the bottom line was throughput—we had to be able to do a day’s worth of processing in less than twenty-four hours or we had a “bathtub problem.” (If water flows into a bathtub faster than it drains out, you have a bathtub problem that will be very messy and expensive if it persists.) These were the days when computers were large and very expensive resources shared by multiple tasks.
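The bathtub arithmetic can be made concrete with a minimal sketch. The function name and the figures used here are hypothetical and chosen only for illustration; the sketch assumes each day's workload is expressed as the hours of machine time needed to process it.

```python
def backlog_after(days, hours_of_work_per_day, hours_available_per_day=24.0):
    """Return the unprocessed backlog (in hours of work) after `days` days.

    Assumption: work arrives at a constant daily rate and any unfinished
    work carries over to the next day. The backlog can never go below zero.
    """
    backlog = 0.0
    for _ in range(days):
        backlog = max(0.0, backlog + hours_of_work_per_day - hours_available_per_day)
    return backlog


if __name__ == "__main__":
    # A day's work that needs 26 hours falls further behind every day.
    for day in (1, 5, 30):
        print(day, backlog_after(day, 26.0))
```

Under these assumptions, any workload that needs more than twenty-four hours per day produces a backlog that grows without limit, while a workload that needs less never accumulates one.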
There were several knobs available to our team to try to eke out better performance before we had to throw large stacks of cash at bigger engines, more disk space, or more memory. We could try to balance the amount of I/O to any specific device or channel by placing frequently accessed data on the same devices as rarely accessed data, so that no single device or channel carried a disproportionate share of the load. We could position files so they were congruent with the geometry of the storage devices. When we had storage devices with different speeds, we could position things like paging data sets or database indices on the highest speed devices with the largest solid-state buffers to decrease I/O wait times. We could reduce virtual memory disk activity by balancing the amount of real memory allocated to certain tasks or routines. We could also adjust the job scheduler parameters that decided the size of the time slices (a few microseconds or a few milliseconds) allocated to computationally intensive tasks that didn’t do much I/O. The goal wasn’t to share resources fairly; it was to maximize throughput with minimal cost, getting as much work done as possible with the equipment on hand.
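The idea of spreading I/O across devices can be sketched as a simple greedy placement: take the data sets from busiest to quietest and put each one on whichever device currently carries the least load. This is an illustrative heuristic, not the procedure the team actually used, and the data set names and access rates are invented for the example.

```python
import heapq


def balance_io(access_rates, device_count):
    """Assign data sets to devices so that total access rates are roughly even.

    access_rates: dict mapping a data set name to its accesses per second
                  (hypothetical figures).
    Returns a dict mapping device index to the list of data sets placed on it.
    """
    # Min-heap of (current total load, device index): the least loaded
    # device is always at the top.
    heap = [(0, device) for device in range(device_count)]
    heapq.heapify(heap)
    placement = {device: [] for device in range(device_count)}

    # Place the busiest data sets first, each on the least loaded device.
    by_rate = sorted(access_rates.items(), key=lambda item: item[1], reverse=True)
    for name, rate in by_rate:
        load, device = heapq.heappop(heap)
        placement[device].append(name)
        heapq.heappush(heap, (load + rate, device))
    return placement


if __name__ == "__main__":
    rates = {
        "claims_index": 900,
        "policy_master": 500,
        "audit_log": 300,
        "archive": 100,
        "rates_table": 100,
    }
    for device, names in balance_io(rates, 2).items():
        print(device, names, sum(rates[n] for n in names))
```

A greedy pass like this does not guarantee the best possible balance, but it naturally ends up pairing the hottest data with the coldest, which is the effect described above.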
One of the biggest sources of resource consumption on those systems was something called “task-switching overhead.” This occurred when an active task had to give up the processor because it:
- Consumed the processor resources it was allocated this “turn” (time slice) and had to go to the end of the line and wait for another turn
- Had a page fault (needed to access a virtual memory page that was not memory resident and had to be retrieved from secondary storage)
- Performed an I/O operation that required a wait (for a disk read, tape read, user input, or telecommunication)
When any of these events occurred, the operating system had to store the current state of the interrupted process so that the process could be restored and restarted later when it was ready to resume. Then the operating system could restore the state of the next ready-to-run process in the queue and that process would pick up where it had left off. This task switching could occur dozens or even hundreds of times per second. Keep in mind that the expensive processor was running orders of magnitude faster than the peripheral devices it was communicating with, so switching to another task kept the processor busy instead of leaving it idle while an I/O operation completed. Even though task-switching overhead often accounted for a significant share of overall processor consumption, it was cost-effective because it maximized throughput for I/O-intensive applications and helped ensure that computationally intensive applications didn’t starve everything else.
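The trade-off between time slice size and task-switching overhead can be illustrated with a toy round-robin simulation. It is a sketch under simplifying assumptions, not a model of any real scheduler: tasks are purely computational (no page faults or I/O waits), every switch costs the same fixed amount of processor time, and a switch is charged at the end of every turn as long as work remains in the queue. All names and figures are hypothetical.

```python
from collections import deque


def simulate_round_robin(cpu_needs_ms, time_slice_ms, switch_cost_ms):
    """Run compute-bound tasks round-robin.

    cpu_needs_ms: processor time each task needs, in milliseconds.
    Returns (useful_ms, overhead_ms, switches).
    """
    ready = deque(cpu_needs_ms)  # queue of remaining processor time per task
    useful = 0.0
    overhead = 0.0
    switches = 0
    while ready:
        remaining = ready.popleft()
        run = min(remaining, time_slice_ms)  # run until done or slice expires
        useful += run
        remaining -= run
        if remaining > 0:
            ready.append(remaining)  # back to the end of the line
        if ready:
            # Save the outgoing task's state and restore the next one.
            overhead += switch_cost_ms
            switches += 1
    return useful, overhead, switches


if __name__ == "__main__":
    tasks = [10.0, 10.0]
    for time_slice in (10.0, 5.0, 1.0):
        useful, overhead, switches = simulate_round_robin(tasks, time_slice, 1.0)
        share = overhead / (useful + overhead)
        print(f"slice={time_slice} ms switches={switches} overhead share={share:.0%}")
```

In this simplified model, shrinking the time slice increases the number of switches and therefore the share of processor time spent on overhead, which is why the slice size was one of the knobs worth tuning.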