Lessons from Optimizing Computer Systems Applied to Human Teams


Payson Hall wasn’t always a project management nerd. Long ago, he was an operating systems nerd. Here, he shares lessons learned from optimizing data system performance and gives insights into how they might relate to human team management and leadership.

Early in my career, I was part of a team responsible for optimizing computer system performance for an insurance company mainframe. We had instrumentation about several aspects of system behavior, but the bottom line was throughput—we had to be able to do a day’s worth of processing in less than twenty-four hours or we had a “bathtub problem.” (If water flows into a bathtub faster than it drains out, you have a bathtub problem that will be very messy and expensive if it persists.) These were the days when computers were large and very expensive resources shared by multiple tasks.

There were several knobs available to our team to try to eke out better performance before we had to throw large stacks of cash at bigger engines, more disk space, or more memory. We could balance the I/O load on any specific device or channel by mixing very frequently accessed data with rarely accessed data on the same device. We could position files so they were congruent with the geometry of the storage devices. When we had storage devices with different speeds, we could place things like paging data sets or database indices on the fastest devices with the largest solid-state buffers to decrease I/O wait times. We could reduce virtual memory disk activity by rebalancing the amount of real memory allocated to certain tasks or routines. We could also adjust the job scheduler parameters that determined the size of the time slices (a few microseconds or a few milliseconds) allocated to computationally intensive tasks that didn't do much I/O. The goal wasn't to share resources fairly; it was to maximize throughput at minimal cost, getting as much work done as possible with the equipment on hand.

One of the biggest sources of resource consumption on those systems was something called “task-switching overhead.” This occurred when an active task became blocked because it:

  • Consumed the processor resources it was allocated this “turn” (time slice) and had to go to the end of the line and wait for another turn
  • Had a page fault (needed to access a virtual memory page that was not memory resident and had to be retrieved from secondary storage)
  • Performed an I/O operation that required a wait (for a disk read, tape read, user input, or telecommunication)

When any of these events occurred, the operating system had to store the current state of the now-blocked process so that it could be restored and restarted later, when it was ready to resume. Then the operating system could restore the state of the next ready-to-run process in the queue, and that process would pick up where it had left off. This task switching could occur dozens or even hundreds of times per second. Recall that the expensive processor was running orders of magnitude faster than the peripheral devices it was communicating with. Even though task-switching overhead often accounted for a significant share of overall processor consumption, it was cost effective because it maximized throughput for I/O-intensive applications and helped ensure that computationally intensive applications didn't starve everything else.
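
To make the mechanics concrete, here is a toy round-robin scheduler sketched in Python. It only illustrates the store-and-restore cycle described above; the task names, time slice, and switch cost are invented for the example and are not drawn from any real mainframe.

    from collections import deque

    class Task:
        def __init__(self, name, work_units):
            self.name = name
            self.remaining = work_units   # abstract units of CPU work left
            self.saved_context = None     # stand-in for registers, program counter, etc.

    def run(tasks, time_slice=2, switch_cost=1):
        """Round-robin: run a task for one time slice, store its state,
        pay the task-switching overhead, and restore the next ready task."""
        ready = deque(tasks)
        useful_work = overhead = 0
        while ready:
            task = ready.popleft()                 # "restore" the next ready task
            used = min(time_slice, task.remaining)
            task.remaining -= used
            useful_work += used
            if task.remaining > 0:                 # out of time slice, not finished
                task.saved_context = f"state of {task.name}"   # "store" its state
                overhead += switch_cost            # the cost of the switch itself
                ready.append(task)                 # back to the end of the line
        return useful_work, overhead

    work, cost = run([Task("A", 6), Task("B", 3), Task("C", 5)])
    print(f"useful work: {work} units, switching overhead: {cost} units")

Even in this toy, the overhead is pure bookkeeping: necessary to keep the processor busy, but not work the business cares about.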

The dark art of our job as systems programmers was balancing I/O, tuning memory management, and tinkering with the scheduling algorithm to find the "Goldilocks balance." If the mix and tuning parameters caused too much task switching, throughput went down. If tasks weren't swapped often enough, processor utilization fell and throughput went down as tasks waited for I/O to complete. What I learned from the experts who trained me was something I didn't have language for at the time: a layman's approximation of chaos theory. Tuning was a multidimensional problem, and we needed to keep processor utilization below roughly 85 percent; above that threshold the system became "chaotic" and throughput turned erratic, often dropping precipitously.

Chaos theory was emerging into popular culture around this same time. Apart from producing beautiful fractals, chaos theory describes how flow through a system can become unpredictable, and can degrade sharply, once its parameters rise above certain fuzzy thresholds. Water flowing through a pipe illustrates the idea: at lower pressures the water flows smoothly, and flow increases proportionally with pressure (this is "laminar flow"). With more pressure, turbulence appears in the pipe, and at some point a slight increase in pressure results in a slight decrease in throughput. (Search for "turbulence and chaos theory" for more insight.)

For our optimized mainframe, that chaotic point was somewhere near 85 percent processor utilization. Beyond that point, performance became chaotic and less predictable, and throughput tended to drop. We could twist knobs and dials to inch up processor utilization from 10 percent to 85 percent in relatively steady increments, but beyond 85 percent, the system would often jump about and then “red line” at 100 percent, with a huge drop in throughput.
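
Chaos theory aside, even simple queueing math hints at why that knee exists. The small Python calculation below is my own back-of-the-envelope illustration, using a textbook M/M/1 queueing formula rather than anything from the original tuning work: as utilization climbs toward 100 percent, the average time a job spends in the system grows explosively rather than linearly.

    service_time = 1.0  # arbitrary units of work per job

    # In an M/M/1 queue, average time in the system = service_time / (1 - utilization).
    for utilization in (0.10, 0.50, 0.70, 0.85, 0.95, 0.99):
        time_in_system = service_time / (1 - utilization)
        print(f"{utilization:.0%} busy -> average time in system: {time_in_system:6.1f}x service time")

At 50 percent utilization a job takes about twice its service time; at 85 percent, roughly seven times; at 99 percent, a hundred times. The model is far simpler than a real mainframe workload, but the shape of the curve is consistent with the erratic behavior described above.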

In the years since, as I have transitioned from systems programmer to project manager, I have observed similar phenomena in the work patterns of individuals and project teams. Most people show up ready and willing to work hard on interesting problems. Trouble can arise if we try too hard to increase productivity beyond a certain point, and our efforts can cause a counterintuitive decrease in throughput.

In project terms, this might take the form of extended periods of mandatory overtime. In the short term, for a well-defined objective, overtime can cause a temporary increase in human system throughput. Attempting to sustain extended periods of high-pressure performance always seems to backfire, resulting in emergent issues like:

  • Burnout
  • Illness
  • Staff turnover
  • Increased errors and rework
  • Work/life conflict with partners and family members
  • Increasing tension and friction among team members

I have seen these emergent issues destroy, over the course of a few months, teams that had been highly efficient and effective. Teams have a "maximum speed," and if they are pushed much beyond it for extended periods, the results can be devastating. Unlike a computer system, a broken team can't be put right just by changing a few tuning parameters.

The other productivity lesson I took away from my performance tuning days was about the overhead of task switching in a multitasking environment. Much has been said in popular culture about the multitasking nature of millennials—the first generation raised on video games and the Internet, capable of listening to music, juggling, dancing, and writing a novel all at the same time. Cognitive psychologists (who aren’t hyping popular culture) tell us that is bunk. Humans don’t task switch nearly as well as computers do. Multitasking consumes lots of cycles but generally doesn’t improve human throughput. Imagine trying to pen a novel, put together a puzzle, and write a computer program if you had to switch from one task to the next every two minutes. While the computer can restore the context of a paused task in a few microseconds, it takes humans several minutes to get back up to speed when restarting a stopped task. The more complex the task, the longer a person is operating at reduced capacity.
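
To put rough numbers on that, here is a small Python sketch comparing finishing tasks one at a time with rotating among them every half hour. The task list, the thirty-minute slice, and the fifteen-minute ramp-up penalty are invented for illustration; they are not research figures.

    # Minutes of focused work each (hypothetical) task actually needs.
    tasks = {"novel": 120, "puzzle": 90, "program": 150}
    RAMP_UP = 15   # minutes lost regaining context after every switch (assumed)
    SLICE = 30     # minutes worked on a task before switching (assumed)

    def one_at_a_time(tasks):
        # Finish each task before starting the next: pay the ramp-up once per task.
        return sum(RAMP_UP + work for work in tasks.values())

    def interleaved(tasks, slice_minutes):
        # Rotate among the tasks every slice: pay the ramp-up on every switch.
        remaining = dict(tasks)
        total = 0
        while remaining:
            for name in list(remaining):
                worked = min(slice_minutes, remaining[name])
                total += RAMP_UP + worked
                remaining[name] -= worked
                if remaining[name] == 0:
                    del remaining[name]
        return total

    print("one at a time:", one_at_a_time(tasks), "minutes")
    print("interleaved  :", interleaved(tasks, SLICE), "minutes")

With these made-up numbers, the same 360 minutes of real work takes 405 minutes done sequentially and 540 minutes done in thirty-minute slices; every extra switch is pure ramp-up.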

All this history is background for a simple truth about human productivity and throughput: Sometimes, to go faster, you need to slow down. To maximize productivity, try to minimize multitasking. In both computer systems and human affairs, prioritization is key: prioritize the important tasks and try to eliminate, defer, or delegate the low-priority ones. These are management secrets we can learn from operating systems, because the human processor is still the most expensive component of our systems.

User Comments

Steven M. Smith

Hi Payson, I enjoyed reading your article. It's excellent. Your advice hits the bull's-eye—task switching is a significant problem. Humans, even Gen Xers, will never be as efficient as computers at handling interrupts. So an environment that minimizes interrupts maximizes throughput, all things being equal.

The challenge is whether the people in the organization whose opinion counts the most prefer to reward the appearance of work over actual throughput. When we tuned mainframes decades ago, there was clarity about whether you completed the required work in the 24-hour window. If the feedback was bad, action was taken immediately. The same effect doesn't happen with the appearance of work: feedback is blurry and arrives slowly, and its interpretation may be spin-doctored, delaying any action. Not surprisingly, the people doing the work behave chaotically.

I enjoyed how you linked chaos theory to tuning mainframe computer systems. As a former systems programmer, I recall the rules of thumb (ROTs) you mentioned, but I hadn't linked them to chaos theory. They fit nicely.

Thank you for bringing back memories of computing's earlier days and connecting them to today.

July 1, 2014 - 4:34pm
