Mistakes happen. It's how you respond to them that matters. Teams might react to a bug with panic and blame, leading to a quickly hacked fix and possibly more issues. Taking the time to investigate and learn turns problems into better processes and practices and a higher-quality product.
There’s one thing I know for sure: We should never stop learning. Even if our team gets close to zero defects in production, ugly bugs can still escape. Mistakes can happen. However, if we can learn from our mistakes, we can improve our processes and practices and avoid making similar mistakes in the future.
The key to turning problems into improvements is a learning culture. In the past, I’ve worked on teams where managers looked for someone to blame and punished team members for problems. The result was a team where everyone was afraid to raise issues for fear of getting into trouble. Problems were swept under the rug, with no chance for improvement.
Fortunately, my current team enjoys an enlightened management that gives us time to experiment, research, and continually learn to do better work. When a defect is found in production, we don’t waste time pointing fingers. We research the defect, fix it, and try something new to improve future development. What follows is a recent learning opportunity we experienced.
A Bit of Background
We’re a team of four developers, two testers, a DBA, two system administrators, a manager, and a ScrumMaster. We develop and support a Web-based financial services application. Six years ago, we had a buggy legacy application. Over the years, we’ve greatly improved the quality of our software, delivering what our customers want with very few defects slipping past our testing and coding process. We use Scrum and XP practices, along with some techniques borrowed from lean and kanban. We know our business domain well and work closely with our stakeholders.
In our last sprint, one of the operations managers discovered a high-severity bug. One of our customer reports displayed an incorrect value because some data was being left out of the calculation. I decided to investigate. I ran the same job in test, and while the result was wrong there too, it differed from the production result, given the same inputs. This was puzzling, to say the least! I conferred with the manager, who provided me with the value that should be on the report. She had calculated it manually.
The defect was in an area of the application that we had rewritten about five years earlier. That subsystem is highly complex, so we had spent a lot of time testing it thoroughly. We have many automated regression tests for it at the unit, functional, and GUI levels. However, I was surprised to discover that the particular functionality where the bug occurred had no automated regression tests above the unit level. Although there is a lot of documentation about this subsystem on our team wiki, no one had made any notes there about why we decided not to automate tests for this part of the code. I was puzzled.
After discussing the issue, the programmer tasked with production support did more research. He discovered that the logic deciding which data to include in the calculation was not in the Java code at all, but in a view in our Oracle database. That explained the lack of automated regression tests: the functionality couldn’t be tested without the database, and a test that exercised the database would have been costly to write.
Worse, there were “old” and “new” versions of the view. In production, a synonym with the same name as the old view pointed to the new one. However, the schema we were using in test had only the old version. This explained why the results differed between test and production.
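The mechanics of that mismatch can be reproduced in miniature. The sketch below is a hypothetical illustration using SQLite in place of Oracle (SQLite has no synonyms, so each schema simply gets one view definition under the shared name); the table, view, and column names are invented for the example. It shows how two environments querying the same view name can return different totals when one schema still holds the old definition.

```python
import sqlite3

# "Old" view logic: leaves some rows out of the calculation (hypothetical).
OLD_VIEW = """
    CREATE VIEW report_total AS
    SELECT SUM(amount) AS total
    FROM transactions
    WHERE status = 'POSTED'
"""

# "New" view logic: also includes adjusted rows (hypothetical).
NEW_VIEW = """
    CREATE VIEW report_total AS
    SELECT SUM(amount) AS total
    FROM transactions
    WHERE status IN ('POSTED', 'ADJUSTED')
"""

def build_schema(view_sql):
    """Create an in-memory schema with sample data and one view definition."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE transactions (amount REAL, status TEXT)")
    db.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [(100.0, "POSTED"), (50.0, "POSTED"), (25.0, "ADJUSTED")],
    )
    db.execute(view_sql)
    return db

prod = build_schema(NEW_VIEW)  # in production, the synonym resolves to the new view
test = build_schema(OLD_VIEW)  # the test schema still has only the old view

# The same query against the same name yields different answers per environment.
prod_total = prod.execute("SELECT total FROM report_total").fetchone()[0]
test_total = test.execute("SELECT total FROM report_total").fetchone()[0]

print(prod_total)  # 175.0 -- new logic includes the adjusted row
print(test_total)  # 150.0 -- old logic leaves that data out of the calculation
```

Neither result matched the manager’s hand-calculated value in our case, but the sketch shows why identical inputs produced different outputs: the report’s behavior lived in the schema, not the application code.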
First lesson learned: Make sure your test environment truly is a duplicate of production.