Automated source code analysis is an umbrella term that describes a family of software utilities and tools designed to help programmers discover, diagnose, and prevent a broad range of errors and undesirable behaviors in their code. After years in the background, automated code analysis tools are becoming more prominent, more powerful, and more necessary than ever before.
There are two basic types of automated code analysis: static, in which the analyzer examines source code or object code without actually executing it; and dynamic, in which the code and its effects are examined during execution. Some problems are best addressed by static analysis, others by dynamic analysis, and many problems are best solved by a combination of static and dynamic approaches.
I'll begin by examining the evolution of both static and dynamic automated code analysis tools—from their humble but promising beginnings to becoming an indispensable and ubiquitous technology. I will also look at where automated code analysis is today and introduce you to the latest tools. I will then give you a peek at a very exciting future—a new breed of automated code analysis technology that will help developers address and solve their thorniest, most time-consuming problems with unprecedented efficiency and effectiveness.
Basic Code Analysis: From Lint to IDEs
The first tools to perform a primitive form of code analysis were compilers. To compile cleanly, code has to be syntactically correct, so the first common form of analysis was the simple verification that a program fit the grammar of the language. Compilers that operated at this minimalist level were used well into the mid-1980s and were characterized by rather poor diagnostic capabilities. Published reviews in those days rated compilers partly on their ability to prevent pages of spurious error messages. One missed semicolon could so confuse a compiler that it would emit cascading error messages, most of them wrong or misleading. Experienced programmers became adept at recognizing these long error sequences and knowing the fix they needed—often as simple as adding that missing semicolon.
Early compilers had another limitation on their code analysis. They were purpose-built to translate source code into executable binaries, not to pass judgment on your code or help you write cleaner, more robust code. As a result, they checked only for correct syntax and remained silent on constructs that were grammatically correct but deeply suspicious. This design was a function of the times. Hardware systems had so little power that compilations ran very slowly. Spending more time to perform extra analysis with every compile was not a trade-off many vendors were keen to offer or developers to accept.
On-demand, automated checking for common coding errors, however, was definitely something desirable. In 1977, Steve Johnson of Bell Labs introduced a utility called Lint—from the idea that removing errors and coding accidents from a program is like picking lint off clothes. Lint caught errors such as using "=" when you meant "==" or inconsistent function interfaces across code modules. It was widely used in the Unix community, especially as a final sanity check before code check-in or major builds. However, the usefulness of Lint was in those days somewhat limited by the fact that running it was part of a rather lengthy and not very interactive edit-save-analyze cycle. You did not get feedback until you saved your source code, exited the editor (in those days you did not have multiple windows), and ran Lint on it. On top of that, the Lint output was delivered on a command-line console, which looked something like this:
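(Reconstructed from memory; the exact wording varied from one Lint implementation to another, and the file and function names here are invented.)

```
pay.c(12): warning: assignment operator "=" found where "==" may have been intended
pay.c(27): warning: gross_pay() called with inconsistent argument types
```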
To address the warning, you had to manually go back to the editor, load the source file responsible for the warning, and scroll to the line in question to fix the problem. Then you'd have to rerun Lint to see if the fix worked. Imagine doing this for dozens of warnings (which was not uncommon). Nevertheless, Lint proved to be a very valuable tool.
Within a few years, the combination of increased computing power and the advent of integrated development environments (IDEs) made it possible to perform basic source code analysis in real time and highlight warnings and possible errors in the editor—often before the developer's next keystroke. The functionality once provided by Lint is now an integral part of most IDEs. Using a modern IDE, if I type the following code as part of a program:
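(a fragment of my own invention, with the classic mistake in the if condition)

```java
public class StatusCheck {
    // Bug: "=" assigns true to done, so the condition is always true;
    // "==" was intended
    public static boolean isDone(boolean done) {
        if (done = true) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isDone(false)); // prints true despite the false argument
    }
}
```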
the IDE will let me know almost instantly that I should have been using "==" instead of "=" and I can fix the problem on the spot. The time saved by this rapid feedback and level of interactivity may seem trivial, but when you multiply it by the number of times an IDE’s built-in static checker catches a silly mistake, the savings probably add up to a few days per developer per year.
In addition to basic, built-in static analysis, most IDEs today have optional plug-ins to perform more thorough analyses. At the time of this writing, for example, the popular open source Eclipse IDE has twenty-six plug-ins in the category of "source code analyzer" listed on its Web site.
These plug-ins range from enhanced coding rules/style checkers (e.g., Checkstyle, JLint, PMD) to analyzers for detecting and removing duplicate code (e.g., Duplication Management Framework), and they fill some of the holes that the built-in analyzer may have left. For example, many beginning Java programmers at some time or other
mistakenly use "==" to compare string values (i.e., using name == "Elvis" instead of name.equals("Elvis"), thus comparing object references instead of string equality). This is probably the most common error Java programmers make, but somewhat surprisingly, the default configuration of Eclipse does not issue a warning for it, though most of the code rules/style checker plug-ins do. Here's the warning I get from Eclipse after installing the Checkstyle plug-in:
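As best I can reconstruct it (the exact wording varies with the plug-in version):

```
Literal Strings should be compared using equals(), not '=='.
```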
When it comes to statically analyzing code for common programming errors and coding rule or style violations, we have made a lot of progress since the days of Lint. The combination of IDEs with built-in analyzers and third-party tools and plug-ins makes it possible to identify and fix a large number of those errors during development. But in the meantime the world has become a more interconnected, complex, and dangerous place. The software that runs much of our world’s operations today must be hardened against new categories of potential errors and threats. Fortunately, new breeds of code analysis tools are coming to the rescue.
New Era, New Threats: Analyzing Code for Security and Intellectual Property Violations
For the foreseeable future, security is going to be one of the most active target areas for automated code analysis. Prior to the incursion of the Web into enterprises' computing infrastructures, application security was maintained primarily by securing the physical plants and limiting the extent of the networks. This baseline scheme is hard to implement today because the benefits of making all information easily accessible through more widely spread networks are simply too compelling. As a result, most enterprise data is theoretically reachable through a combination of private and public networks, and hackers have become experts at breaking through network security and unhinging programs by feeding them unexpected data.
Two good examples of this type of attack are SQL injection and buffer overruns. In a SQL injection attack, a hacker enters carefully crafted responses to a form so that the database server unwittingly executes commands of the hacker's choosing. Buffer overruns occur when overly long data items are entered to overflow a buffer and overwrite adjacent memory, allowing the hacker to crash the application or take control of it. Poor coding practices cause both of these security gaps.
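To make the first of these concrete, consider a query assembled by string concatenation (a hypothetical fragment of my own; the standard fix in Java is a parameterized query via JDBC's PreparedStatement):

```java
public class QueryBuilder {
    // Vulnerable: splices user input directly into the SQL text
    static String naiveQuery(String name) {
        return "SELECT * FROM users WHERE name = '" + name + "'";
    }

    public static void main(String[] args) {
        // An innocent input behaves as intended...
        System.out.println(naiveQuery("Elvis"));
        // ...but a crafted one rewrites the query's logic entirely:
        // SELECT * FROM users WHERE name = 'x' OR '1'='1'
        System.out.println(naiveQuery("x' OR '1'='1"));
    }
}
```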
Some general-purpose code analyzers can help you find code that poses obvious security risks. The open source analyzer PMD checks for the following risky coding practices:
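Two of them, paraphrased in my own words, flag code that stores a caller-supplied array directly and code that returns a reference to an internal array. Both practices let outside code silently mutate state a class believes is private (the Portfolio class below is a made-up illustration):

```java
public class Portfolio {
    private int[] holdings;

    // Risky practice 1: storing a user-supplied array directly;
    // the caller keeps a reference to our "private" state
    public Portfolio(int[] holdings) {
        this.holdings = holdings;      // safer: this.holdings = holdings.clone();
    }

    // Risky practice 2: returning the internal array itself
    public int[] getHoldings() {
        return holdings;               // safer: return holdings.clone();
    }

    public int first() {
        return holdings[0];
    }

    public static void main(String[] args) {
        int[] data = {100};
        Portfolio p = new Portfolio(data);
        data[0] = -1;                  // mutates p's internals from the outside
        System.out.println(p.first()); // prints -1
    }
}
```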
Rules like these are useful, but they barely scratch the surface of the security vulnerabilities that might be hiding in your code. To dig deeper, you need dedicated security analyzers. Fortunately, in the last few years companies like Fortify Software and Secure Software have brought to market automated code analysis tools specifically designed for detecting insecure code, and it's easy to predict that this category of automated code analysis tools is here to stay.
To ensure that your valuable data is not stolen or compromised, you need to analyze your code for security holes. But you also need to take into account the other side of the coin—making sure that your own code is not guilty of inadvertent copyright or intellectual property violations.
In this era of open source and online code repositories, it's easy to innocently use or copy code that you either have no right to use or whose usage comes with limited rights or obligations. Code in most open source projects, even from crusaders such as the Free Software Foundation, is copyrighted. To use the code, you must comply with the terms of its license. If you inadvertently use such code in your product, you could be forced to open source your product or pay hefty penalties to the copyright holder.
Fortunately, this is another area where automated code analysis can come to the rescue, and already there are products on the market. The protexIP suite from Black Duck Software, for example, utilizes a technology called Code Print and an online knowledge base to automatically recognize open source and other third-party software that has been introduced into a software project's code.
Dynamic Code Analysis: Make Your CPU Work as Hard as You Do
The huge increases in processing power we have witnessed in the past decade have made it possible to perform many types of static analysis in real time and to give developers instant feedback. But processing power has now reached the point where we can go beyond automatic static code analysis and discover entire new categories of potential problems by analyzing the code—and the whole system—during actual execution.
The idea of dynamic code analysis is not new; performance profilers and memory leak detectors, which have been around for quite a while, are examples of tools requiring code execution. But the new types of dynamic code analysis tools—invented to take advantage of the massive increase in computing power available to developers—are exciting. Today there is a surplus of available CPU cycles, and many developers' CPUs are idle most of the time. This is a relatively new situation for programmers.
A few years ago, it was normal for developers to have to wait several minutes for the compiler to do its job. That's why so many programmers knew how to juggle—there was no Internet to browse, and juggling was a good way to kill a few minutes while waiting for the compiler to finish.
Today the situation is often reversed. While I am developing code, the processor is mostly idle until I trigger a build—then for a few seconds my computer buzzes with activity. Before I can even consider taking a mental break, the compilation is finished, the processor goes back to idle, and I am back at work. No wonder my juggling skills aren't what they used to be.
Considering how valuable and expensive developer time is compared to CPU time, it makes sense to come up with CPU-intensive tasks that can help programmers be more effective and efficient with their time.
Activities to Keep Your CPU Very Busy
Dynamic Code Analysis and Testing
Software testing is one of the areas most in need of help and attention, and it is a perfect candidate for leveraging all available CPU power. Practitioners of Extreme Programming (XP) and other agile software development methodologies have rediscovered the benefits and importance of having developers test their own code by writing and executing unit tests before integration. Thanks in part to the growing popularity of these agile methodologies, many developers and development managers now agree that having a suite of unit tests written by developers and executed after each build is one of the best ways to improve overall software quality and reduce the cost of defects.
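A unit test in its simplest form just exercises a piece of code and checks for the expected result. Here is a framework-free sketch (the Account class is made up; in practice most Java teams would write this with JUnit):

```java
public class AccountTest {
    // Stand-in for the code under test
    static class Account {
        private int balance;
        void deposit(int amount) { balance += amount; }
        int getBalance() { return balance; }
    }

    public static void main(String[] args) {
        // Exercise the code, then verify the behavior we expect
        Account account = new Account();
        account.deposit(100);
        account.deposit(50);
        if (account.getBalance() != 150) {
            throw new AssertionError("expected balance of 150");
        }
        System.out.println("all tests passed");
    }
}
```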
Unfortunately, developing a thorough set of tests for a particular piece of code is often more difficult and time consuming than writing the code in the first place. It typically takes 300 to 400 lines of test code to provide 90 to 100 percent code coverage for one hundred lines of code under test. The combinatorial nature of testing makes it a perfect candidate for automation based on dynamic analysis, and lately there has been a lot of activity and development in this area from both academia and commercial companies. See the StickyNotes for a discussion of some specific academic studies and commercial developments in dynamic, analysis-based automation.
Dynamic Code Analysis for Performance and Memory Usage
Performance profiling is a well-established form of manual dynamic code analysis. The developer instruments the code using a performance profiling tool, then drives the code with some tests or examples, and manually examines the performance profile to identify and address bottlenecks. This type of code analysis can yield substantial performance improvements, but doing it manually can be difficult, tedious, and time consuming. Compilers have long optimized code based on static analysis, but some compilers now automatically optimize performance based on dynamic code analysis, saving developers a lot of work. Intel's family of compilers, for example, offers a technology called profile-guided optimization, which uses data collected during a training execution of the application to drive additional optimization rounds that go beyond the reach of static analysis.
One day, dynamic analysis will be able to go much further than this by taking into account and automatically testing a wide range of variables that could have an impact on performance. Imagine a dynamic performance analysis tool that takes a version of the code, along with a representative set of sample inputs or tests, and analyzes it to identify parameters that might affect performance (for example, the buffer size in a file read operation). Then it tweaks those parameters and runs a set of performance experiments to see whether different settings can improve performance for a specific system configuration. I don't know about you, but I'd love a performance analysis tool that could tell me something like this:
"After running 342 tests on representative inputs and system configurations, it was determined that the optimal value for the variable bufSize is 1024 bytes (currently 4096 bytes). Click here to make that change."
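Nothing about this requires magic; the search itself is easy to sketch. Everything below (the class name, the candidate sizes, the single timing run per size) is my own simplification; a real tool would repeat trials across many representative inputs and system configurations:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferTuner {
    // Time a full read of the file using the given buffer size
    static long timeRead(Path file, int bufSize) throws IOException {
        byte[] buf = new byte[bufSize];
        long start = System.nanoTime();
        try (InputStream in = Files.newInputStream(file)) {
            while (in.read(buf) != -1) { /* consume the stream */ }
        }
        return System.nanoTime() - start;
    }

    // Try each candidate size and report the fastest one
    public static int bestBufferSize(Path file, int[] candidates) throws IOException {
        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int size : candidates) {
            long t = timeRead(file, size);
            if (t < bestTime) { bestTime = t; best = size; }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".bin");
        Files.write(tmp, new byte[1 << 20]); // 1 MB stand-in for a representative input
        System.out.println("best buffer size: "
                + bestBufferSize(tmp, new int[] {512, 1024, 4096, 16384}));
        Files.delete(tmp);
    }
}
```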
Taking Advantage of Grid Computing—The Next Obvious Step for Automated Code Analysis
Examples of dynamic code analysis like the last one are not as far-fetched as they sound, especially if automated code analysis tools evolve to take advantage of grid computing. With grid computing, the processing power, storage, and bandwidth of thousands of heterogeneous computing nodes can be combined to give users and applications seamless access to vast IT capabilities and a large number of system configurations. Scientific organizations and government labs have been using grid computing for years, and now companies like IBM and Sun Microsystems are commercializing this technology and providing on-demand computing power. Need to do some deep code analysis for your two million lines of application code but you only have a couple of hours? Sun's Sun Grid will sell you on-demand computing power for $1 per CPU-hour.
I know I am stretching the definition of automated code analysis. Perhaps I should use a broader term like “automated application analysis,” because in modern software products the actual code is just one of a collection of components—configuration files, installers, etc.—that have to work together for the application to function. Whatever the name, I believe that the scope of these tools must continue to evolve to match the needs of developers and take advantage of the available technology.
At the dawn of code analysis, software applications were written in one language, targeted at a specific operating system and, often, a specific hardware configuration. Today’s applications often use several programming languages and must support multiple operating systems and countless hardware configurations. A friend of mine who works for a major enterprise software company calls this "The Matrix of Death." The only way to deal effectively with The Matrix of Death, as well as with new challenges like security and IP violation, is to develop and leverage new technology and automated analysis tools.
Software is more complex than ever; assuring the integrity, functionality, performance, and legality of large code bases across a wide range of system configurations is a daunting challenge. Fortunately, processor performance increases since the days of Lint have enabled huge advances in automated code analysis technology, the evolution of extremely useful tools, and the creation of new categories of analysis. I believe that grid computing will trigger the next and most impressive step in the evolution of automated code analysis. Until then, I encourage you to start exploring the great open source and commercial products that are available today. If you still equate automated code analysis with Lint, you are in for a pleasant surprise. And remember, you should make that CPU work as hard as you do.
Dynamic Analysis-Based Automation Studies and Developments
A team at MIT headed by Michael Ernst, for example, has developed a tool called Daikon that automatically detects potential program invariants while the code is being executed. You can think of an invariant as a property of the code that is always true at some point in the program, such as
account.getBalance() >= 0. Automated discovery of program invariants is a great use of processor cycles, because invariants provide unique insight into the most fundamental properties of the code. Once discovered, invariants have many applications; typically they are integrated with other automated testing components and turned into assertions for runtime checking or automated test generation. Agitar Software, for example, used the idea of invariant detection to develop a new form of automated dynamic code analysis called software agitation. Software agitation combines invariant-detection technology with sophisticated automated test-data generation and run-time instrumentation. When a piece of code goes through the process of agitation, it's first instrumented, then executed a number of times with various inputs in order to achieve maximum code coverage. While the code is being executed, the invariant detector discovers code invariants, which are then presented to the user for evaluation. If an observed behavior matches the specification or expectations (e.g., account.getBalance() >= 0), the developer can promote the observation into a durable test assertion. If an observed behavior is not what's expected or desired (e.g., account.getBalance() throws ArrayIndexOutOfBoundsException), the developer knows that there are problems with the code.
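In miniature, the core loop of invariant detection is simple: run the code under test many times and check whether a candidate property ever fails. The sketch below is my own toy illustration, not how Daikon works internally:

```java
import java.util.Random;
import java.util.function.IntPredicate;
import java.util.function.IntUnaryOperator;

public class InvariantSketch {
    // Toy invariant check: run the code under test on many sampled inputs
    // and see whether a candidate property ever fails
    public static boolean holdsForSamples(IntUnaryOperator codeUnderTest,
                                          IntPredicate property,
                                          int samples) {
        Random rnd = new Random(42); // fixed seed, so runs are repeatable
        for (int i = 0; i < samples; i++) {
            if (!property.test(codeUnderTest.applyAsInt(rnd.nextInt()))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Candidate invariant: the result of Math.abs is never negative
        System.out.println(holdsForSamples(Math::abs, result -> result >= 0, 10_000));
    }
}
```

Note that an invariant that survives sampling can still be false in general (Math.abs(Integer.MIN_VALUE) is negative, for instance), which is exactly why observed invariants are presented to the developer for evaluation rather than accepted blindly.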
Unit tests are a valuable asset, and their value increases with the frequency of execution—the more often you run tests, the sooner you can find and fix problems. I consider frequent, even continuous, running of existing tests another form of dynamic code analysis and another great use of CPU time. A recent controlled experiment at MIT (Continuous Testing of Software During Development, David Saff et al., 2005) has shown that "…continuous testing has a statistically significant effect on developer success in completing a programming task, without affecting time worked. Developers using continuous testing were three times more likely to complete the task before the deadline than those without. Most participants found continuous testing to be useful and believed that it helped them write better code faster, and 90% would recommend the tool to others." The same team has developed a continuous testing plug-in for the Eclipse IDE that delivers some of these benefits to Java developers. It's definitely worth checking out.