Excel does a great job as a spreadsheet, but when you try to push it into service as a database you may be in for some rude surprises.
MILLIONS OF PEOPLE ALL OVER THE world use Excel to wrangle data, including people at the National Institutes of Health (NIH), the National Cancer Institute (NCI), and me. In the course of a single day, I discovered some reasons to be cautious choosing Excel as your data wrangling tool. Once, Excel displayed my data but forgot to tell me that it doesn't know how to save it; then, Excel tried to help me by putting invisible data of its own among mine; finally, Excel thought that some researchers wanted a date time value or a floating-point number when they really wanted plain text.
Let's start with my own personal experience. I was trying to generate some Unicode test data in July 2004 by cutting and pasting a Japanese string into Excel. The paste worked fine, and my Japanese string looked perfectly normal as I saved my Excel document full of test data. But when I reopened the document, the Japanese string had turned into a string of question marks, like this: “??????”.
Later that same day, I was writing scripts to generate reports from a large requirements document. The document had begun life as an export into Excel from another application. I wanted to use Perl's DBI modules, which allow me to treat delimited files as SQL databases, but Excel has two "features" that made parsing certain cells extraordinarily difficult: it allows newline characters within a single cell—then it adds double-quote characters to manage those embedded newlines.
On a day when I tripped over Excel twice, other people had even bigger problems involving entirely different Excel features. According to an article published on biomedcentral.com, researchers from a number of health organizations including the NIH and NCI had discovered that Excel had manipulated and corrupted critical reference bioinformatics data related to cancer research. Particularly affected were thirty gene names that look like datetime values and two thousand clone gene identifiers that look like floating-point numbers. The article says, "These conversions are irreversible; the original gene names cannot be recovered."
Corrupting medically critical data is, of course, far more important than corrupting my Unicode test data or foiling my Perl modules, but it was still a bad day for Excel.
Do What I Mean, Not What I Say
Let's look more closely at these errors and see why they happened.
Automatic data conversion without notifying the user
Genes, including tumor suppressor genes, can be named DEC1 or DEC-1, SEP8 or SEP-8 or SEPT8. By default, Excel thinks that these are dates and will convert them irretrievably to "1-Dec" or “8-Sep”. As the article points out, Excel keeps date information in an internal format unrelated to the actual text that it displays to the user. For instance, 1-Apr is represented within Excel as "35885".
So, if you have three genes like so:
Excel will turn those into datetimes and store them identically:
Three different genes are now represented by three instances of the same datetime value. Returning the datetime value to the gene value is impossible, and because the original values describe tumor suppressor genes, guessing is not acceptable.
Other pieces of bioinformatics data, "clone identifiers," may be specified as strings such as 2310009E13. Excel thinks these are floating-point numbers and may convert them irretrievably to 2.31E+13. Unfortunately for the bioinformatics database, there is no way to recover the right-most digits when Excel converts this string to a number.
The authors of the article published Excel workarounds and scripts to identify the problems. But, as I had already discovered, bioinformatics is not the only area where an