Data Crunching

Part 1

It's 9:00 on a Monday morning. You're sitting at your desk savoring that precious first cup of coffee and looking forward to finally finishing that rendering routine when your boss knocks on your door. She says, "I have a little job for you." It seems the product manager was wrong: customers do want to convert their old flat-text input files into XML. Oh, and the three people who actually bought Version 6.1 of the product want to merge parameters from the database as well. Now you've got to take care of it--by the end of the day.

Little data crunching jobs like this come up every day in our business. They aren't glamorous, but knowing how to do them with the least amount of effort can be crucial to a project's success or failure.

Fifteen years ago, most data crunching problems could be handled using classic Unix command line tools, which are designed to process streams of text one line at a time. Today, however, data is more often marked up in some dialect of XML or stored in a relational database. The bad news is that grep, cut, and sed can't handle such data directly. The good news is that newer tools can, and the same data crunching techniques that worked in 1975 can be applied today.

This article looks at what those tools and techniques are, and how they can make you more productive. We start with a simple problem: how to parse a text file.

Extracting Data from Text

The first step in solving any data-crunching problem is to get a fresh cup of coffee. The second is to figure out what your input looks like and what you're supposed to produce from it. In this case, the input consists of parameter files with a .par extension, each of which looks like this:

Each line is a single setting. Its name is at the start of the line and its value or values are inside parentheses (separated by commas if necessary).

The output should look like this:

Most data-crunching problems can be broken down into three steps: reading the input data, transforming it, and writing the results. Running wc *.par tells us that the largest input file we have to deal with is only 217 lines long, so the easiest thing to do is read each one into an array of strings for further processing. We'll then parse those lines, transform the data into XML, and write that XML to the output file. In Python, this is:
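A minimal sketch of that three-step structure might look like the following. The function names, the name(value, value) line layout inferred from the description above, and the XML element names (settings, setting, value) are all assumptions made for illustration; they are not the article's own listing or its actual output format.

```python
import sys
from xml.sax.saxutils import escape, quoteattr


def read_lines(path):
    """Step 1: read the whole file into a list of strings, one per line."""
    with open(path) as reader:
        return reader.read().splitlines()


def parse_line(line):
    """Split 'name(v1, v2)' into ('name', ['v1', 'v2']) (assumed layout)."""
    name, _, rest = line.partition("(")
    values = rest.rsplit(")", 1)[0].split(",")
    return name.strip(), [v.strip() for v in values]


def transform(lines):
    """Step 2: turn setting lines into lines of XML (hypothetical element names)."""
    output = ["<settings>"]
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, values = parse_line(line)
        output.append("  <setting name=%s>" % quoteattr(name))
        for value in values:
            output.append("    <value>%s</value>" % escape(value))
        output.append("  </setting>")
    output.append("</settings>")
    return output


def write_lines(path, lines):
    """Step 3: write the result, one string per line."""
    with open(path, "w") as writer:
        writer.write("\n".join(lines) + "\n")


def main(in_path, out_path):
    write_lines(out_path, transform(read_lines(in_path)))


# Only run as a script when both file names are supplied.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

Because each step is its own function, the parsing in parse_line can later be swapped for something more robust without touching the reading or writing code.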

Separating input, processing, and output like this has two benefits: it makes debugging easier and allows us to reuse the input and output code in other situations. In this case the input and output are simple enough that we're not likely to recycle them elsewhere, but it's still a good idea to train yourself to write data crunchers this way. If nothing else, it'll make it easier for the next person to read.

All right, let's begin by separating the variable name from its parameters, then separate the parameters from each other. Hmm . . . can there ever be spaces between the variable name and the start of the parameter list? grep can tell us:

Another quick check shows that while parameter values are usually separated by a comma and a space, sometimes there's only a comma.
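Both checks can also be scripted. The sketch below is a Python stand-in for this kind of quick survey, not the grep commands themselves; it assumes the name(values) layout described earlier, and the function name survey and its counter names are hypothetical.

```python
def survey(lines):
    """Count how many setting lines show each layout variation."""
    counts = {"space_before_paren": 0, "comma_without_space": 0}
    for line in lines:
        name, paren, rest = line.partition("(")
        if not paren:
            continue  # no opening parenthesis, so not a setting line
        # Is there whitespace between the name and the opening parenthesis?
        if name != name.rstrip():
            counts["space_before_paren"] += 1
        # Does any comma have a non-space character directly after it?
        params = rest.rsplit(")", 1)[0]
        pieces = params.split(",")[1:]
        if any(p and not p[0].isspace() for p in pieces):
            counts["comma_without_space"] += 1
    return counts
```

Calling survey(open(path).read().splitlines()) on each .par file gives a rough count of how often each variation occurs, which tells us what the parser has to tolerate.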

This sounds like a job for regular expressions, which are the power tools of text processing. Most modern programming languages have a regular expression (RE) library. A few, like Perl and Ruby, have even made it part of the language. A RE

User Comments

Greg,

Great article, I look forward to part 2! I do have one question, where is Figure 1?

Thanks,

John Leather

March 1, 2006 - 3:19am

About the author

Greg Wilson

Greg Wilson’s book Data Crunching was published by the Pragmatic Bookshelf in April 2005. He received a PhD in computer science from the University of Edinburgh in 1993 and is now a freelance software developer, a contributing editor at Dr. Dobb's Journal, and an adjunct professor in computer science at the University of Toronto.
