Data Crunching, Part 2

[article]
Summary:

In part one of "Data Crunching," Greg Wilson taught us how to translate legacy data into XML. In the second half, he explains how to merge new data into an existing database. Developers will always face these types of data crunching problems, and knowing the standard data crunching tools can save you a lot of time. Greg also shares the basic knowledge about relational databases that every developer should possess.

An hour and a half ago, the boss announced that we need to write a tool to convert old, flat-text configuration files for the company's major product into the new XML format used in the next release. The tool also needs to be able to merge parameters from a relational database, since some genius (no longer with the company) decided that an earlier version of the product would keep its configuration there.

This job is an example of the kind of data crunching that programmers have to do over and over again. The first article in this series explained the use of regular expressions and DOM to tackle converting data. In this article, I introduce the absolute minimum about relational databases that every developer should know, and discuss how to use it to solve the rest of the data-crunching problem.

Merging with Database Information
It's now 10:30 a.m., and the first half of the job is done; we can convert old-style text files into the new XML format. Time to worry about the second half-merging information from the database. When the boss issued the assignment, we didn't even know there was one, but it turns out that an earlier version of the product allowed users to store shared parameters in a relational database, which was dropped in later releases. If we're going to upgrade, we might as well do everything at once and right the first time.

Crunching databases is a bit different from crunching text files and XML because someone else-in this case, the database manager-owns the data you want. In order to get it out, we have to:

  • Connect to the database (which often requires authenticating)
  • Construct an SQL query
  • Read the query's results

For this job, the values we need are all in a single database table called

Settings, which has three columns (Figure 1):

  • Path, which was the absolute path to the user's parameter file
  • Var, which was the name of the variable
  • Param1, Param2, and Param3, which held either parameter values (as strings) or NULL.

This is a lousy design for two reasons. First, if a user decides to move her parameter file to a new location or rename it, the database would become confused. Storing the user's ID would have been much better than storing the
path, but still far from perfect. Second, this schema only allows a variable to have three parameter values, but one (the intervals
variable shown in the previous article) has four. Even if that weren't the case, padding tables with enough columns to handle the worst case always leads to code that is complex and fragile.

The right way to handle this would have been to use two tables. The first would record the user identifier (either the path to the configuration file or the user ID), the variable name, and the number of parameters. The second would record the user identifier, the variable, the index of the parameter, and the parameter's value (Figure 2).

But we have to work with what we're given, so let's modify the crunching script so that the database name is the first parameter, before all the input files. Then we'll add a new function, merge, to combine values from the database into the XML document we've already built:

Now to write the merge routine. This has two parts--one to grab values from the database, and another to put them into the XML document. The database part looks like this:
 

The first line of this function connects to the database. In Python, as in other

About the author

Greg Wilson's picture Greg Wilson

Greg Wilson’s book Data Crunching was published by the Pragmatic Bookshelf in April 2005. He received a PhD in computer science from the University of Edinburgh in 1993 and is now a freelance software developer, a contributing editor at Doctor Dobb's Journal, and an adjunct professor in computer science at the University of Toronto.

AgileConnection is one of the growing communities of the TechWell network.

Featuring fresh, insightful stories, TechWell.com is the place to go for what is happening in software development and delivery.  Join the conversation now!

Upcoming Events

Sep 24
Oct 12
Oct 15
Nov 09