Data Crunching Tips and Techniques


Simple, right? OK, it isn't. As I said earlier, the notation can be cryptic. But once you've mastered REs, they can make complex jobs easier. For example, here's the first part of our transform function:


The first line inside the loop tries to match the regular expression against the line of text. If it doesn't match, the program reports an error. If it does, the program grabs whatever text matched the parenthesized groups inside the RE. For example, if the line is:

then var will be assigned "mouse," and params will be assigned "'fast', chord'."
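
The match-and-extract step described above can be sketched as follows. The original listing and sample input line are not reproduced here, so the line format (a name, a colon, then comma-separated parameters), the pattern, and the names LINE_PATTERN and transform's signature are all assumptions made for illustration, not the RE the article actually uses.

```python
import re

# ASSUMPTION: each input line looks like "name: param, param".
# This pattern is a stand-in for the article's RE, which is not shown here.
LINE_PATTERN = re.compile(r'^\s*(\w+)\s*:\s*(.*?)\s*$')

def transform(lines):
    """Return a list of (var, params) pairs; params is still one raw string."""
    result = []
    for line in lines:
        # Try to match the RE against the line of text.
        m = LINE_PATTERN.match(line)
        if not m:
            # One way to report the error; the article may do it differently.
            raise ValueError('badly formatted line: %r' % line)
        # Grab the text matched by the two parenthesized groups.
        var, params = m.group(1), m.group(2)
        result.append((var, params))
    return result
```

Under that assumed format, calling transform(["mouse: fast, chord"]) yields one pair whose first item is the variable name and whose second is the unsplit parameter text.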

To get the individual parameters, we use another pattern that matches the separators--in this case, a comma followed by zero or more spaces, or ,\s*. Adding this to the code above gives us:
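
As a minimal sketch of that splitting step on its own: the separator pattern ,\s* comes from the text, while the helper name split_params and the handling of an empty parameter string are my assumptions.

```python
import re

# The separator from the text: a comma followed by zero or more whitespace characters.
SEPARATOR = re.compile(r',\s*')

def split_params(params):
    """Split a raw parameter string into a list of individual parameters."""
    # Splitting an empty string would give [''], so treat it as "no parameters".
    if not params:
        return []
    return SEPARATOR.split(params)
```

For instance, split_params("fast, chord") returns a two-item list, and a variable with no parameters gets an empty list rather than a list holding one empty string.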

Creating XML

Each line of input is independent of the others, so we could create XML simply by printing strings. However, experience has taught me that it's usually a bad idea to treat structured data as strings--sooner or later, the structure is actually needed and the crunching code has to be rewritten.

The standard way to work with XML in a program is to use the Document Object Model (DOM). As defined by the World Wide Web Consortium, DOM is a cross-language set of objects that represent elements, attributes, text, processing instructions, and all the other weird and wonderful things that can appear in XML. For example, the XML document:


corresponds to the object tree shown in Figure 1. Note that:

  • The root of the tree must be a Document object, whose single child is the root element of the document.
  • All of the text--including the whitespace between elements--is stored.
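
Both points are easy to check with minidom. The sketch below parses a small document of my own (the element names are hypothetical, not the article's example) and collects what the tree contains.

```python
from xml.dom import minidom, Node

# Hypothetical document, used only to inspect the resulting tree.
SAMPLE = '<settings>\n  <mouse>fast</mouse>\n</settings>'

def describe(xml_text):
    """Parse xml_text and summarize the top of its DOM tree."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    return {
        # The object returned by the parser is a Document node...
        'is_document': doc.nodeType == Node.DOCUMENT_NODE,
        # ...whose only child here is the root element.
        'document_children': len(doc.childNodes),
        'root_name': root.tagName,
        # Whitespace between elements shows up as text nodes.
        'root_child_types': [child.nodeType for child in root.childNodes],
        'first_text': root.firstChild.data,
    }

summary = describe(SAMPLE)
```

The root element ends up with three children rather than one: a text node holding the newline and indentation, the mouse element, and another text node holding the final newline.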

There are several DOM implementations in Python, such as minidom (which is part of the standard library), as well as packages like Fredrik Lundh's ElementTree that have similar features but more Pythonic interfaces. There are also special-purpose tools, like XSLT, which are custom-built for working with XML. In practice, though, I've usually found these special-purpose tools to be more trouble than they are worth, especially since most don't include features like regular expressions and database libraries that my crunching programs need.

For our purposes, minidom will do fine. What we have is a list, each of whose elements is a variable name and a (possibly empty) list of parameters. What we want is some XML. Let's start by creating the document and its root settings element:
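
One way this might look with minidom is sketched below. The root element name settings comes from the text; the function name build_document and the child element names variable and param are hypothetical, since the article's own listing is not reproduced here.

```python
from xml.dom import minidom

def build_document(settings):
    """Build a DOM tree from a list of (var, params) pairs.

    params is a (possibly empty) list of strings.
    """
    impl = minidom.getDOMImplementation()
    # Create the document together with its root <settings> element.
    doc = impl.createDocument(None, 'settings', None)
    root = doc.documentElement
    for var, params in settings:
        # ASSUMPTION: <variable name="..."> and <param> are illustrative names.
        var_node = doc.createElement('variable')
        var_node.setAttribute('name', var)
        for value in params:
            param_node = doc.createElement('param')
            param_node.appendChild(doc.createTextNode(value))
            var_node.appendChild(param_node)
        root.appendChild(var_node)
    return doc
```

Calling build_document([('mouse', ['fast', 'chord'])]).toxml() then produces the XML as a string. Because the tree is built from nodes rather than printed strings, later changes to the structure only touch this one function.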



AgileConnection is a TechWell community.
