Data Crunching

Part 1

A regular expression (RE) is just a pattern that can match a piece of text. These patterns can express complex rules in a compact way. When a match is found, the pattern remembers which bits of text lined up with which bits of the pattern, so that the programmer can extract substrings of interest.

The bad news is that RE notation is one of the most cryptic notations programmers have ever created (and that says a lot). When mathematicians want to express a new idea, they can just create some new symbols on a whiteboard. Programmers, on the other hand, are restricted to the punctuation on a standard keyboard. As a result, many symbols can have two or three meanings in a RE, depending on context. Worse, those meanings can differ slightly from one programming language to another. Therefore, you may have to read someone else's RE carefully to understand what it does.

For example, here's a RE that matches a variable name, some optional spaces, an opening parenthesis, some text, and a closing parenthesis:

Let's decipher it in pieces:

  • The ^ is called an "anchor". Rather than matching any characters, it matches the start of the line. Similarly, the $ anchor at the end of the RE matches the end of the line.
  • The escape sequence \s matches any single whitespace character, such as a blank, tab, newline, or carriage return. The * following it means "zero or more," so together, \s* matches zero or more whitespace characters. This allows the pattern to handle cases in which the variable name in the input line is indented.
  • \w matches "word" characters (which to programmers means alphanumeric plus underscore). Putting + after it means "one or more" (i.e., the variable name has to be at least one character long). Putting parentheses around the whole sub-expression signals that we want whatever matched this part of the pattern to be recorded for later use.
  • We then have another \s*, to allow spaces between the variable's name and the first parenthesis.
  • The parenthesis itself has to be escaped as \(, since a parenthesis on its own means "Remember whatever matched this part of the pattern." We also have to escape the closing parenthesis as \).
  • Finally, . on its own matches any single character, so (.*) means "match zero or more characters, and remember them."
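Putting the pieces above back together gives one pattern that is consistent with the description. This is a reconstruction in Python, not necessarily the exact RE the article used; the name CALL_PATTERN and the sample input line are assumptions made for illustration.

```python
import re

# Reconstructed from the piece-by-piece description above (an assumption):
#   ^      start of line
#   \s*    optional indentation
#   (\w+)  the variable name, remembered as group 1
#   \s*    optional spaces before the parenthesis
#   \(     a literal opening parenthesis
#   (.*)   any text, remembered as group 2
#   \)     a literal closing parenthesis
#   $      end of line
CALL_PATTERN = re.compile(r'^\s*(\w+)\s*\((.*)\)$')

# Hypothetical input line, chosen only to exercise the pattern.
match = CALL_PATTERN.match("  mouse ('fast', 'chord')")
if match:
    name, text = match.groups()
    print(repr(name))  # the text matched by (\w+)
    print(repr(text))  # the text matched by (.*)
```

Because .* is greedy, the second group runs up to the last closing parenthesis on the line rather than the first.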

Simple, right? OK, it isn't. As I said earlier, the notation can be cryptic. But once you've mastered REs, they can make complex jobs easier. For example, here's the first part of our transform function:


The first line inside the loop tries to match the regular expression against the line of text. If it doesn't match, the program reports an error. If it does, the program grabs whatever text matched the parenthesized groups inside the RE. For example, if the line is:

then var will be assigned "mouse" and params will be assigned "'fast', 'chord'".
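As a rough sketch of the loop just described (not the author's listing), the following Python assumes the reconstructed pattern `^\s*(\w+)\s*\((.*)\)$`. The function signature, the error message format, and the choice to skip bad lines rather than stop are all assumptions.

```python
import re
import sys

# Assumed pattern: name, optional spaces, then text in parentheses.
CALL_PATTERN = re.compile(r'^\s*(\w+)\s*\((.*)\)$')

def transform(lines, errors=sys.stderr):
    """Return a (var, params) pair for each line that matches.

    Lines that do not match are reported to `errors` and skipped.
    """
    results = []
    for number, line in enumerate(lines, start=1):
        # Strip the trailing newline (and any trailing blanks) first.
        match = CALL_PATTERN.match(line.rstrip())
        if not match:
            print(f'line {number}: cannot parse {line!r}', file=errors)
            continue
        # One value per parenthesized group in the pattern.
        var, params = match.groups()
        results.append((var, params))
    return results
```

Calling `transform(open('input.txt'))` on a hypothetical input file would return one pair per well-formed line and write a message to standard error for each of the others.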

To get the individual parameters, we use another pattern that matches the separators--in this case, a comma followed by zero or more spaces, or ,\s*. Adding this to the code above gives us:
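A minimal sketch of the splitting step on its own, assuming Python's re module; the names SEPARATOR and fields, and the sample string, are illustrative rather than taken from the article:

```python
import re

# A comma followed by zero or more whitespace characters.
SEPARATOR = re.compile(r',\s*')

params = "'fast', 'chord'"          # hypothetical parameter text
fields = SEPARATOR.split(params)    # one entry per parameter
print(fields)
```

Note that splitting an empty string yields a list holding one empty string, so a line with no parameters may need to be handled separately.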

Creating XML

Each line of input is independent of the others, so we could create XML simply by printing strings. However, experience has taught me that it's usually a bad idea to treat structured data as strings--sooner or later, the structure is actually needed and the crunching code has to be rewritten.

The standard way to work with XML in a program is to use the Document Object Model (DOM). As defined by the World Wide

User Comments

12 comments

Greg,

Great article, I look forward to part 2! I do have one question, where is Figure 1?

Thanks,

John Leather

March 1, 2006 - 3:19am

About the author

Greg Wilson

Greg Wilson’s book Data Crunching was published by the Pragmatic Bookshelf in April 2005. He received a PhD in computer science from the University of Edinburgh in 1993 and is now a freelance software developer, a contributing editor at Doctor Dobb's Journal, and an adjunct professor in computer science at the University of Toronto.
