Saturday, October 11, 2008

When Reggiex fails and EBNF reigns Part 1

I eat. And when I say I eat, I really mean it. In the span of an hour I can woof down 3 pies, 6 hamburgers, 2 orders of fries and polish it all off with a milkshake. Much of my time when I am not eating is spent thinking about what I am going to eat. My entire day is gone by the time I have decided what I should eat and when to do it (today? tomorrow?); it is all very overwhelming to say the least. To top it all off, this doesn't even take into account my frequent naps!

In order to facilitate my strict regiment of food and napping I needed a logistical way to note and order my plans. I also needed to convert this output so it may be readable by Archie, you know, just in case he needs to know what I will be doing without having to understand my syntax.

todo today {
hour 0-8 nap;
hour 8-9 eat bacon;
hour 9-10 nap;
}

As you can see this is a simple configuration in practice, yet if we were to use regular expressions, the job of parsing could become complex and cumbersome. Moreover if we were to add new options later on, the problem becomes worse. It all comes down to maintainability in these situations; regex just doesn't fit the bill.

Don't get me wrong, regular expressions have a place in the programming world: it's easy to learn, and easy to master. I like to relate regular expressions to my semi-friend Reggie. While he is sometimes useful as a friend (he always has a dollar to spare when I'm down on my luck), he can be a downright jerk when it comes to being nice.

So in these types of situations I like to use "Extended Backus Naur Form" or EBNF for short. EBNF is actually a language of its own that allows you to logically describe sets of input, the output of which is a tree of data. These trees are much simpler to manage and parse by your application if done right.

The syntax of EBNF is a bit daunting to understand at first, but it becomes easier the more you do it. Many of my pals (including Dilton the scientific genius) have run away screaming when confronted with EBNF. I'm going to be blunt: it can be hard, and it will make you think in ways that you are not used to thinking. Don't worry too much though; I think I am pretty darn good at explaining things.

A word to the wise: I am going to stray away from the normal technical document style writing here. I am going to use words and phrases that will not scare away the amateur programmer. Reading a simplistic document is far better than reading a not-so-simplistic one . You know the kind, the kind where you ineluctably reference the internet for the words you don't understand, yuck.

So lets take our first mouth-watering tour of using EBNF by going over the basics. First we need to understand the general concepts. Everything defined can be said to be a key, and a value. Every key/value combination is conjoined by the ":=" operator. For example:

key := value

The key can be referenced in other places, while the value can be either a pointer to a key, or simple text. Here is a good base reference:

food := "french fries"

drink := "milkshake"

Here the key food has a value of "french fries", while the key drink has the value "milkshake". Pretty easy so far, right? Let us venture a little further down the rabbit hole.

food          := "french fries"
drink := "milkshake"
what_to_order := food / drink

We have implemented a new operator in the what_to_order value. The "/" in EBNF means "OR", so in this case what_to_order can either be food OR drink. Or more specifically what_to_order can actually be "friend fries" or "milkshake". I hope you're still with me thus far.

Now the above code is flawed because I hardly ever have food OR a drink, I always have food AND a drink. You can't have one without the other so we have to add a grouping and a postfix modifier to make this right.
food          := "french fries"
drink := "milkshake"
what_to_order := (food / drink)+

The ()'s represent a grouping of 2 or more things. In this case the group is still either food OR drink but not both. The thing here to notice is the '+' at the end of the grouping. What this means is to match this group once or an infinite amount of times, effectively turning the OR into an AND or OR. Think of this way, you can either have food OR a drink an infinite amount of times. With me still? Good!

Before we get to the actual real-world parsing, save this code. It's written in the Python programming language. Big Moose (not the brightest guy I know) uses Python because it's easy. I am using Python because it's quick. Use the code below to help visualize the following examples. It is not necessary to understand this code yet, as this post is just a primer on EBNF. In Part 2 I will be discussing more advanced usage where being able to understand and write Python is essential. For now just make sure your configuration resides in one file, and your EBNF in another.
#!/usr/bin/python
import sys
from simpleparse import generator
from simpleparse.parser import Parser
from simpleparse.stt.TextTools import TextTools

input = None

def traverse_tree(tree, called=0):
if not tree:
return

called += 1

for tag, beg, end, parts in tree:
if not parts:
print " "*called, "%s: %s" % (tag, input[beg:end])
else:
print " "*called, tag

traverse_tree(parts, called=called)

if __name__ == '__main__':
try:
conf_file = sys.argv[1]
ebnf_file = sys.argv[2]
except:
print "Usage is %s " % sys.argv[0]
sys.exit(1)

decl = open(ebnf_file).read()
input = open(conf_file).read()

parser = generator.buildParser(decl).parserbyname('root')
taglist = TextTools.tag(input, parser)

traverse_tree(taglist[1])


Let's return to our original configuration. The first thing we must do is break it up into smaller parts.

Let's see...our configuration always starts with the word "todo" with a name directly after it. A body is defined within an opening and closing brace. This body contains several smaller configurations all starting with the string "hour" and a range of numbers. This is immediately preceded with either "nap" or "eat". A nap is a nap, simple as that (hah!), but when the "eat" keyword is found, the name of a food follows. Both of these statements are terminated by a ";" character.

So let's start out with some commonly-used primitives defined in our EBNF file.
<ws>      := [ \t\n]+
word := [a-zA-Z0-9_:./-]+
digits := [0-9]+
range := digits, "-", digits
  • <ws> means "whitespace" which is defined as either a space or a newline for one or an infinite amount. It appears between <>, which tells your EBNF parser not to display any part of the values in the tree. We don't really need to see whitespaces displayed.
  • word is any character listed within the []s an infinite amount of times.
  • digits is the numbers zero through 9 an infinite amount of times.
  • range is using the key digits twice joined by the character "-" as a value.

Now on to the meat of the EBNF. As we have defined our primitives we can start working on the actual configuration. I prefer to work from the ground up, or from the inside out. Unlike a chicken leg (preferably deep-fried) which I eat from the outside in. Let's first take the easy stuff out of the body (The stuff between "todo today {}").
eat       := "eat", ws, word
nap := "nap"

The eat key will match a set of tokens that match the string value "eat", while the nap key simply matches on the string "nap".

Now that we have those two things laid out, we can work on the prefix to them. In this case it would be "hour x-y" (where x and y are numbers), ending with a ";".
hour      := "hour", ws, range, ws, (eat / nap), ";"

In this statement we look for the word "hour", a whitespace, a range (which is defined in our primitives), another whitespace, and either eat OR nap ending with the character ";".

For the final piece of this puzzle we need to get the wrapping todo and name for the elements listed above.

todo      := "todo", ws, word, ws, ("{", ws, (hour, ws)*, "}"), ws

We first start out with the string "todo", a whitespace, a word (the name of the todo list), then a group. This group contains the character "{", a whitespace, then another group that contains the hour tag with a whitespace an infinite amount of times. This is followed with the "}" character and another whitespace.

We always need a starting point in an EBNF definition; this is most commonly defined as the "root" tag. But in reality it can be whatever name you want. For the sake of simplicity, we will just use "root":
root      := todo

So lets look at the entire EBNF syntax for the configuration.
<ws>      := [ \t\n]+
word := [a-zA-Z0-9_:./-]+
digits := [0-9]+
range := digits, "-", digits

root := todo
todo := "todo", ws, word, ws, ("{", ws, (hour, ws)*, "}"), ws

hour := "hour", ws, range, ws, (eat / nap), ";"
eat := "eat", ws, word
nap := "nap"

You're probably asking yourself right about now, "Why did we do all this breaking up of things?". Good question, to which there is a very good answer. Remember that we are trying to create a simple tree of data that is easy to parse by an application. Each part defined will be directly linked below the last part. To explain more simply, look at the hour key. Once in tree form it looks like this:
hour
range
eat
nap

The same goes for the eat key:
eat
word

Since the hour key calls the eat key, when combined the two it looks like this:
hour
range
eat
word

Now examine the todo key, the structure of which would be something to the effect of:
todo
word
hour

Since our todo key calls hour, which then calls eat, the combination of all three elements creates a logical top-down tree that looks like the following:

todo
word
hour
range
eat
word

Let's take a quick break to reflect on what we just saw. I suggest that if you are lost right now you review the previous examples and reference the EBNF until it makes sense. It can be a little odd-looking at first, even at this simplistic level. Feel free to use the Python code I posted, muck around with the EBNF, and see how the tree changes. Tinkering is always the best way to learn.

Once our entire configuration is read in by the parser we can see a logistical top-down tree like the following. Start thinking about how easy it would be to actually parse this if each key and value were to be given in a for loop.

todo
word: today
hour
range
digits: 0
digits: 8
nap: nap
hour
range
digits: 8
digits: 9
eat
word: bacon
hour
range
digits: 9
digits: 10
nap: nap

The hard part is done, but still not as logically tagged as it could be. See how the name of the "todo" list is prefixed by "word"? I would much prefer "word" to say "name" instead. This requires a slight modification to our original EBNF.

First we create an ignored key that references the word key. I call this a silent word (or s_word). This will make the key invisible within the tree.

<s_word> := word

Then create a name key that references the s_word key:

name := s_word

I change my todo definition to use the name key instead of the word primitive.

todo := "todo", ws, name, ws, ("{", ws, (hour, ws)*, "}"), ws

So the resulting EBNF is:

<ws> := [ \t\n]+
word := [a-zA-Z0-9_:./-]+
digits := [0-9]+
range := digits, "-", digits

root := todo
todo := "todo", ws, name, ws, ("{", ws, (hour, ws)*, "}"), ws
name := s_word
<s_word> := word

hour := "hour", ws, range, ws, (eat / nap), ";"
eat := "eat", ws, word
nap := "nap"

And once parsed, the tree looks like this:

todo
name: today
hour
range
digits: 0
digits: 8
nap: nap
hour
range
digits: 8
digits: 9
eat
word: bacon
hour
range
digits: 9
digits: 10
nap: nap

Looking good, but still a little messy when looking at the eat key. In the prefered output, seeing the "food" (bacon) pushed up directly under the eat key would be optimal. To facilitate this we must use a new key operator, which is best explained when viewed.

The eat key is changed to the following, and a new key is created to_eat with some added voodoo.

eat := s_word
>to_eat< := "eat", ws, eat

The final change is to the hour tag that calls to_eat instead of eat.

hour := "hour", ws, range, ws, (to_eat / nap), ";"

Now to explain the use of >to_eat< instead of just plain old to_eat. The >< tells the parser to make the value a direct descendant of the prior key. Another way to put it: anything calling this key will have a value falling under the original calling key (hour in this case). To visualize this better, our tree now looks like this:

todo
name: today
hour
range
digits: 0
digits: 8
nap: nap
hour
range
digits: 8
digits: 9
eat: bacon
hour
range
digits: 9
digits: 10
nap: nap

This looks better, but is still not perfect in my opinion. That hour->range->digit thing still bugs me. I want to somehow get the start hour and the end hour directly under the hour definition, ridding myself of the range key. In the end this will make it a little less confusing and a bit easier to parse. Now we use both of the techniques described above to make this happen.

First we create a silent version of the digits key:

<s_digits> := digits

Then let's create the keys start and end which will designate a start hour and an end hour. These keys will effectively remove the need for the range identifier, so remove it.

start := s_digits
end := s_digits

Create an expanded version of a start and end tag joined by a "-" to fall under the previous tag hour.

>hour_range< := start, "-", end

Create an expanded key called hours calling hour_range which will fall under the previous key hour also.

>hours< := hour_range

Lastly we modify our hour values and replace the old range with the new hours key.

hour := "hour", ws, hours, ws, (to_eat / nap), ";"

So now our final EBNF should look something like this:

<ws> := [ \t\n]+
word := [a-zA-Z0-9_:./-]+
digits := [0-9]+

root := todo
todo := "todo", ws, name, ws, ("{", ws, (hour, ws)*, "}"), ws
name := s_word
<s_word> := word

hour := "hour", ws, hours, ws, (to_eat / nap), ";"

<s_digits> := digits
start := s_digits
end := s_digits
>hour_range< := start, "-", end
>hours< := hour_range
>to_eat< := "eat", ws, eat
eat := s_word
nap := "nap"

And our final parsed tree should look something like this:

todo
name: today
hour
start: 0
end: 8
nap: nap
hour
start: 8
end: 9
eat: bacon
hour
start: 9
end: 10
nap: nap
This document at first was long and drawn out, so I decided upon breaking it up into two smaller parts. In the next installment of this two-part series we will extend our Python code to include an engine that traverses this tree to create a structure that is easily manipulated and morphed into any output. We will also add all types of configuration options to optimize our time spent writing about what we should eat and when to nap.

According to my todo list, it's just about time to take a nap, so I take my leave. I hope you have enjoyed this article and leave yearning for more like a hot plate of spaghetti & meatballs.

On a side note: Arch, please stop chasing after Veronica, she is a stuck-up, spoiled girl. You are much better off with Betty since she at least cooks good food.

No comments:

Followers