What is so great about pyparsing?

19 April 2010

or:

simple .ini parser with pyparsing

The answer to the question above is: readable regular expressions. Code readability is probably one of the most common reasons someone decides to write something in Python. With regular expressions, no matter how good you are at writing them, revisiting, refactoring and modifying them is always a big problem. The reason is simple: regular expressions, however powerful, are quite unreadable. Pyparsing deals with this problem.

Pyparsing is a library written entirely in Python that provides a set of classes and utilities for building readable grammar expressions that parse even extremely complex structured text. You don’t need to know regular expressions. In the time I have used it, I never stumbled upon a regular expression created by this library or one that the library needed me to write, so you can start parsing your text right away. If you’re not familiar with the topic of parsing, you can read a very good introductory article by Paul McGuire here, where he also explains the formal notation for parsing grammars known as Backus-Naur form (BNF).

The same author also wrote this wonderful beginner’s book which includes about 90% of all you’ll need to know on parsing with pyparsing.

There is no better way of presenting this than with an example. Allow me the pleasure of a nonsense Windows .ini configuration file in the manner of (Monty) Python:

[db]
user=eric
pass=idle

[timeout]
ip=127.0.1.1
time = 4

[users]
names = idle, gilliam

Now, the problem is how to get this into a useful Python dictionary. We notice that the configuration file is separated into 3 namespaces (db, timeout, users), and each of them contains one or more definition lines that contain the literal “=”. How does pyparsing work? It works by creating small grammars for each of the elements in the text and later combining and grouping them into one unified grammar, optionally attaching parse actions or result names along the way. Let’s start with a “hello world” example:

from pyparsing import Word, alphas

grammar = Word(alphas)
tokens = grammar.parseString("hello!")
print(tokens)

– result –
['hello']

You can see that the exclamation point did not make it into the resulting tokens, since the grammar expression is just a word made of alphabetic characters.
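Note that parseString() did not complain about the leftover “!”: by default it is satisfied once the grammar has matched the beginning of the input. The sketch below, assuming you want leftover text to count as an error, passes parseAll=True and catches the resulting ParseException. The variable names are only illustrative, and newer pyparsing releases also offer snake_case spellings such as parse_string.

```python
from pyparsing import ParseException, Word, alphas

grammar = Word(alphas)

# Default behaviour: parsing stops quietly at the first character that does not fit.
prefix_tokens = grammar.parseString("hello!").asList()

# With parseAll=True the whole input has to match, so the "!" becomes an error.
try:
    grammar.parseString("hello!", parseAll=True)
    strict_ok = True
except ParseException:
    strict_ok = False

print(prefix_tokens, strict_ok)
```

Whether to use parseAll=True is a judgment call; for a configuration file it is usually the safer choice, since a half-parsed file would otherwise go unnoticed.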

Let’s dive into our problem. Each of the three namespaces has a header in brackets. We will define it as:

word = Word(alphas)
header = Suppress("[")+word.setResultsName("header")+Suppress("]")+LineEnd()

Of course, you’ll have to import all the new names from pyparsing (Suppress, LineEnd and later some others). We defined a word grammar first because we will use it again later. Suppress tells the grammar not to include the matched expression in the results, which prevents the clutter of brackets in the end. One nice thing here is the .setResultsName() method, which lets us refer to a specific token by name in the results.
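To see what the named result buys us, here is a small sketch that parses a single header on its own. The grammar is the one defined above; the input string and the variable names on the left-hand side are just for illustration.

```python
from pyparsing import LineEnd, Suppress, Word, alphas

word = Word(alphas)
header = Suppress("[") + word.setResultsName("header") + Suppress("]") + LineEnd()

result = header.parseString("[timeout]")

section_by_attribute = result.header   # attribute-style access to the named token
section_by_key = result["header"]      # dictionary-style access to the same token
first_token = result[0]                # the brackets were suppressed, so this is the name

print(section_by_attribute, section_by_key, first_token)
```

Both access styles return the same string, so which one to use is mostly a matter of taste.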

We see that in every definition line the two sides are separated by “=”, and on the left side is the definer, which is a simple word. The values on the right side, however, vary, and are one of the following: a word, a list of words, a number, or an IP address. Thus, we have the following grammars:

number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field+"."+ip_field+"."+ip_field+"."+ip_field)

definer = word.setResultsName("definer")
value = Or([number, word, list_of_words, ip]).setResultsName("value")

Here, we can see how to build a list of items separated by something (ZeroOrMore) and how to combine tokens into one (Combine). The Word grammar object has parameters that constrain what it matches, like max in this example, but also exact, bodyChars and min. Also, as the right side of the definition line varies, we use the Or expression builder, which tries every alternative and keeps the longest match; that is why 127.0.1.1 is read as an ip rather than as the number 127. Of course, there is also And().
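The choice of Or matters more than it may seem. As a rough illustration, the sketch below runs the four sample values through Or and then feeds the IP address to MatchFirst, pyparsing's other alternation class, which stops at the first alternative that matches. The result names are left out to keep the output simple, and the names alternatives, longest, first, samples, parsed and truncated are made up for this example.

```python
from pyparsing import (Combine, Group, MatchFirst, Or, Suppress, Word,
                       ZeroOrMore, alphas, nums)

word = Word(alphas)
number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field + "." + ip_field + "." + ip_field + "." + ip_field)

alternatives = [number, word, list_of_words, ip]
longest = Or(alternatives)        # tries every alternative, keeps the longest match
first = MatchFirst(alternatives)  # stops at the first alternative that matches

samples = ["4", "eric", "idle, gilliam", "127.0.1.1"]
parsed = {text: longest.parseString(text).asList() for text in samples}
for text in samples:
    print(text, "->", parsed[text])

# MatchFirst is satisfied by number, so only the leading "127" is consumed.
truncated = first.parseString("127.0.1.1").asList()
print(truncated)
```

With MatchFirst the order of the alternatives would have to be rearranged by hand (ip before number, list_of_words before word); Or spares us that bookkeeping at the cost of trying every alternative.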

Now we move on to finishing our parser. We have all the elementary building blocks we need (header, definer, value), so we can build the more complex ones. This is how:

definition_line = Group(definer+Suppress("=")+value+LineEnd())
namespace = Group(header +
                  OneOrMore(definition_line).setResultsName("definition_lines"))
all = OneOrMore(namespace).setResultsName("namespaces")

Now, what have we done here? Let’s review from top to bottom. The complete grammar, defined as all, consists of one or more (OneOrMore()) namespaces. Each namespace consists of a header and one or more definition lines. And each definition line consists of a definer and a value. We added some Group() clauses, as well as .setResultsName() on the parts of the result we wanted to name, and we are ready to parse! Get on with it!

result = all.parseString(content)

Huh? That’s it?!? Yes. Our result is neatly placed in a nice tree structure that we can traverse using the names we set with .setResultsName(). You can check it with this:

for namespace in result.namespaces:
    print(namespace.header)
    for definition_line in namespace.definition_lines:
        print(definition_line.definer)
        print(definition_line.value)

Of course, I won’t be kind enough to present you with the complete parser we just built here. Fetch the content from a file and do a proper (non-wildcard) import of all the building blocks from pyparsing.
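For readers who want something to compare their own assembly against, here is one possible sketch that goes all the way to the dictionary mentioned at the start. It is not the exact parser from this post, and it makes a few assumptions of its own: the sample text is inlined instead of read from a file, line ends are wrapped in Suppress so that no "\n" tokens end up in the results, the grammar is called ini_grammar rather than all (to avoid shadowing the built-in), and the hypothetical helper to_dict() walks the nested lists from asList() by position instead of using result names.

```python
from pyparsing import (Combine, Group, LineEnd, OneOrMore, Or, Suppress,
                       Word, ZeroOrMore, alphas, nums)

# Sample text from the post, inlined here instead of being read from a file.
content = """\
[db]
user=eric
pass=idle

[timeout]
ip=127.0.1.1
time = 4

[users]
names = idle, gilliam
"""

# Elementary pieces, with the same shapes as in the post.
word = Word(alphas)
number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field + "." + ip_field + "." + ip_field + "." + ip_field)
value = Or([number, word, list_of_words, ip])

# Assumption: line ends are suppressed so "\n" tokens stay out of the results.
end_of_line = Suppress(LineEnd())

header = Suppress("[") + word + Suppress("]") + end_of_line
definition_line = Group(word + Suppress("=") + value + end_of_line)
namespace = Group(header + OneOrMore(definition_line))
ini_grammar = OneOrMore(namespace)  # not called "all", to leave the built-in alone


def to_dict(text):
    """Parse text and fold the nested token lists into {section: {key: value}}."""
    config = {}
    for section in ini_grammar.parseString(text).asList():
        name, lines = section[0], section[1:]
        config[name] = dict(lines)  # each line is a [definer, value] pair
    return config


config = to_dict(content)
print(config)
```

All values stay strings (or lists of strings), so time comes back as "4"; converting numbers would be a natural job for a parse action, which this sketch leaves out.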

What is so great about pyparsing? Well, we don’t have to learn regular expressions. We got our results in a nice data structure. The library is in Python and it is very easy to dive into. But here is the greatest asset: it is readable, so you can go back to the code at any time and modify it with ease!
