What is so great about pyparsing?


simple .ini parser with pyparsing

The answer to the question above is: readable regular expressions. Code readability is probably one of the most common reasons why someone decides to write something in Python. When it comes to regular expressions, no matter how good you are at writing them, revisiting, refactoring and modifying them is always a big problem. The reason is simple: regular expressions, no matter how powerful, are quite unreadable. Pyparsing deals with this problem.

Pyparsing is a library written entirely in Python that provides a set of classes and utilities for building readable grammar expressions for parsing extremely complex structured text. You don’t need to know regular expressions: in all the time I used it, I never stumbled upon a regular expression created by the library or one that the library needed me to write. You can start parsing your text right away. If you’re not familiar with the topic of parsing, you can read a very good introductory article by Paul McGuire here, where he also explains the formal notation for describing grammars known as Backus-Naur form (BNF).

The same author also wrote this wonderful beginner’s book which includes about 90% of all you’ll need to know on parsing with pyparsing.

There is no better way of presenting this one than with an example. Allow me the pleasure of using a nonsense Windows .ini configuration file in the manner of (Monty) Python:


time = 4

names = idle, gilliam

Now, the problem is how to fetch this into a useful Python dictionary. We notice that the configuration file is separated into 3 namespaces (db, timeout, users), and each of them contains one or more definition lines that contain the literal “=”. How does pyparsing work? It works by creating separate grammars for each of the elements in the text and later combining and grouping them into one unified grammar, optionally defining specific parse actions or setting results names along the way. Let’s go on with a “hello world” example:

from pyparsing import Word, alphas

grammar = Word(alphas)
tokens = grammar.parseString("hello!")
print(tokens)

– result –

['hello']

You can see that the exclamation point did not make it into the resulting tokens, since the grammar expression is just a word made of alphabetic characters.
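If you would rather have the stray “!” reported than silently skipped, one option is the parseAll argument of parseString. The snippet below is only a small sketch of that behaviour; the strict_ok flag is a name I made up for the demonstration.

```python
from pyparsing import ParseException, Word, alphas

grammar = Word(alphas)

# By default, parsing stops quietly at the first character that does not match.
tokens = grammar.parseString("hello!")
print(tokens.asList())  # ['hello']

# With parseAll=True the whole input has to match, so the "!" becomes an error.
try:
    grammar.parseString("hello!", parseAll=True)
    strict_ok = True
except ParseException as err:
    strict_ok = False
    print("strict parse failed:", err)
```

The default is handy for picking a known prefix out of a longer string, while the strict form is usually the safer choice when parsing a whole file.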

Let’s dive into our problem. Each of the three namespaces has a header in brackets. We will define it as:

word = Word(alphas)
header = Suppress("[")+word.setResultsName("header")+Suppress("]")+LineEnd()

Of course, you’ll have to import all the new names from pyparsing (Suppress, LineEnd and later some others). We defined the word grammar first because we will use it again later. Suppress tells the grammar not to include the matched expression in the results, which keeps the brackets from cluttering the output. Another nice thing is the .setResultsName() method, which lets you refer to a specific part of the resulting tokens by name.

We see that all the definition lines are separated by “=”, and on the left side is the definer, which is a simple word. The values on the right side, however, vary, and are one of the following: word, list of words, number, ip. Thus, we have the following grammars:
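To see what Suppress and .setResultsName() do in practice, here is the header grammar tried on its own. The input line "[users]" is just a sample I picked for illustration.

```python
from pyparsing import LineEnd, Suppress, Word, alphas

word = Word(alphas)
header = Suppress("[") + word.setResultsName("header") + Suppress("]") + LineEnd()

result = header.parseString("[users]\n")

# The named token is reachable as an attribute of the result.
print(result.header)  # users

# The brackets were matched but are absent from the token list.
print(result.asList())
```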

number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field+"."+ip_field+"."+ip_field+"."+ip_field)

definer = word.setResultsName("definer")
value = Or([number, word, list_of_words, ip]).setResultsName("value")

Here, we can see how to build a list of items separated by a delimiter (ZeroOrMore) and how to combine several tokens into one (Combine). The Word grammar object has parameters that constrain its definition, like max in this example, but also exact, bodyChars and min. Also, since the right side of the definition line varies, we use the Or expression builder. Of course, there is also And().

Now we move on to finalizing our parser. We have all the elementary building blocks we need (header, definer, value), so we can build the more complex ones. This is how:
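A quick way to get a feel for these value grammars is to run each kind of value through them. This is a rough sketch: the sample strings and the samples and parsed names are my own, chosen only to exercise each alternative.

```python
from pyparsing import Combine, Group, Or, Suppress, Word, ZeroOrMore, alphas, nums

word = Word(alphas)
number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field + "." + ip_field + "." + ip_field + "." + ip_field)
value = Or([number, word, list_of_words, ip]).setResultsName("value")

# Made-up sample values, one per alternative in the Or expression.
samples = ["4", "localhost", "idle, gilliam", "192.168.0.1"]
parsed = {text: value.parseString(text).asList() for text in samples}

for text, tokens in parsed.items():
    print(repr(text), "->", tokens)
```

Or tries every alternative and keeps the longest match, which is why the address comes back as a single combined token instead of stopping at the number 192, and why the comma-separated names come back as one grouped list.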

definition_line = Group(definer+Suppress("=")+value+LineEnd())
namespace = Group(header+\
                  OneOrMore(definition_line).setResultsName("definition_lines"))
all = OneOrMore(namespace).setResultsName("namespaces")

Now, what have we done here? Let’s review from top to bottom. The complete grammar, defined as all, consists of one or more (OneOrMore()) namespaces. Each namespace consists of a header and one or more definition lines. And each definition line consists of a definer and a value. We added some Group() clauses, as well as .setResultsName() on the parts of the result we wanted to name, and we are ready to parse! Get on with it!

result = all.parseString(content)

Huh? That’s it?!? Yes. Our result is neatly placed in a nice tree structure we can traverse with the attributes we set with .setResultsName(). You can check it with these:

for namespace in result.namespaces:
    print(namespace.header)
    for definition_line in namespace.definition_lines:
        print(definition_line.definer)
        print(definition_line.value)
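The original question was how to end up with a Python dictionary, so here is one possible way to do that last step. It is a sketch under stated assumptions, not the full parser from this post: the grammar is deliberately cut down to number and single-word values, the sample content is made up, and the names grammar and config are my own.

```python
from pyparsing import Group, LineEnd, OneOrMore, Or, Suppress, Word, alphas, nums

# Hypothetical sample text; the keys and values are invented for this sketch.
content = """\
[timeout]
time = 4

[users]
admin = cleese
"""

word = Word(alphas)
number = Word(nums)
header = Suppress("[") + word.setResultsName("header") + Suppress("]") + LineEnd()
value = Or([number, word]).setResultsName("value")
definition_line = Group(
    word.setResultsName("definer") + Suppress("=") + value + LineEnd()
)
namespace = Group(
    header + OneOrMore(definition_line).setResultsName("definition_lines")
)
grammar = OneOrMore(namespace).setResultsName("namespaces")

result = grammar.parseString(content)

# Walk the named results and collect them into nested dictionaries.
config = {}
for namespace_tokens in result.namespaces:
    section = {}
    for line in namespace_tokens.definition_lines:
        section[line.definer] = line.value
    config[namespace_tokens.header] = section

print(config)
```

All values come back as strings here; converting "4" to an integer would be a natural job for a parse action on number.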

Of course, I won’t be kind enough to present you with the complete parser we just built here. Fetch the content from a file and do a proper (non-wildcard) import of all the building blocks from pyparsing.

What is so great about pyparsing? Well, we don’t have to learn regular expressions. We got our results in a nice data structure. The library is in Python and it is very easy to dive into. But here is the greatest asset: it is readable, so you can go back to the code at any time and modify it with ease!

  1. 2 May 2010 at 01:04

    Wow! Thanks for such glowing praise for pyparsing! Let me pass along some style points that I’ve started to settle on as I’ve responded to other posters’ questions and blog articles:

    – Punctuation – I’ve found that I don’t much like terms like “Suppress(‘=’)” sprinkled all about my grammar definition. So in my most recent posted work, I usually start off my grammar with a single line that defines symbols for all of those to-be-suppressed punctuation marks, delimiters, etc. For your parser, this would look like:

    LBRACK,RBRACK,EQ = map(Suppress,"[]=")

    Now all the references keep most of their readability, while doing the proper token suppression.

    – delimited lists – a list of items separated by commas (or by *some* delimiter) is so common that pyparsing includes the delimitedList helper method, with the default delimiter being the comma. So in place of:

    list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)

    you can just write:

    list_of_words = Group(delimitedList(word))

    Also note that delimitedList will accept a “list” containing a single item (as does your own list_of_words). So later on, your value expression does not need to match for both word and list_of_words.

    – setResultsName – This function was one that I added early on in pyparsing, because I *really* wanted an easy way to get at the tokens matching specific parts of the grammar. Your parser is a nice example showing how you iterate over the different parsed pieces using the object attribute access form (my personal favorite as well). But I wasn’t happy with how verbose the name of this method is.

    You can make your parser a little easier to read if you use the newer form of setResultsName, which is to simply follow each to-be-named expression with (“the-name”). For instance, your definition expression changes from:

    definer = word.setResultsName("definer")
    value = Or([number, word, list_of_words, ip]).setResultsName("value")

    to:

    definer = word("definer")
    value = Or([number, word, list_of_words, ip])("value")

    I also usually add results names to primitive terms not at their base definition, but as they are composed into larger expressions. This way, a basic expression (like “integer”) could be reused for various different parts in other expressions, more like this:

    definition_line = Group(word("definer") + EQ + value("value") + LineEnd())

    (See that the special definition for definer is no longer needed.) But either form will work just fine.

    – operator overloading – This is an API feature that can get abused, and I may be flirting with the boundaries of taste in pyparsing, with the latest additions of support for ‘-’, ‘*’, and ‘&’. But I had done a primitive version of pyparsing a loooooong time ago using Java, and with no op overloading available in that language, constructs like “Or([number, word, list_of_words, ip])” are unavoidable. I chose to embrace this feature of Python, and so with pyparsing you can write an expression equivalent to your value expression as “(number ^ word ^ list_of_words ^ ip)”, and the ‘^’ operator returns Or expressions. “Or” performs a longest match, which is important so as to remove the ambiguity in matching the leading number in an IP address; however, to find the longest match, Or must try to evaluate *all* of the options in order to see which comes out longest. If you take care to order your tests from most-restrictive to least-restrictive, you can take advantage of the more efficient MatchFirst. Using the ‘|’ operator to generate MatchFirst expressions, you could write value as “(ip | number | list_of_words | word)”. (Note that I changed the order so that we test for an IP address before a lone integer, so as not to accidentally treat the leading component of an IP address like “192.168.0.1” as the integer 192, and then get confused by the remaining “.168.0.1” part.)

    I hope you continue to enjoy working with pyparsing – please post other comments or questions back to the pyparsing wiki!

    — Paul
