Posts Tagged ‘pyparsing’

Pattern extraction with pyparsing

My previous post that had to do with pyparsing considered exact parsing grammars, that is, parsers that are applied to completely structured text, such as configuration files or logs. But sometimes there is a need to look for patterns in files, and you can also use pyparsing for this end. The key difference is using the pyparsing grammar objects’ scanString() method instead of parseString() method.

I wrote two convenience functions that use these methods:

def match_one(grammar, text):
.... try:
........ match, start, end = grammar.scanString(text).next()
........ return text[:start].strip(), match, text[end:].strip()
.... except StopIteration:
........ # no match found
........ return text, None, None

This one searches for grammar (pattern) in the text, then returns the texts before the match, the match itself, and the text after the match. It only looks for one occurrence of the pattern in the text. The scanString() method itself returns a generator object that yields tuples of the match (as pyparsing result objects), the start position and the end position of the match in the text.

Here is an example on looking numbers in text:

grammar = Word(nums)
text = 'some 3 text 45 in'
print match_one(grammar, text)

It returns ‘some’, ([‘3’], {}), ‘text 45 in’.

The other example is this one:

def search(grammar, text):
.... start_=0
.... result = []
.... for match, start, end in grammar.scanString(text):
........ result.append((text[start_:start].strip(), match))
........ start_=end
.... result.append((text[start_:].strip(),None))
.... return result

This searches for all occurrences in the text, giving the text between the matches and the match itself, as a tuples. The last item in the list is the remainder and None. Here is the sample from above:

print search(grammar, text)
[('some', (['3'], {})), ('text', (['45'], {})), ('in', None)]

These are especially helpful if you need a lexical extraction or text highlighting. They can process whole batches of text in a few moments.

What is so great about pyparsing?

19 April 2010 1 comment


simple .ini parser with pyparsing

The answer to the question above is: readable regular expressions. Code readability is probably the most common reason (or one of the most common reasons) why someone decides to code something in python. When you come to an area of regular expressions, no matter how good you are at writing them, the problem of revisiting, refactoring and modifying is always quite big. It is simple – the regular expressions, no matter how much powerful, are quite unreadable. Pyparsing deals with this problem.

Pyparsing is a library completely written in python that provides a set of classes and utilities for building a readable grammar expression for parsing of extremely complex structured texts. You don’t need to know regular expressions, in fact, during the time I used it – I never stumbled upon a regular expression created by this library or one that the library needed me to code, however, you can start with parsing your text right away. If you’re not familiar with the topic of parsing, you can read a very good introductory article by Paul McGuire here, where he also explains the formal declaration of parsing grammars known as Backus-Naur form (BNF).

The same author also wrote this wonderful beginner’s book which includes about 90% of all you’ll need to know on parsing with pyparsing.

There is no better way of presenting this one than with an example. Allow me the pleasure to have a non-sense windows .ini configuration file in the manner of (monty) python:


time = 4

names = idle, gilliam

Now,  the problem is how to fetch this into a useful python dictionary? We notice that the configuration file is separated in 3 namespaces (db, timeout, users), and each of them contains one or more definition lines that contain the literal “=”. How does pyparsing work? It works by creating different grammars for all the elements in the texts and later combining and grouping them in one unified grammar. Maybe also defining specific parse actions or setting names. Let’s go on with a “hello world” example:

from pyparsing import Word, alphas

grammar = Word(alphas)
tokens = grammar.parseString("hello!")
print tokens

– result –

You can see that the exclamation point did not enter the resulted token since the grammar expression is just a word with alphabet characters.

Let’s dive into our problem. Each of the three namespaces has a header in brackets. We will define it as:

word = Word(alphas)
header = Suppress("[")+word.setResultsName("header")+Suppress("]")+LineEnd()

Of course, all the new names you’ll have to import from pyparsing (Suppress, LineEnd and later some others). We first defined a word grammar because we will use it again later. Suppress will tell the grammar not to include this expression in the results, thus, preventing the clutter of brackets in the end. One nice thing about this is the .setResultsName() method that enables referencing specific name from the resulting tokens.

We see that all the definition lines are sepatated with “=” and on the left side is the definer which is a simple word. The values on the right side, however, are varying, and are one of the following: word, list of worlds, number, ip. Thus, we have the following grammars:

number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field+"."+ip_field+"."+ip_field+"."+ip_field)

definer = word.setResultsName("definer")
value = Or([number, word, list_of_words, ip]).setResultsName("value")

Here, we can see how to build list of stuff separated with something (ZeroOrMore) and combining tokens into one (Combine). The Word grammar object has parameters for limitations of its definition like max in this example, but also exact, bodyChars and min. Also, as our right side in the definition line varies, we use the Or expression builder. Of course, there is also And().

Now, we are moving to the finalization of our parser. We have all our elementary building blocks needed (header, definer, value), so, we can build the more complex ones. This is how:

definition_line = Group(definer+Suppress("=")+value+LineEnd())
namespace = Group(header+\
all = OneOrMore(namespace).setResultsName("namespaces")

Now, what have we done here? Let’s review from top to bottom. The complete grammar defined as all consists of one or more (OneOrMore() ) namespaces. Each namespace consists of header and one or more definition lines. And each definition line consists of a definer and a value. We added some Group() clauses as well as .setResultsName() on the parts we liked to name our result – and we are ready to parse! Get on with it!

result = all.parseString(content)

Huh? That’s it?!? Yes. Our result is neatly placed in a nice tree structure we can traverse with the attributes we set with .setResultsName(). You can check it with these:

for namespace in result.namespaces:
....print namespace.header
....for definition_line in namespace.definition_lines:
........print definition_line.definer
........print definition_line.value

Of course, I won’t be kind enough to present you with the complete parser we just built here. Fetch the content from a file and do proper (non-wild) import of all the building parts from pyparsing.

What is so great about pyparsing? Well, we don’t have to learn regular expressions. We got out results in a nice data structure. The library is in python and it is very easy to dive into it. But here is the greatest asset: it is readable – you can go back to the code at any time and modify it with ease!

Categories: programming Tags: , ,

For starters…

Sometimes a simple google search simply can not give you the answers you are looking for. You might have to look up a book reference, browse through some blogs (the forums are not that useful no more), and, at times, even the official documentation is not enough. I recently read a very thoughtful post which underlines (among the multitude of things) the importance of the negative feedback. And, a few years ago, I learned from a very good physics lecturer that the best way to learn about a topic is to write a book on it.

Well, I’m not going to write a book, but I can try to write a blog. Its intention is not to teach someone on the topics on programming, (mostly on python and linguistic processing), but to learn along the path of writing it and to learn from the negative feedback.

I recently started working on linguistic processing. I use pyparsing and nltk for it.

I like using werkzeug and jinja2 for a web interface. I use sqlalchemy to talk to a database. Goodbye sql.

A friend of mine who has quite an experience in this field recommended django. I was hesitating but decided to go the the loosely coupled option since I like jinja2 better than django templates and I like sqlalchemy more than the django orm. The convenience of the other django stuff is mostly unnecessary.

That’s it for now, stay tuned if any of the attached tags interests you.