Archive for July, 2010

Pattern extraction with pyparsing

My previous post that had to do with pyparsing considered exact parsing grammars, that is, parsers that are applied to completely structured text, such as configuration files or logs. But sometimes there is a need to look for patterns in files, and you can also use pyparsing for this end. The key difference is using the pyparsing grammar objects’ scanString() method instead of parseString() method.

I wrote two convenience functions that use these methods:

def match_one(grammar, text):
.... try:
........ match, start, end = grammar.scanString(text).next()
........ return text[:start].strip(), match, text[end:].strip()
.... except StopIteration:
........ # no match found
........ return text, None, None

This one searches for grammar (pattern) in the text, then returns the texts before the match, the match itself, and the text after the match. It only looks for one occurrence of the pattern in the text. The scanString() method itself returns a generator object that yields tuples of the match (as pyparsing result objects), the start position and the end position of the match in the text.

Here is an example on looking numbers in text:

grammar = Word(nums)
text = 'some 3 text 45 in'
print match_one(grammar, text)

It returns ‘some’, ([‘3’], {}), ‘text 45 in’.

The other example is this one:

def search(grammar, text):
.... start_=0
.... result = []
.... for match, start, end in grammar.scanString(text):
........ result.append((text[start_:start].strip(), match))
........ start_=end
.... result.append((text[start_:].strip(),None))
.... return result

This searches for all occurrences in the text, giving the text between the matches and the match itself, as a tuples. The last item in the list is the remainder and None. Here is the sample from above:

print search(grammar, text)
[('some', (['3'], {})), ('text', (['45'], {})), ('in', None)]

These are especially helpful if you need a lexical extraction or text highlighting. They can process whole batches of text in a few moments.