Author Archive

Casual mysql fail – careful with that group_concat

4 October 2010

I was doing a minor maintenance job on a non-normalized table, which required grouping large sets of rows and gathering their IDs, which in turn were to be processed by another routine. Each group was supposed to return a comma-separated list using MySQL's group_concat. Yet some groups failed to do so, resulting in strings that ended with a comma (this was a bug that was very hard to trace, actually).

As it turns out, the string produced by group_concat has a length limit! Now, let's say for a moment that this were acceptable RDBMS design (it is not!) – the server does not throw an exception when this limit is exceeded, nor does it give any warning – which was the reason for some tough debugging. It simply truncates the result to a shorter list!

That is plain wrong!

In no imaginable scenario should an aggregate function be allowed to return a false result. I really don't care if you push my server into swap or make me wait an hour – the most important thing is that my query gets the correct data. No optimization is ever more valuable than reliability and getting the right data.
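For reference, the truncation threshold is controlled by the group_concat_max_len server variable, which defaults to a mere 1024 bytes. A minimal sketch of raising it for the session before running the grouping query – the connection string, table and columns here are just illustrative placeholders:

from sqlalchemy import create_engine

engine = create_engine("mysql://user:pass@localhost/mydb")  # placeholder connection string
conn = engine.connect()
# raise the per-session cap (the default of 1024 bytes silently truncates long lists)
conn.execute("SET SESSION group_concat_max_len = 1000000")
rows = conn.execute(
    "SELECT group_id, GROUP_CONCAT(id) FROM items GROUP BY group_id")  # placeholder table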


Try to use the relational database to its full potential

24 September 2010

or:

How the search engines did not do for me what an RDBMS did.

This will be a quick post, though.

For my latest project I have a highly relational schema for a database on a PostgreSQL server. The search API steadily grew more complicated, and I finally had the chance to install and use the Apache Solr search engine. To simplify matters for this post, let's say that for each document I had a short title, a category, two full-text fields, and two different sets of tags. I had a pretty good idea how to set up Solr, its data-import-handler, how the delta import would work, how the schema would be designed, even a faceting strategy for my web interface. I did a lot of work on some of this, and then I got stuck.

You cannot query the Solr fields with: *text*.

This was a fail. It burned my time. I understand why the Solr people didn't implement such a thing. As a user, I would not want to search for a part of a word inside other words; however, this was not only something the client insisted on – the nature of the application demanded such lookups. I had to have *text* queries.

A colleague of mine strongly insists on sphinxsearch. I am aware it has good performance, and you can also work out different sorts of matching, *text* included. But very soon I gave up – it had very poor support for delta imports.

Then I decided to use brute force. I wrote a routine that collapsed the multivalued fields into a single field, along with a method for querying it, and then put all the documents to be indexed into one big PostgreSQL table, one row per document. I made a cron job for delta-updates on documents and switched the search API from the complex joins to a simple query of the big table. Of course, after the initial population I made a whole bunch of different simple and composite indices on the big table for this occasion.
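Just to illustrate the idea (with a made-up big_documents table and title column standing in for the real flattened table), the *text* lookup boils down to something like:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder connection string
pattern = "%" + "text" + "%"
# ILIKE with wildcards on both sides gives exactly the *text* semantics Solr refused to do
hits = engine.execute(
    text("SELECT id, title FROM big_documents WHERE title ILIKE :pattern"),
    pattern=pattern).fetchall()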

One cannot expect the same querying speed as with Solr, but the speed is satisfactory. And, most importantly, I retained the ability to query whatever I like and to update in an optimized manner. Finally, I suppose the point of this task was that you should not rush to pile more technologies into your product; first try to use the tools you already have, in this case your RDBMS. Sometimes even the most complicated problems have very simple solutions.

I got an answer from David Gross

Today I was very surprised to discover that I got a video answer on YouTube from the renowned physicist and Nobel Laureate David Gross – as part of the nobelprize.org event “Ask a Nobel Laureate”. I found out about this event on its Facebook page, but at first didn't give it too much attention, since there were a lot of nonsense questions along with pseudo-science and science fiction. But later I decided to give it a try: I am a physicist, I know a great deal about the Standard Model, I understand what the guys at the Large Hadron Collider (LHC) are doing, and what I asked about was a genuine interest of mine. Here is my question [look for Baze Petrushev]:

Suppose we fail at discovering the Higgs boson and SUSY (supersymmetry) at LHC – do physicists have an alternative theory on mass generation and supersymmetry? And if not, will the data from LHC be enough for physicists to invent a replacement of (part of) the Standard Model?

This was partly out of my concern about the sometimes overwhelming self-confidence that physicists have in the established models of science. Lord Kelvin, for example, was strongly convinced that in his time only a minuscule part of the knowledge of the natural world remained unknown to mankind. I guess that since his time the world-view of the science community has changed, as a myriad of beautiful, strange theories emerged.

After all, the Standard Model has survived the test of time many times over, becoming the most precise theory we have, along with the General Theory of Relativity. And even if its part with the Higgs boson and/or supersymmetry fails, the good guys will find a way to fix it. At least, this is what David Gross assures me of in his answer:

What happens if the LHC fails to find the Higgs boson?

Pattern extraction with pyparsing

My previous post on pyparsing considered exact parsing grammars, that is, parsers applied to completely structured text such as configuration files or logs. But sometimes there is a need to look for patterns in files, and you can use pyparsing for this as well. The key difference is using the grammar objects' scanString() method instead of the parseString() method.

I wrote two convenience functions that use these methods:


def match_one(grammar, text):
    try:
        match, start, end = grammar.scanString(text).next()
        return text[:start].strip(), match, text[end:].strip()
    except StopIteration:
        # no match found
        return text, None, None

This one searches for the grammar (pattern) in the text and returns the text before the match, the match itself, and the text after the match. It only looks for one occurrence of the pattern. The scanString() method itself returns a generator that yields tuples of the match (as a pyparsing results object), the start position, and the end position of the match in the text.

Here is an example of looking for numbers in text:


from pyparsing import Word, nums

grammar = Word(nums)
text = 'some 3 text 45 in'
print match_one(grammar, text)

It returns ‘some’, ([‘3’], {}), ‘text 45 in’.

The other example is this one:


def search(grammar, text):
    start_ = 0
    result = []
    for match, start, end in grammar.scanString(text):
        result.append((text[start_:start].strip(), match))
        start_ = end
    result.append((text[start_:].strip(), None))
    return result

This one searches for all occurrences in the text, giving the text between the matches and the match itself as tuples. The last item in the list is the remainder paired with None. Here is the sample from above:


print search(grammar, text)
[('some', (['3'], {})), ('text', (['45'], {})), ('in', None)]

These are especially helpful if you need lexical extraction or text highlighting. They can process whole batches of text in a few moments.
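For illustration, a rough sketch of the highlighting case, built on the search() function above – wrap every match in a marker and stitch the pieces back together:

def highlight(grammar, text, marker="**"):
    # wrap every match in a marker, keep the plain text between matches as-is
    parts = []
    for between, match in search(grammar, text):
        parts.append(between)
        if match is not None:
            parts.append(marker + " ".join(match) + marker)
    return " ".join(parts)

print highlight(Word(nums), 'some 3 text 45 in')
# -> some **3** text **45** in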

On sqlalchemy – what do you put in your base model?

My previous post covered the reflective use of sqlalchemy, and there I showed how to set up a codebase with a BaseModel and a schema mapper. I put only one member in my BaseModel, the session property, which returns the session that holds the instance (or None if the instance is not added to one).

Now, before I continue, just a short description of this test database. The director table holds the movie directors; lastname is required and unique. The movie table holds the movies; the movie field is required and unique, director_id is the link to the director (it can be null, for an unknown/not-added director), and year is also optional. The genre table holds just the genre names, and movie_genre is only an association table. Before you continue, you should check the mappers we added in the previous post.
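As a quick aside, a couple of sample rows can be added through the ORM along these lines (plain attribute assignment, since the classically mapped models have no keyword constructor):

d = Director()
d.firstname, d.lastname = "Stanley", "Kubrick"
session.add(d)

m = Movie()
m.movie, m.year = "A Clockwork Orange", 1971
m.director = d                     # relationship set up by the mappers
session.add(m)
session.commit()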

There is an easy way to view all this information with one joined query, but we'll do it the sqlalchemy way, that is, we'll query it with its ORM.

from sqlalchemy.orm import eagerload_all
movies = session.query(Movie)\
    .options(eagerload_all(Movie.director, Movie.genres))\
    .limit(10).all()

Notice that we added an option called eagerload_all, which specifies which mapped properties of the queried objects should be loaded on the fly into the session – so-called eager loading. Use this if and only if you are certain that you will need these properties. We added a limit of 10 only for clarity. If I print these objects like this:

for m in movies:
    print "\"%s\" (%s) by %s %s (%s)" \
        % (m.movie, m.year, m.director.firstname, m.director.lastname,
           ", ".join([g.genre for g in m.genres]))

I get my sample database:

"A Clockwork Orange" (1971) by Stanley Kubrick (Sci-fi)
"Dr. Strangelove" (None) by Stanley Kubrick (Comedy)
"2001: A Space Oddisey" (1969) by Stanley Kubrick (Post-modern, Sci-fi)
"Lolita" (None) by Stanley Kubrick ()
"Shadow of a Doubt" (None) by Alfred Hitchcock (Noir, Mystery)
"Vertigo" (1957) by Alfred Hitchcock (Mystery)
"Citizen Kane" (1941) by Orson Welles ()
"Europa" (1991) by Lars von Trier ()
"Annie Hall" (None) by Woody Allen (Comedy)
"Element of Crime" (1984) by Lars von Trier ()

Now that we have test data, I'll continue with my BaseModel.

There is always the question of what kind of keys we should use. We can agree that incremental unsigned integers for every table are a good way of efficiently linking the tables, removing the complexity of composite foreign keys or keys with slow look-ups, but sometimes people tend to let these ids leak into the application. Surely, a director with id 4 bears no information, but when you say director 'Hitchcock' you get meaningful data in your application. So, my approach is: keep the ids in your database and out of the application, and use a 'natural' key to refer to records from the application.

If I proceed in this manner, it would be good to have a load() method associated with each model that returns a unique record/object (if it exists) given a set of arguments. Retrieving unique records with sqlalchemy is very easy: its querying mechanism has a method one() that in the background does a LIMIT 2 – if two records are found, it throws MultipleResultsFound; if there are no results, it throws NoResultFound; and if it finds exactly one, it returns it. Here's how:

d = session.query(Director)\
    .filter(Director.lastname == 'Hitchcock').one()
print d.firstname

One genuine feature of the Python language is class methods, which implicitly take the class as the first parameter. So, instead of doing this:

class Director(BaseModel):
    @staticmethod
    def load(session, lastname):
        return session.query(Director)\
            .filter(Director.lastname == lastname)\
            .one()

which we ought to do for every model, we can put something more generic in the BaseModel:

    @classmethod
    def load(cls, session, **kwargs):
        q = session.query(cls)
        filters = [getattr(cls, field_name) == kwargs[field_name]
                   for field_name in kwargs]
        return q.filter(and_(*filters)).one()

With this, we get a load method for all of the models! Sorry, but I don't think you can do a similar thing in many other languages – please let me know if you can. The tricky part in this method is the list comprehension, but it is quite straightforward: we just 'compile' a list of filters, one for every named argument. For example, for the Movie model, the call:

Movie.load(session, movie='Europa', year=1991)

translates the list comprehension into:

[Movie.movie=='Europa', Movie.year==1991]

Later, we pass the list as an argument list to and_() (from sqlalchemy.sql.expression), which compiles a filter clause out of these two filters. The cls argument is actually the Movie class/model, implicitly passed when the method is called.
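Since one() raises when the lookup is empty or not unique, a caller will typically guard the load, roughly like this:

from sqlalchemy.orm.exc import NoResultFound, MultipleResultsFound

try:
    m = Movie.load(session, movie='Europa', year=1991)
except NoResultFound:
    m = None    # nothing matched the given 'natural' key
except MultipleResultsFound:
    raise ValueError("the given arguments do not identify a unique movie")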

In fact, one very basic class method is __new__, which is another of Python's distinctive features. If it is not overridden, it simply creates and returns the new instance; if it is overridden, it may return the default instance, some other specific object, or nothing at all. If it does return an instance of the class, the __init__ method is then called on the newly created object. This method is quite handy, for example, for constructing the singleton pattern. No one is likely to try to put a BaseModel instance into a session, but we'll forbid it anyway:

def __new__(cls):
    if cls == BaseModel:
        raise TypeError, "You can not instantiate a BaseModel"
    return super(BaseModel, cls).__new__(cls)

So, we raise a TypeError if someone tries to create a BaseModel instance directly, and if it is a child class, we simply create and return the instance as usual.
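A quick sanity check of the behaviour:

try:
    BaseModel()
except TypeError, e:
    print e                 # You can not instantiate a BaseModel

d = Director()              # child classes instantiate as usual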

For deleting a single object (using the session), sqlalchemy provides:

session.delete(obj)

We would want to provide access to deletion from the model instance itself, simply:

def drop(self):
    self.session.delete(self)

where one can see the usefulness of having a session property on the instance. To use it:

d = Director.load(session, lastname='Hitchcock')
d.drop()

My next item in the BaseModel is a serializer. I'll make a JSON serialize method: I will simply serialize the object's __dict__ without the _sa_instance_state attribute, which cannot be encoded. Simply:

def serialize(self):
    # requires simplejson to be imported at module level
    tmp = self.__dict__.copy()
    del tmp['_sa_instance_state']
    return simplejson.dumps(tmp)
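Usage is as simple as (the output shown is only illustrative, it depends on the actual row):

d = Director.load(session, lastname='Hitchcock')
print d.serialize()
# e.g. {"id": 2, "firstname": "Alfred", "lastname": "Hitchcock"}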

Finally, I needed a specific method for one model in a project, and it found its way into the BaseModel. Namely, I want to know the position of a record in a table when I sort by a particular column; it is actually the count of records whose values in the sorting column are smaller than the value of that particular record. I can put this in my BaseModel:

def position(self, order):
    cls = self.__class__
    return self.session.query(cls)\
        .filter(getattr(cls, order) < getattr(self, order))\
        .count()

The __class__ property of the instance simply returns the particular model.
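For example, roughly:

m = Movie.load(session, movie='Vertigo')
print m.position('year')    # count of movies released in an earlier year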

The next post on sqlalchemy will probably have some methods and properties in the specific models – all done with sqlalchemy's ORM querying – just some stuff I had to learn by trial and error. But I think my next post will deal with pyparsing again, and I will parse in a different way than last time.

A reflective approach on sqlalchemy usage

Sometimes we are forced to work with legacy databases and we don't have a beautiful green field on which to create our schema declaratively. Sqlalchemy provides a very thorough way of reflecting the schema into our business logic and mapping the tables and relations onto our models.

Say we have a simple database for a video store according to the following entity relationship diagram.
Videostore ER diagram
(created with dbvis)

I'll split the code into two modules: base and models. The base module will contain the reusable code that we can use with any db schema later on, while the models module will contain the specifics for this schema. We'll have a simple MySQL database with a unicode charset.

The base module contains one function and one class. Let’s import some stuff:

from sqlalchemy import MetaData, create_engine
from sqlalchemy.orm import mapper, sessionmaker
from sqlalchemy.orm.session import object_session

We will put our BaseModel class here, and extend the other models from it.

class BaseModel(object):
    @property
    def session(self):
        return object_session(self)

I won't argue about the usefulness of having a base model; I'll show what we can put in it in a post yet to come. Sqlalchemy gives us the ability to fetch the session that holds an instance of some model by using object_session. Of course, if the object is not added to a session, this will return None. Next is the reflect function:

def reflect(connection_string, models):
    metadata = MetaData()
    metadata.bind = create_engine(connection_string)
    metadata.reflect()
    mappers = {}
    for table_name in metadata.tables:
        model_name = "".join(part.capitalize()
                             for part in table_name.split("_"))
        try:
            model = getattr(models, model_name)
        except AttributeError:
            raise NameError, "Model %s not found in module %s" \
                             % (model_name, repr(models))
        mappers[table_name] = mapper(model, metadata.tables[table_name])
    Session = sessionmaker(metadata.bind, autocommit=False)
    return (mappers, metadata.tables, Session)

In the reflect function, after metadata.reflect() is called, the metadata contains the tables reflected from our schema. We proceed to create mappers over our models. To do this in a more automated manner, let us adopt a convention that our model names will simply be the table names in CamelCase, e.g., the director table will map to the model Director and movie_genre to MovieGenre. We need to have these models declared in a module that gets passed as the second parameter to the reflect function.

After the iteration over metadata.tables, our models will be mapped to our tables. One thing that also makes use of the metadata is the Session class, which we generate with sqlalchemy's sessionmaker. The autocommit setting is a matter of personal preference; I personally want to have more control over whether instances in the session are committed or expunged in my applications. We complete our reflect module by returning the mappers, the tables and the Session class.

Now we move to models, the code specific to our schema. We will need these:

from sys import modules
from sqlalchemy.orm import relationship
from base import BaseModel, reflect

One should notice the 0.5 -> 0.6 sqlalchemy renaming of relation to relationship here. We won't put any code in the models (maybe in the next post):

class Director(BaseModel):
    pass

class Movie(BaseModel):
    pass

class Genre(BaseModel):
    pass

class MovieGenre(BaseModel):
    pass

And, we define the mappers in a function that is some kind of functional extension to the reflect function:

def map(connection_string):
    models = modules['models']
    mappers, tables, Session = reflect(connection_string, models)
    mappers["director"].add_properties({
        "movies": relationship(models.Movie,
                               backref="director",
                               cascade="all, delete-orphan")
    })
    mappers["movie"].add_properties({
        "genres": relationship(models.Genre,
                               backref="movies",
                               secondary=tables['movie_genre'])
    })
    return (mappers, tables, Session)

Notice that we got the 'models' module (needed as the second argument to the reflect function) from sys.modules – it is this very module itself. One can argue about how to implement/extend the relationships, but the point is that we want easy access to a session, which here is simply:

session = map(conn_str)[2]()

Finally, I want to stress that this is not meant to be code that you simply put into a production environment, but an outline of a way to get the most out of an already built database schema using sqlalchemy. After all, we live and work in a world where you cannot rely on your current database setup and have to make frequent changes to the schema. The declarative way of doing things then becomes unaffordable.


What is so great about pyparsing?

19 April 2010

or:

simple .ini parser with pyparsing

The answer to the question above is: readable regular expressions. Code readability is probably the most common reason (or one of the most common reasons) why someone decides to code something in Python. When you come to the area of regular expressions, no matter how good you are at writing them, the problem of revisiting, refactoring and modifying them is always quite big. It is simple – regular expressions, however powerful, are quite unreadable. Pyparsing deals with this problem.

Pyparsing is a library written entirely in Python that provides a set of classes and utilities for building readable grammar expressions for parsing extremely complex structured text. You don't need to know regular expressions – in fact, during the time I used it I never stumbled upon a regular expression created by this library, or one that the library needed me to write – you can start parsing your text right away. If you're not familiar with the topic of parsing, you can read a very good introductory article by Paul McGuire here, where he also explains the formal declaration of parsing grammars known as the Backus-Naur form (BNF).

The same author also wrote this wonderful beginner’s book which includes about 90% of all you’ll need to know on parsing with pyparsing.

There is no better way of presenting this than with an example. Allow me the pleasure of a nonsense Windows .ini configuration file in the manner of (Monty) Python:

[db]
user=eric
pass=idle

[timeout]
ip=127.0.1.1
time = 4

[users]
names = idle, gilliam

Now, the problem is how to fetch this into a useful Python dictionary. We notice that the configuration file is separated into 3 namespaces (db, timeout, users), and each of them contains one or more definition lines containing the literal "=". How does pyparsing work? It works by creating different grammars for all the elements in the text and later combining and grouping them into one unified grammar, perhaps also defining specific parse actions or setting result names. Let's go on with a "hello world" example:

from pyparsing import Word, alphas

grammar = Word(alphas)
tokens = grammar.parseString("hello!")
print tokens

– result –
['hello']

You can see that the exclamation point did not enter the resulted token since the grammar expression is just a word with alphabet characters.

Let’s dive into our problem. Each of the three namespaces has a header in brackets. We will define it as:

word = Word(alphas)
header = Suppress("[")+word.setResultsName("header")+Suppress("]")+LineEnd()

Of course, you'll have to import all the new names from pyparsing (Suppress, LineEnd and later some others). We first defined a word grammar on its own because we will use it again later. Suppress tells the grammar not to include the expression in the results, thus preventing the clutter of brackets in the end. One nice thing here is the .setResultsName() method, which enables referencing a specific name from the resulting tokens.
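A quick check of the header grammar on its own:

tokens = header.parseString("[db]\n")
print tokens.header
# -> db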

We see that all the definition lines are separated with "=": on the left side is the definer, which is a simple word. The values on the right side, however, vary, and are one of the following: a word, a list of words, a number, or an IP address. Thus, we have the following grammars:

number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field+"."+ip_field+"."+ip_field+"."+ip_field)

definer = word.setResultsName("definer")
value = Or([number, word, list_of_words, ip]).setResultsName("value")

Here we can see how to build a list of items separated by a delimiter (ZeroOrMore) and how to combine tokens into one (Combine). The Word grammar object has parameters for constraining its definition, like max in this example, but also exact, bodyChars and min. Also, as the right side of the definition line varies, we use the Or expression builder. Of course, there is also And().
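For instance, the Combine expression glues the four number fields and the dots into a single token:

print ip.parseString("127.0.1.1")
# -> ['127.0.1.1']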

Now, we are moving to the finalization of our parser. We have all our elementary building blocks needed (header, definer, value), so, we can build the more complex ones. This is how:

definition_line = Group(definer + Suppress("=") + value + LineEnd())
namespace = Group(header +
                  OneOrMore(definition_line).setResultsName("definition_lines"))
all = OneOrMore(namespace).setResultsName("namespaces")

Now, what have we done here? Let's review from top to bottom. The complete grammar, defined as all, consists of one or more (OneOrMore()) namespaces. Each namespace consists of a header and one or more definition lines. And each definition line consists of a definer and a value. We added some Group() clauses, as well as .setResultsName() on the parts of the result we wanted to name – and we are ready to parse! Get on with it!

result = all.parseString(content)

Huh? That’s it?!? Yes. Our result is neatly placed in a nice tree structure we can traverse with the attributes we set with .setResultsName(). You can check it with these:

for namespace in result.namespaces:
    print namespace.header
    for definition_line in namespace.definition_lines:
        print definition_line.definer
        print definition_line.value

Of course, I won't be kind enough to present you with the complete parser we just built here. Fetch the content from a file and do a proper (non-wildcard) import of all the building blocks from pyparsing.

What is so great about pyparsing? Well, we don't have to learn regular expressions. We get our results in a nice data structure. The library is in Python and it is very easy to dive into it. But here is the greatest asset: it is readable – you can go back to the code at any time and modify it with ease!


For starters…

Sometimes a simple Google search just cannot give you the answers you are looking for. You might have to look up a book reference, browse through some blogs (forums are not that useful any more), and at times even the official documentation is not enough. I recently read a very thoughtful post which underlines (among a multitude of things) the importance of negative feedback. And, a few years ago, I learned from a very good physics lecturer that the best way to learn about a topic is to write a book on it.

Well, I'm not going to write a book, but I can try to write a blog. Its intention is not to teach anyone about programming topics (mostly Python and linguistic processing), but to learn along the way of writing it and to learn from the negative feedback.

I recently started working on linguistic processing. I use pyparsing and nltk for it.

I like using werkzeug and jinja2 for a web interface. I use sqlalchemy to talk to a database. Goodbye sql.

A friend of mine who has quite a lot of experience in this field recommended django. I was hesitant, but decided to go with the loosely coupled option, since I like jinja2 better than django templates and I like sqlalchemy more than the django ORM. The convenience of the other django stuff is mostly unnecessary.

That’s it for now, stay tuned if any of the attached tags interests you.