Casual MySQL fail – careful with that GROUP_CONCAT
I was doing a minor maintenance job on a non-normalized table, which required grouping over large sets and gathering ids that were in turn to be processed by another routine. Each group was to return a comma-separated list using MySQL's GROUP_CONCAT. Yet some groups failed to do so, producing strings that ended with a comma (this was a bug that was very hard to trace, actually).
As it turns out, the string GROUP_CONCAT returns has a length limit! Now, let's say for a moment that this is acceptable RDBMS design (it is not!) – the server neither throws an exception on exceeding the limit nor surfaces any warning, which is what made the debugging so tough. It simply truncates the result into a shorter list!
That is plain wrong!
In no imaginable scenario is an aggregate function allowed to return a false result. I really don't care if you drive my server into swap or make me wait an hour – the most important thing is that my query returns the correct data. No optimization is ever more valuable than reliability and getting the right data.
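The limit in question is MySQL's group_concat_max_len session variable, which defaults to only 1024 bytes and can be raised with SET SESSION group_concat_max_len = .... Until you trust the limit, a cheap defense is to select COUNT(*) alongside GROUP_CONCAT() and verify that the item counts agree. Here is a minimal sketch of that check – it uses Python's stdlib sqlite3 (whose group_concat does not truncate) so it runs anywhere, and the table and column names are made up:

```python
import sqlite3

# Toy table: ids grouped by a group key (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (grp INTEGER, id INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 10), (1, 11), (2, 20)])

# Fetch COUNT(*) next to GROUP_CONCAT and compare: if the concatenated
# list was silently cut, the two counts disagree and we fail loudly
# instead of feeding a truncated list to the next routine.
groups = {}
for grp, n, ids in conn.execute(
        "SELECT grp, COUNT(*), GROUP_CONCAT(id) FROM items GROUP BY grp"):
    id_list = ids.split(",")
    assert len(id_list) == n, "group %s was truncated!" % grp
    groups[grp] = id_list

print(groups)
```

On MySQL, a failing assertion here is exactly the silent truncation this post complains about – caught at read time instead of three routines later.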
Try to use the relational database to its full potential
or:
How the search engines did not do for me what an RDBMS did.
This will be a quick post, though.
For my latest project I have a highly relational schema on a PostgreSQL server. The search API steadily grew more complicated, and I finally had the chance to install and use the Apache Solr search engine. To simplify matters for this post, let's say that each document had a short title, a category, two full-text fields, and two different sets of tags. I had a pretty good idea how to set up Solr, its data import handler, how the delta import would work, how the schema would be designed, even a faceting strategy for my web interface. I did a lot of the work, and then I got stuck.
You cannot query Solr fields with *text* – a pattern with a leading wildcard.
This was a fail. It burned my time. I understand why the Solr people didn't implement such a thing: as a user, I would not want a search to match fragments inside words. However, this was not only something the client insisted on – the nature of the application demanded such lookups. I had to have *text* queries.
A colleague of mine swears by Sphinx (sphinxsearch). I am aware it performs well, and you can also work out different kinds of matching, *text* included. But I gave up on it very soon – its support for delta imports was very poor.
Then I went for brute force. I created an algorithm that folds the multivalued fields into a single field, with a method for querying it, and put all the documents to be indexed into one big PostgreSQL table, one row per document. I made a cron job for delta updates on the documents and switched the search API from the complex joins to simple queries against the big table. Of course, after the initial population I also built a whole set of simple and composite indices on the big table.
One cannot expect the same query speed as with Solr, but the speed is satisfactory. And, most importantly, I retained the ability to query whatever I like and to update in an optimized manner. In the end, I suppose the point of this exercise is that you should not rush to pile more technologies into your product; first try to use the tools you already have – in this case, your RDBMS. Sometimes even the most complicated problems have very simple solutions.
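The flattened-table idea fits in a few lines of code. The sketch below uses Python's built-in sqlite3 in place of PostgreSQL so it is self-contained; the table, the columns, and the find() helper are all hypothetical names, and the query is the same plain substring LIKE that the big-table approach relies on:

```python
import sqlite3

# One row per document, with the multivalued fields folded into
# plain text columns (names are made up for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_docs (id INTEGER, title TEXT, tags TEXT)")
conn.executemany("INSERT INTO search_docs VALUES (?, ?, ?)", [
    (1, "Parsing with pyparsing", "python,parsing"),
    (2, "Postgres indexing tips", "postgresql,index"),
    (3, "Intro to Solr faceting", "solr,search"),
])

def find(fragment):
    # A *text* lookup: substring match anywhere in the title or tags.
    pattern = "%" + fragment + "%"
    cur = conn.execute(
        "SELECT id FROM search_docs"
        " WHERE title LIKE ? OR tags LIKE ? ORDER BY id",
        (pattern, pattern))
    return [row[0] for row in cur]

print(find("pars"))   # substring hit inside "Parsing"/"parsing"
print(find("index"))  # matches both the title and a tag of doc 2
```

For what it's worth, on PostgreSQL itself the pg_trgm extension can index exactly this kind of substring lookup, which keeps the brute-force approach fast as the big table grows.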
I got an answer from David Gross
Today I was very surprised to discover that I got a video answer on YouTube from the renowned physicist and Nobel laureate David Gross, as part of the nobelprize.org event "Ask a Nobel Laureate". I found out about this event on its Facebook page, but at first didn't give it too much attention, since there were a lot of nonsense questions, along with pseudo-science and space fiction. But later I decided to give it a try: I am a physicist, I know a great deal about the Standard Model, I understand what the people at the Large Hadron Collider (LHC) are doing, and what I asked was a genuine interest of mine. Here is my question [look for Baze Petrushev]:
Suppose we fail at discovering the Higgs boson and SUSY (supersymmetry) at LHC – do physicists have an alternative theory on mass generation and supersymmetry? And if not, will the data from LHC be enough for physicists to invent a replacement of (part of) the Standard Model?
This was part of my concern about the sometimes overwhelming self-confidence that physicists have in the established scientific models. Lord Kelvin, for example, was strongly convinced that in his time only a minuscule part of the knowledge of the natural world remained unknown to mankind. I guess that since his time the world-view of the science community has changed, as a myriad of beautiful, strange theories emerged.
After all, the Standard Model has survived the test of time many times over, becoming the most precise theory we have, along with the general theory of relativity. And even if its part with the Higgs boson and/or supersymmetry fails, the good guys will find a way to fix it. At least, this is what David Gross assures me in his answer:
What happens if the LHC fails to find the Higgs boson?
Pattern extraction with pyparsing
My previous post on pyparsing considered exact parsing grammars – parsers applied to completely structured text, such as configuration files or logs. But sometimes you need to look for patterns in files, and you can use pyparsing for that as well. The key difference is using the grammar objects' scanString() method instead of parseString().
I wrote two convenience functions that use these methods:
def match_one(grammar, text):
    try:
        match, start, end = next(grammar.scanString(text))
        return text[:start].strip(), match, text[end:].strip()
    except StopIteration:
        # no match found
        return text, None, None
This one searches for grammar (pattern) in the text, then returns the texts before the match, the match itself, and the text after the match. It only looks for one occurrence of the pattern in the text. The scanString() method itself returns a generator object that yields tuples of the match (as pyparsing result objects), the start position and the end position of the match in the text.
Here is an example of looking for numbers in text:
from pyparsing import Word, nums

grammar = Word(nums)
text = 'some 3 text 45 in'
print(match_one(grammar, text))
It returns the text before the match ('some'), the match itself (['3']), and the remainder ('text 45 in').
The other convenience function is this one:
def search(grammar, text):
    start_ = 0
    result = []
    for match, start, end in grammar.scanString(text):
        result.append((text[start_:start].strip(), match))
        start_ = end
    result.append((text[start_:].strip(), None))
    return result
This searches for all occurrences in the text, yielding the text between the matches together with the match itself, as tuples. The last item in the list is the remainder paired with None. Here is the same sample from above:
print(search(grammar, text))
[('some', (['3'], {})), ('text', (['45'], {})), ('in', None)]
These are especially helpful if you need lexical extraction or text highlighting. They can process whole batches of text in moments.
On sqlalchemy – what do you put in your base model?
A reflective approach on sqlalchemy usage
What is so great about pyparsing?
or:
simple .ini parser with pyparsing
The answer to the question above is: readable regular expressions. Code readability is probably one of the most common reasons why someone decides to write something in Python. When you reach the territory of regular expressions, no matter how good you are at writing them, the problem of revisiting, refactoring, and modifying them is always quite big. It is simple: regular expressions, however powerful, are quite unreadable. Pyparsing deals with this problem.
Pyparsing is a library written entirely in Python that provides a set of classes and utilities for building readable grammar expressions for parsing extremely complex structured text. You don't need to know regular expressions – in fact, in all the time I have used it, I never stumbled upon a regular expression created by the library, or one that the library required me to write – and you can start parsing your text right away. If you're not familiar with the topic of parsing, you can read a very good introductory article by Paul McGuire here, where he also explains the formal declaration of parsing grammars known as Backus-Naur form (BNF).
The same author also wrote this wonderful beginner's book, which covers about 90% of all you'll need to know about parsing with pyparsing.
There is no better way of presenting this than with an example. Allow me the pleasure of a nonsense Windows .ini configuration file in the manner of (Monty) Python:
[db]
user=eric
pass=idle
[timeout]
ip=127.0.1.1
time = 4
[users]
names = idle, gilliam
Now, the problem is how to read this into a useful Python dictionary. We notice that the configuration file is separated into 3 namespaces (db, timeout, users), each containing one or more definition lines with the literal "=". How does pyparsing work? By creating different grammars for all the elements in the text and then combining and grouping them into one unified grammar, perhaps also defining specific parse actions or setting result names. Let's start with a "hello world" example:
from pyparsing import Word, alphas
grammar = Word(alphas)
tokens = grammar.parseString("hello!")
print(tokens)
– result –
['hello']
You can see that the exclamation point did not make it into the resulting token, since the grammar expression is just a word of alphabetic characters.
Let’s dive into our problem. Each of the three namespaces has a header in brackets. We will define it as:
word = Word(alphas)
header = Suppress("[")+word.setResultsName("header")+Suppress("]")+LineEnd()
Of course, you'll need to import each new name from pyparsing (Suppress, LineEnd, and later some others). We defined a word grammar first because we will use it again later. Suppress tells the grammar not to include that expression in the results, preventing the clutter of brackets in the output. One nice thing here is the .setResultsName() method, which lets you reference a specific name in the resulting tokens.
We see that all definition lines are separated by "=", and on the left side is the definer, which is a simple word. The values on the right side, however, vary, and are one of the following: a word, a list of words, a number, or an IP address. Thus, we have the following grammars:
number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field+"."+ip_field+"."+ip_field+"."+ip_field)
definer = word.setResultsName("definer")
value = Or([number, word, list_of_words, ip]).setResultsName("value")
Here we can see how to build a list of items separated by something (ZeroOrMore) and how to combine tokens into one (Combine). The Word grammar object takes parameters that constrain its definition, like max in this example, but also exact, bodyChars, and min. And since the right side of the definition line varies, we use the Or expression builder; of course, there is also And().
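As a quick sanity check, Or tries every alternative and keeps the longest match, so an IP address is not cut short into a bare number. The snippet below just repeats the grammars from above so it can run on its own (assuming pyparsing is installed):

```python
from pyparsing import (Combine, Group, Or, Suppress, Word, ZeroOrMore,
                       alphas, nums)

# the same grammars as above, repeated so this snippet is self-contained
word = Word(alphas)
number = Word(nums)
list_of_words = Group(ZeroOrMore(word + Suppress(",")) + word)
ip_field = Word(nums, max=3)
ip = Combine(ip_field + "." + ip_field + "." + ip_field + "." + ip_field)
value = Or([number, word, list_of_words, ip]).setResultsName("value")

# Or picks the longest alternative: the whole IP, not just "127"
print(value.parseString("127.0.1.1"))  # ['127.0.1.1']
print(value.parseString("4"))          # ['4']
```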
Now we move on to finalizing our parser. We have all the elementary building blocks we need (header, definer, value), so we can build the more complex ones. This is how:
definition_line = Group(definer + Suppress("=") + value + LineEnd())
namespace = Group(header +
                  OneOrMore(definition_line).setResultsName("definition_lines"))
config = OneOrMore(namespace).setResultsName("namespaces")
Now, what have we done here? Let's review from top to bottom. The complete grammar, defined as config, consists of one or more (OneOrMore()) namespaces. Each namespace consists of a header and one or more definition lines. And each definition line consists of a definer and a value. We added some Group() clauses, as well as .setResultsName() on the parts we wanted to name in the result – and we are ready to parse! Get on with it!
result = config.parseString(content)
Huh? That's it?!? Yes. Our result is neatly placed in a nice tree structure we can traverse via the attributes we set with .setResultsName(). You can check it with:
for namespace in result.namespaces:
    print(namespace.header)
    for definition_line in namespace.definition_lines:
        print(definition_line.definer)
        print(definition_line.value)
Of course, I won't be kind enough to present you with the complete parser we just built here. Fetch the content from a file and do a proper (non-wildcard) import of all the building blocks from pyparsing.
What is so great about pyparsing? Well, we don't have to learn regular expressions. We get our results in a nice data structure. The library is pure Python and very easy to dive into. But here is its greatest asset: it is readable – you can come back to the code at any time and modify it with ease!
For starters…
Sometimes a simple Google search simply cannot give you the answers you are looking for. You might have to look up a book reference or browse through some blogs (the forums are not that useful anymore), and at times even the official documentation is not enough. I recently read a very thoughtful post which underlines (among a multitude of things) the importance of negative feedback. And, a few years ago, I learned from a very good physics lecturer that the best way to learn a topic is to write a book on it.
Well, I'm not going to write a book, but I can try to write a blog. Its intention is not to teach anyone about these topics (mostly Python and linguistic processing), but to let me learn along the way of writing it, and to learn from the negative feedback.
I recently started working on linguistic processing. I use pyparsing and nltk for it.
I like using werkzeug and jinja2 for a web interface. I use sqlalchemy to talk to a database. Goodbye sql.
A friend of mine with quite some experience in this field recommended django. I hesitated, but decided to go with the loosely coupled option, since I like jinja2 better than django templates and sqlalchemy more than the django ORM. The convenience of the rest of django is mostly unnecessary.
That’s it for now, stay tuned if any of the attached tags interests you.