Archive

Posts Tagged ‘sphinxsearch’

Try to use the relational database to its full potential

24 September 2010 Leave a comment

or:

How the search engines did not do for me what an RDBMS did.

It will be a quick post however.

For my latest project I have highly relational schema for a database on a PostgreSQL server. The search API steadily grew more complicated and I finally had the chance to install and utilize the Apache Solr search engine. To simplify matters for this post, let’s say that for each document I had a short title, a category, two full-text fields, and two different sets of tags. I had a pretty good idea how to set up the solr, its data-import-handler, how the delta import will work, how the schema will be designed, even a faceting strategy for my web interface. I did a lot of work for some of this, and then I got stuck.

You can not query the solr fields with: *text*.

This was a fail. It burned my time. I understand why the solr people didn’t implement such thing. As a user – I would not want to search for part of word inside words, however, this was not only something the client insisted on, the nature of the application demanded such lookups.  I had to have *text* queries.

A colleague of mine heavily insists on sphinxsearch. I am aware it has good performances, and you can also work-out different sorts of matching, *text* included. But very soon I gave up – it had a very bad support for delta imports.

Then I decided to use a brute force. I created an algorithm that put the multivalues field in one field and a method for querying it, and then put all the documents to be indexed in one big postgresql table, one row per document. Made a cron job for delta-updates on documents, and switch the search API from the complex joins to simple query of the big table. Of course, I made a whole bunch of different simple and composite indices on the big table for this occasion after the initial population.

One can not expect the same speed of querying as with solr, but the speed is satisfactory. And, most importantly, I retained the possibility to query what I like, and update in optimized manner. Finally, I suppose the point of this task was that you should not rush things and use more technologies in your product, just try to use tools you already have, in this case your RDBMS. Sometimes even the most complicated problems have very simple solutions.