Saturday, 12 September 2009

Using the Sphinx Search Engine with MySQL

Using the Sphinx Search Engine with MySQL: "

MySQL Full Text Search Limitations


Suppose you have a MyISAM table containing a column with a full text index. This table starts to grow to a significant size (millions of rows) and gets updated fairly frequently. Chances are that you’ll start to see some bottlenecks when accessing this table, since without row level locking, the reading and writing operations will be blocking each other.


A solution that many people would suggest right away is to use the master for writes and a slave for reads, but this only masks the problem, and it won’t take long before enough read traffic on the slave starts causing slave lags.


Why Sphinx?



The main difference between the Sphinx search engine and other alternatives is its close integration with MySQL. For example, it can be used as a storage engine. In this way, Sphinx’s impact on existing application code can be minimized, opening the door for its more advanced features in future releases.


Initial Considerations


Need to compile MySQL from source


I struggled with this one for a while, until I realized that the storage engine plugins only work if the full version number of the MySQL server matches the full version number of the MySQL libraries used in the compilation; and if compiler options for the server, libraries, and plugin also match.


I couldn’t find any other reliable way of getting everything to work together than grabbing a MySQL source tarball and compiling it with Sphinx SE. The process is straightforward if you follow the procedures in the MySQL’s manual: 2.10.1. Source Installation Overview.


Use MySQL 5.1.xx


Although it is possible to use Sphinx with 5.0.xx, I like the cleaner plug-in storage interface of 5.1.xx.


The table to be indexed needs a unique INT column


The table for which you are going to create a full text index has to have an INT column declared as UNIQUE KEY. This value is the one to which Sphinx has to point the rows containing the text in its indexes.


Putting everything together


Both Sphinx and MySQL have very clear and straightforward instructions on how to compile from the source code tarballs, so just follow the steps. Although I had to run several options until I got all the right pieces in place, compiling either tool was never an issue.


The instructions for installing Sphinx can be found in its online documentation 2.3. Installing Sphinx on Linux; and to compile into MySQL 5.1 in 7.2.2. Compiling MySQL 5.1.x with SphinxSE.


When Sphinx is installed, it has two different sample configuration files. One is very comprehensive and the other is clean, simple, functional, as billed in its header. Use the latter one to get started.


Example run


As always, I use the Amarok player MySQL database as an example and in this case, I used the lyrics table.


In the sphinx.conf file I used the following line to index the table:


 sql_query   = \
SELECT lid, url, lyrics FROM lyrics

Notice that with this configuration line, I’ll be indexing both the song title (the file name, really) and the lyrics text.


After I added an INT AUTO_INCREMENT PRIMARY KEY column to the lyrics table and converted it into InnoDB, the resulting table was:


show create table lyrics\G
*************************** 1. row ***************************
Table: lyrics
Create Table: CREATE TABLE `lyrics` (
`lid` int(11) NOT NULL AUTO_INCREMENT,
`url` varbinary(255) DEFAULT NULL,
`deviceid` int(11) DEFAULT NULL,
`lyrics` text,
`uniqueid` varbinary(32) DEFAULT NULL,
PRIMARY KEY (`lid`),
UNIQUE KEY `lyrics_url` (`url`,`deviceid`),
KEY `lyrics_uniqueid` (`uniqueid`)
) ENGINE=InnoDB AUTO_INCREMENT=110 DEFAULT CHARSET=latin1

Once the table was in place, I ran Sphinx’s indexer utility to create the full text indexes.


Then I added the SPHINX table as per the documentation:


show create table sphinx_search\G
*************************** 1. row ***************************
Table: sphinx_search
Create Table: CREATE TABLE `sphinx_search` (
`id` int(11) NOT NULL,
`weight` int(11) NOT NULL,
`query` varchar(3072) NOT NULL,
KEY `query` (`query`)
) ENGINE=SPHINX DEFAULT CHARSET=latin1 CONNECTION='sphinx://localhost:3312/test1'

A difference from the examples in Sphinx’s documentation is that I left out any additional attributes. And finally, I created a view to hide as many implementation details as possible


CREATE VIEW `vlyrics` AS select `ly`.`lid` AS `vlid`,
`ly`.`url` AS `vurl`,
`ly`.`deviceid` AS `vdevicedid`,
`ly`.`lyrics` AS `vlyrics`,
`ly`.`uniqueid` AS `vuniqueid`,
`ss`.`query` AS `vquery`
FROM (`sphinx_search` `ss` join `lyrics` `ly`) WHERE (`ss`.`id` = `ly`.`lid`);

Last but not least, I found out how many songs have the word “death” in their lyrics (I was going to use “love”, but it sounded too lame):


select vlid, vurl from vlyrics where vquery = 'death';
+------+------------------------------------------------------------------------------------------------------------+
| vlid | vurl |
+------+------------------------------------------------------------------------------------------------------------+
| 60 | ./opt/musica/iTunes Music/Genesis/The Lamb Lies Down on Broadway Disc 2/03 Anyway.wma |
| 66 | ./opt/musica/iTunes Music/Compilations/Evita_ Premiere American Recording (Disc/1-04 Buenos Aires.mp3 |
| 88 | ./opt/musica/iTunes Music/Lake_&_Palmer_Emerson/Works__Vol._1/06_-_Closer_To_Believing_(Album_Version).mp3 |
+------+------------------------------------------------------------------------------------------------------------+
3 rows in set (0.04 sec)

Notice that in this case the the character “=” doesn’t mean “equals”, but “contains”. If I had renamed the original table and used its name for the view, this would have been the only change as opposed to using “like”:


select lid, url from lyrics where lyrics like '%death%';
+-----+------------------------------------------------------------------------------------------------------------+
| lid | url |
+-----+------------------------------------------------------------------------------------------------------------+
| 60 | ./opt/musica/iTunes Music/Genesis/The Lamb Lies Down on Broadway Disc 2/03 Anyway.wma |
| 66 | ./opt/musica/iTunes Music/Compilations/Evita_ Premiere American Recording (Disc/1-04 Buenos Aires.mp3 |
| 88 | ./opt/musica/iTunes Music/Lake_&_Palmer_Emerson/Works__Vol._1/06_-_Closer_To_Believing_(Album_Version).mp3 |
+-----+------------------------------------------------------------------------------------------------------------+
3 rows in set (0.02 sec)

Note: disregard the times for each query—the table has only about 100 rows.


Conclusion


Implementing SphinxSE into MySQL proved to be easier than it seemed in the beginning, although it took some time to compile and install everything. With a creative use of views, it could potentially be implemented right away in legacy applications, offering numerous advantages.


Sphinx offers many additional features that I do not consider here. I recommend looking into the documentation to implement the more sophisticated indexing mechanisms and search methods.


This evaluation was done with Sphinx v0.9.8, which is labeled as stable since it was done for a customer project. Version 0.9.9, the current release candidate, has an new feature: it supports the MySQL communications protocol. This makes it possible to query text search engine directly using MySQL’s client libraries, which opens a new set of interesting possibilities.


PlanetMySQL Voting:
Vote UP /
Vote DOWN"

No comments:

Sike's shared items