Using the MySQL Fulltext Search Engine with Data Driven Web Applications – Part I

Part I – The FULLTEXT index and BOOLEAN search language

Searching is one of the most fundamental and common features of the web. We are all used to having powerful search engines such as Google at our fingertips, both for searching the glut of information available on the Internet at large, as well as for indexing our own web sites to provide a search mechanism to our own users.


As long as your site is reasonably static and publicly browsable, engines like Google do a good job of indexing your data. But what do you do when your application becomes too complex or dynamic for it to be properly spidered? In that case, you need your own search engine, one that allows you control over what is being indexed as things change on your site, as well as control over what results are being returned. There are many such engines out there: commercial packages such as Verity, open source packages such as Lucene, as well as integrated engines such as the fulltext index and search features included in MySQL. In this article, I will describe how to use MySQL’s features for adding robust fulltext search to dynamic web applications that otherwise would be difficult to index.

In the past, the only way to do text searching in a database was to use the LIKE comparison in the WHERE clause. This was very limited and usually exceptionally inefficient. The good news is that MySQL has an index type called FULLTEXT that you can create on character data type fields (varchar, text, longtext, etc.). Instead of storing data to help SQL queries efficiently seek and sort the data within the table, it analysis the human language text (words) in the field indexed, so that data can be searched using their fulltext search matching language.

Some simple examples:

First we create a simple table that stores a product id and a product description. We create a fulltext index on the description field, indicating that we want to be able to use the fulltext search language to query data inserted in that field. It should be noted that MySQL’s fulltext indexes only work with the MYISAM database engine.

     create table product_descriptions (
          id int primary key,
          description text,
          FULLTEXT (description)
     ) engine = myisam;

Next we insert some test data into the table. As each row is inserted, the description field is indexed for fulltext searching.

     insert into product_descriptions
          values
               (1,'Peanut butter cookies'),
               (2, 'Chocolate chip cookies'),
               (3, 'Chocolate croissants'),
               (4, 'Peanut Butter Cups'),
               (5, 'Spinach Macaroons'),
               (6, 'Swedish butter cookies');

Now that we have a product description table with indexed data, lets do some example searches!

Searching for “cookies” gives you the 3 rows with the word cookie in them.

     select id, description
     from product_descriptions
     where MATCH(description)
          AGAINST ('cookies' IN BOOLEAN MODE);
id description
1 Peanut butter cookies
2 Chocolate chip cookies
6 Swedish butter cookies

Searching for “cookies butter” gives you and OR search, returning all rows with a reference to cookies OR butter.

     select id, description
     from product_descriptions
     where MATCH(description)
          AGAINST ('cookies butter' IN BOOLEAN MODE);
id description
1 Peanut butter cookies
2 Chocolate chip cookies
4 Peanut butter cup
6 Swedish butter cookies

By adding a “+” character before a word in a search, we indicate that the results MUST include that word. By putting a + before cookies and butter, we turn this into an AND search, in that all results must include both “cookies” and “butter”

     select id, description
     from product_descriptions
     where MATCH(description)
          AGAINST ('+cookies +butter' IN BOOLEAN MODE);
id description
1 Peanut butter cookies
6 Swedish butter cookies

By adding a trailing “*” we can do partial matches on words or even single letters. The following query will return every row that has a word starting with “c”

     select id, description
     from product_descriptions
     where MATCH(description)
          AGAINST ('c*' IN BOOLEAN MODE);
id description
1 Peanut butter cookies
2 Chocolate chip cookies
3 Chocolate croissants
4 Peanut butter Cups
6 Swedish butter cookies

That’s a quick overview of the basics of the MySQL Boolean query language. A more detailed description of the search language, as well as more advance concepts such as query expansion, can be found here at the MySQL website.

A Simple Search Method

So now that we have our product table, fulltext index, and a little familiarity with the Boolean search language, lets create a method for taking a search query from a html form and returning a result set you can display to your users! Lets assume you have a simple search form on your web application; nothing but a text input area and a submit button.

First lets define some pseudo code for a simple search method:

function search(String $search_string) {
     String db_query = "select id, description
          from product_descriptions
          where MATCH(description)
          AGAINST ('"+$search_string+"' IN BOOLEAN MODE)";
     Array results = execute_database_query(db_query);
     return results;
}

What this method does is pretty simple, it just inserts the raw search text from the web form directly into the AGAINST(’ ‘) block of the db query and executes it, returning the result rows, which could be displayed on a search results page. It doesn’t have any logic for handling AND or OR or partials, nor does it do any filtering to make sure the user hasn’t entered characters or keywords that will be interpreted by the Boolean search engine (+,-,*, etc). Also, NEVER EVER PASS DATA DIRECTLY FROM A FORM TO A DATABASE QUERY!!! It leaves you open to something known as a SQL injection attack which allows malicious users to run queries against your db. Always validate and escape form data to ensure this doesn’t happen.

Lets make the method a little smarter and safer. Most search engines do AND searches, in that if you enter two words in the search, you expect those words to both be in your results. They also do leading matched for the words entered. To do this, we will need to do some prepossessing of the search string, before passing it to the search query.

function search(String $search_string) {
     //strip Boolean search characters out of search string
     $search_string = string_replace($search_string,"+","");
     $search_string = string_replace($search_string,"-","");
     $search_string = string_replace($search_string,"*","");
     //split the search string up into an array of words
     Array $tokenized_search = split($search_string, " ");
     //init an empty final search string
     String $processed_search = "";
     //for each word in the search, wrap it
     //with an + and * character and then append
     //it to the processed_search variable
     foreach($tokenized_search as $token) {
          $processed_search+="+"+$token+"* ";
     }
     //build the sql for the query and query the DB
     String $db_query = "select id, description
          from product_descriptions
          where MATCH(description)
          AGAINST ('"+$processed_search+"' IN BOOLEAN MODE)";
     Array results = execute_database_query($db_query);
     return results;
}

For the search “Peanut Cookie” we would reprocess the search text to be “+Peanut* +Cookie*” before passing it off to the search engine. This means we would get all rows that contain the leading string “Peanut” as well as “Cookie”. Rows with the text “peanuts” or “peanutbutter” or “cookies” would also match, because of the trailing wildcard we added to each word.

Stopwords and Minimum Word Length

MySQL’s search engine also has concepts called stopwords and minimum word length. Stopwords are common words that are intentionally left un-indexed and are unsearchable, such as “a, and, for, the” and so on. The intention of these is to prevent lots of unnecessary data from getting indexed and muddying up result sets, but there are some side effects that need to be taken into account as well. For example, the word “down” is in the default stopwords list. if you do a search for “+down +stairs” it will always fail, even if your searching column has a row with the text “down stairs” indexed. This is because mysql removes “down” from the words it indexes, but it is not smart enough to remove “down” from the query it is passed, meaning that any query with a stopword with a + in front of it will always fail. I have gotten around this by creating a lookup function of the my stopwords list, and checking each token against it as I build my processed_search string. This prevents impossible queries from being executed for otherwise reasonable searches. It is also sometimes useful to reduce the size of your stopwords list, depending on your application. I find the full list is useful for applications that are indexing articles or large blocks of written text, while a much smaller list is useful for things like searchable user profiles.

Minumum word length is a mysql configuration setting that, like stopwords, determins whether a word can be indexed or not. The default setting is four letters, which means any word (say ‘foo’) under four characters will not be indexed. This can occasionally cause problems in much the same ways as stopwords do. If you do a required search for Foo (+foo*) it will fail because foo could never appear in the index. This is usually fine for most applications, but in cases where you index lots of acronyms, or if you are indexing user profiles where people actually have 2 or 3 letter names (”Mel” “Al” “Tim”, etc) you will need to reduce your min word length. Instructions for doing so can be found here.

Next Time

In my next post, i’ll go into detail about building more robust features around these basic search ideas. Paging, sorting, filtering, searching multiple types of data with abstracted search tables, InnoDB, and more!

by John Wolthius
categories: Database
May 10th, 2007

No Comments »

No comments yet.

RSS feed for comments on this post. TrackBack URL

Leave a comment

You must be logged in to post a comment.