Hebrew ‘Did You Mean’ algorithm

Preface

Testing other Did You Mean engines shows that most of them are based on plain dictionary corrections. I call it spell checker rather than a true did you mean algorithm. The reason is that such ‘dictionary based‘ suggestions may be suitable for general text collections not dealing with specific field or domain, but one might expect finding a closer match when it comes to domain specific dictionary or vocabulary.

For instance, let’s take a look at what Google offers when the term “site:www.whitehouse.gov jefferzon” is being sought. One would expect Google to imply that the user asked for former president Thomas Jefferson. After all, the search was limited to the white house web site, both terms only differ in 1 letter [s-->z], and share same soundex key. However, Google returns no results for this term – and no suggestion too.

Such results are unacceptable when it comes to CRM systems, where a representative is trying to locate  a customer. Needless to say, the SQL built in SOUNDEX() function turns useless when it comes to full text searches.

This led me to develop the algorithm I use in our product.

Goal

My goal was to reach a ‘did you mean’ algorithm that will satisfy our customers and will not require any SQL server installation.

The�basic requirements were:

  • Hebrew support
  • Per field best match
  • Preferably independent of SQL server/Oracle

and of course, all of the regular requirements (high accuracy level, low cost implementation and high speed calculations)

How it works

  • The first step is to generate a wordlist and an accompanying hit vector. This allows us to identify which term appears in which field and how frequent it is. For instance, the name frigate might be common in a naval website wordlist, whereas the term Freegate will appear more often in software listings.
  • Keyboard typos are tested. If no matching word found then:
  • The next step takes hints regarding what algorithm flavor to use:
    • If the hint stated that a phonetic match should be tested, then terms are tested for phonetic match. This works best for names (people, products, addresses, companies, stocks, etc.)
    • Plain dictionary suggestions are tested for partial matching (bi-grams, tri-grams,…n-grams). Best matches are offered after weighting the distance along with each terms frequency per the specific field.
  • Additional tests may be applied, as customer might require that tests should also be performed against previous searches, for example.

 

It is most important to remember that all suggestions are narrowed by additional search hints. If user asked for a specific search in street name where zip code is like ‘97*’, the engine will never offer any company names or even street names from other zip codes.

 

Current status

My current version of the Hebrew ‘Did You Mean’ algorithm comes in 3 flavors:

  • Fuzzy-relevance match -  Plain wordlist comparison using techniques described below.
  • Phonic match - Where names like ‘Lipson’ and ‘Leibzon’ mean the same
  • Keyboard typos – converting qwerty layout to it’s Hebrew equivalent (akuo = Shalom [hello/peace in Hebrew]).

As of September ‘09 this algorithm is demonstrated to our customers and is being fine tuned on its way to its first production environment.

 

To do / Wish list

  • Eliminate noise words from the wordlist - Digits, dates, etc. – Done!
  • Support composite terms (2+ search terms) more natively – Done!
  • Refine Hebrew phonetic algorithm – In progress…
  • Speed up phonetic rebuild process. - Done!

 

Click here for updates on the new multilingual did-you-mean algorithm