Kurt Jarchow's Blog

April 20, 2009

SOLR and Multi-Languages

Filed under: Uncategorized — Kurt Jarchow @ 9:09 pm

One of our sites requires Japanese search capabilities, which looked easy on paper, but after playing around for the better part of a day I realized it wasn’t. It doesn’t take much to get it going, but sifting through mailing lists is pretty much the only way to get the details.

If you search using Japanese characters you’ll notice that Solr finds results out of the box. The problem is that it uses English methods of search. The main issue this creates is that it assumes spaces separate words, which is not true in Japanese. A “tokenizer” creates the words for Lucene to use, so we need a tokenizer that works with Japanese. The CJK Tokenizer does the trick.

Solr will not handle multiple languages in a single field, so we need to create a new field for each language. First, though, you need to create the field type in schema.xml.
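A field type along these lines should work. Treat it as a sketch rather than the exact definition: the type name text_cjk is made up for illustration, and it assumes your Solr build ships solr.CJKTokenizerFactory.

```xml
<!-- schema.xml, inside <types>. "text_cjk" is a hypothetical name. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: solr.CJKTokenizerFactory is available in your build.
         It splits CJK text into overlapping two-character tokens. -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With no type attribute on the analyzer element, the same tokenizer is applied at both index time and query time, which is what you want here so that queries are split the same way as the stored text.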

Now define the field that uses that type, one per language (body_ja in my case).
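Something like the following would do it. The field name body_ja matches the one used in the query string in this post; the type name text_cjk and the indexed/stored settings are assumptions for illustration.

```xml
<!-- schema.xml, inside <fields>. -->
<!-- "text_cjk" is a hypothetical field type name; indexed/stored are assumed. -->
<field name="body_ja" type="text_cjk" indexed="true" stored="true"/>
```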

For whatever reason, I could not get this going without explicitly identifying which field the search should use, so I added “qf=body_ja” to my query string. If you see other syntax for defining the field type, don’t use it (at least with Solr 1.3-1.4). It seems to break the words up correctly, but you won’t be able to search.
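If you would rather not pass qf on every request, one option is to set it as a default on a request handler in solrconfig.xml. This is only a sketch: the handler name is hypothetical, and it assumes you are using the dismax query parser, which is the one that reads qf.

```xml
<!-- solrconfig.xml. "/search_ja" is a hypothetical handler name. -->
<requestHandler name="/search_ja" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Assumption: dismax is the parser in use; qf is a dismax parameter. -->
    <str name="defType">dismax</str>
    <!-- Search the Japanese field unless the request overrides qf. -->
    <str name="qf">body_ja</str>
  </lst>
</requestHandler>
```

Values under defaults only apply when the request does not supply the parameter itself, so an explicit qf in the query string still wins.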

This tokenizer is not 100% accurate, however. It breaks the characters into overlapping pairs (for example, 日本語 is indexed as 日本 and 本語) and Solr does its best to find matches. I’m not sure how much it costs, but you can find a better tokenizer at basistech.com.

After you adjust your front end to save into the new fields you should be off to the races.

1 Comment »

  1. Re Basis: I know for how much, but let me just say this: for a LOT (a LOT!) more.

    This may be of interest to a lot of Solr users who need multi-language indexing and searching:
    http://www.sematext.com/product-multilingual-analyzer.html

    Comment by Otis Gospodnetic — April 20, 2009 @ 10:13 pm
