home 首页 » 精彩博客 » 博客查看

Chinese Xapian search and indexing - Alan Knowles

 2008-03-09 19:40:19 

原文地址:http://www.akbkhome.com/blog.php/View/158/Chinese_Xapian_search_and_indexing.html

its funny how you can often end up solving pretty much the same problem twice, dejavu for coding. Last weeks challenge was a free text search engine for email archives including support for chinese.

About 7 years ago, I remember hacking on mnogosearch to solve a pretty similar problem. After some research this time, I settled on Xapian, some of the reasons included,
- utf8 internal support
- nice bindings for PHP & c (via gcc/C++)
- a working set of command line tools (omindex & quest)
- database independant ~ no mysql dependancies
And the test that always makes the deal is that after apt-get'tting the package, it just worked! Creating a working store and runing queries is quite simple

The only trouble was that although it says it supports utf8, actual support for chinese is a bit more complex.

Unlike western langages, each character needs to be treated like a word. Ideally Xapian would realize this, however without wanting to hack the C++ code, I decided it would be quicker to create files based on the original email and pad the chinese characters with spaces. This means, in the short term, I can use omindex, rather than, binding the D code directly to xapian API.

the way this is done in the D index builder is
- parse the email with callbacks for each mime part. I have ported the mime code from binc imap for this, and called it dinc (silly name for the week)
- if the part is text, or html, convert it to utf8 using iconv
- stream read each line of the body (using memory streams)
- convert each line to utf32/dchars
- loop through the dchars and see if the are chinese, japanese or korean (see this for the simple check for cjk characters). pad the ouput line with spaces when found
- convert the resulting utf32 array to utf8/char array, and write it to the output stream (file stream),
all this code should be quite simple to extend when i get round to the d direct access to xapian api. It should also be pretty memmory efficient, and fast when i rewrite the iconv code to work like a stream filter...


On the other end of this was making the extjs/php5 front end do the searching. Again, the end user would be expected to search in chinese as a series of characters, eg. type 'XXX' rather than type 'X X X' so in php. I needed to convert the search string into utf32, and compare each block of 4 characters against the previous list of cjk charcters, padding with spaces. then converting back to utf8 prior to sending the query. This is all pretty simple using iconv or mb_string.

All in all, not that difficult to do, however actually finding/working how to do this was quite a challenge.

收藏:

评论:共 
272
 条