
Add multilanguage support #13

Open
mattico opened this issue Mar 19, 2018 · 8 comments

Comments

@mattico
Owner

mattico commented Mar 19, 2018

https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js

@mattico mattico self-assigned this Mar 19, 2018
@Keats
Contributor

Keats commented Mar 20, 2018

Are you planning to add all the languages from that repo?

@mattico
Owner Author

mattico commented Mar 20, 2018

No, just the ones that I can easily find stemmers for. Right now that's the languages supported by https://github.com/CurrySoftware/rust-stemmers minus Hungarian since that one didn't match lunr-languages' output. There are a few more languages that could be added pretty easily by running the snowball compiler, but I don't think I'll go through the effort unless someone actually wants them.

@xoac

xoac commented Sep 10, 2020

So if I would like to add support for the Polish language, I need to add it to Snowball first?

@mattico
Owner Author

mattico commented Sep 11, 2020

You need a Rust implementation and a JavaScript implementation that produce compatible output. The Snowball compiler is one way to generate both implementations, but you could port an algorithm manually as well.

@mexus

mexus commented Oct 20, 2020

Hi @mattico! I'd like to help make the multilanguage integration happen. Could you please provide some guidance?

@mattico
Owner Author

mattico commented Oct 22, 2020

First, to be clear: multi-language means a search index that supports content written in multiple languages, i.e. a single document which contains multiple languages. We already support searching many languages individually.

Second, the main constraint on the implementation is compatibility with the JavaScript implementation, so the starting point for any addition should be understanding how the JavaScript implementation works and converting it. The readme of elasticlunr.js says it can use https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js. Tests should be added which generate an index using the JavaScript implementation and compare it to an index generated using the Rust implementation.

More specifically it looks like lunr.multi.js takes a bunch of language pipelines as arguments and combines them together into one. Language pipelines have a few distinct parts which are run sequentially:

  1. A tokenizer, which splits the input text into words at whitespace characters. Supporting some languages properly is more difficult: e.g. Chinese doesn't generally use spaces to delineate words per se, so it needs a segmentation algorithm, which could not be combined in this way and would need to be run sequentially. We are limited, though, by staying compatible with the JavaScript implementation. If we want to do things properly we could ship our own modified JavaScript plugin for people to use.
  2. A trimmer, which removes invalid characters from the beginning and end of words. You can see that English just uses the regex \w. lunr.multi.js simply concatenates all the languages' valid characters into one string and trims using the union of those characters.
  3. A stop word filter, which removes words that make search results worse (https://en.wikipedia.org/wiki/Stop_word). Again, these can be combined into one large stop word filter, just a HashSet in our case.
  4. A stemmer (https://en.wikipedia.org/wiki/Stemming), which reduces words to their basic form by removing prefixes, suffixes, etc. These can be very different code for each language, so the only option is to run each stemmer sequentially on each input word.

These all get combined into a pipeline, which is just a list of functions, each run sequentially on each input token to produce the output token. The MultiLanguage language takes a number of languages as arguments and combines them into one pipeline as above.

@mexus

mexus commented Oct 22, 2020

Thanks a lot! Everything seems to be clear :)

@vitvakatu

Thank you for such a thorough answer, @mattico!

I've managed to implement support for the Russian and English languages. Unfortunately, I neither made a universal solution for all possible combinations of languages nor covered it with tests.

I hope I will find some spare time in the near future to implement universal support properly and send a PR.

Btw, I've also encountered a weird issue with IndexBuilder: for some reason, using IndexBuilder instead of Index::new gave me different results, despite identical parameters. I can't say whether it is an issue with IndexBuilder itself or with our overall setup. The issue was fixed by replacing BTreeSet with Vec in IndexBuilder, so perhaps the order of fields affects the generated index. I'll create an issue if my further investigation shows something.
