
Add multilanguage support #13

Open
mattico opened this issue Mar 19, 2018 · 8 comments

Comments

@mattico
Owner

mattico commented Mar 19, 2018

https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js

@mattico mattico self-assigned this Mar 19, 2018
@Keats
Contributor

Keats commented Mar 20, 2018

Are you planning to add all the languages from that repo?

@mattico
Owner Author

mattico commented Mar 20, 2018

No, just the ones that I can easily find stemmers for. Right now that's the languages supported by https://github.com/CurrySoftware/rust-stemmers minus Hungarian since that one didn't match lunr-languages' output. There are a few more languages that could be added pretty easily by running the snowball compiler, but I don't think I'll go through the effort unless someone actually wants them.

@xoac

xoac commented Sep 10, 2020

So if I would like to add support for the Polish language, I need to add it to Snowball first?

@mattico
Owner Author

mattico commented Sep 11, 2020

You need a Rust implementation and a JavaScript implementation that produce compatible output. The Snowball compiler is one way to generate both implementations, but you could port an algorithm manually as well.

@mexus

mexus commented Oct 20, 2020

Hi @mattico! I'd like to help make the multilanguage integration happen. Could you please provide some guidance?

@mattico
Owner Author

mattico commented Oct 22, 2020

First, to be clear: multi-language means a search index that supports content written in multiple languages, i.e. a single document which contains multiple languages. We already support searching many languages individually.

Second, the main constraint on the implementation is compatibility with the JavaScript implementation, so the starting point for any addition should be understanding how the JavaScript implementation works and converting it. The readme of elasticlunr.js says it can use https://github.com/MihaiValentin/lunr-languages/blob/master/lunr.multi.js. Tests should be added which generate an index using the JavaScript implementation and compare it to an index generated using the Rust implementation.

More specifically it looks like lunr.multi.js takes a bunch of language pipelines as arguments and combines them together into one. Language pipelines have a few distinct parts which are run sequentially:

  1. A tokenizer, which splits the input text into words at whitespace characters. Supporting some languages properly is more difficult: e.g. Chinese doesn't generally use spaces to delineate words per se, so it needs a segmentation algorithm, which could not be combined in this way and would need to be run sequentially. We are limited, though, by staying compatible with the JavaScript implementation. If we want to do things properly we could ship our own modified JavaScript plugin for people to use.
  2. A trimmer, which removes invalid characters from the beginning and end of words. You can see that English just uses the regex \w. lunr.multi.js simply concatenates all the languages' valid characters into one string and trims using the union of those characters.
  3. A stop word filter, which removes words that make search results worse (https://en.wikipedia.org/wiki/Stop_word). Again, these can be combined into one large stop word filter, just a HashSet in our case.
  4. A stemmer (https://en.wikipedia.org/wiki/Stemming), which reduces words to their basic form by removing prefixes, suffixes, etc. These can be very different code for each language, so the only option is to run each stemmer sequentially on each input word.

These all get combined into a pipeline, which is just a list of functions, each run sequentially on each input token to produce the output token. The MultiLanguage language takes a number of languages as arguments and combines them into one pipeline as above.

@mexus

mexus commented Oct 22, 2020

Thanks a lot! Everything seems to be clear :)

@vitvakatu

Thank you for such a thorough answer, @mattico!

I've managed to implement support for the Russian and English languages. Unfortunately, I neither made a universal solution for all possible combinations of languages nor covered it with tests.

I hope I will find some spare time in the near future to implement universal support properly and send a PR.

Btw, I've also encountered a weird issue with IndexBuilder: for some reason, using IndexBuilder instead of Index::new gave me different results, despite identical parameters. I can't say whether it is an issue with IndexBuilder itself or with our overall setup. The issue was fixed by replacing BTreeSet with Vec in IndexBuilder, so perhaps the order of fields affects the generated index. I'll create an issue if my further investigation shows something.
