
Index created by elasticlunr-rs doesn't work with elasticlunr.js for characters that can't be represented by a single UTF-16 Code Unit #53

Open
Sunshine40 opened this issue Jun 5, 2024 · 0 comments


fn add_token(&mut self, doc_ref: &str, token: &str, term_freq: f64) {
    let mut iter = token.chars();
    if let Some(character) = iter.next() {

During index building, elasticlunr-rs iterates over the token's &str content by Unicode scalar value (token.chars()).

The JS library, by contrast, walks the token like this:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

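The JS string is actually iterated in UTF-16 code units. One code unit is a whole character for English, most alphabetic text, and common Chinese characters, but emoji and rare Chinese characters lie outside the Basic Multilingual Plane and take two code units (a surrogate pair). For those characters the Rust side stores one key per character while the JS side looks up two keys, so an index built by elasticlunr-rs cannot be matched by elasticlunr.js.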
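The mismatch can be reproduced with a small standalone sketch. The helper names `scalar_keys` and `utf16_keys` are hypothetical and are not part of either library; they only mimic how each side splits a token into per-level keys, assuming the loops quoted above.

```rust
/// Keys as the Rust loop above derives them: one per Unicode scalar value.
/// (Hypothetical helper, for illustration only.)
fn scalar_keys(token: &str) -> Vec<String> {
    token.chars().map(|c| c.to_string()).collect()
}

/// Keys as a JS `token[idx]` loop sees them: one per UTF-16 code unit.
/// A lone surrogate is not a valid Rust `char`, so each unit stays a raw `u16`.
/// (Hypothetical helper, for illustration only.)
fn utf16_keys(token: &str) -> Vec<u16> {
    token.encode_utf16().collect()
}
```

For a common character such as "中", both helpers return a single key. For "😀" (U+1F600), `scalar_keys` returns one key while `utf16_keys` returns two (the surrogates 0xD83D and 0xDE00), which is the divergence described above. A fix would presumably need both libraries to agree on one unit of iteration; which side should change is not settled here.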


Related issue with mdBook.
