Index created by elasticlunr-rs doesn't work with elasticlunr.js for characters that can't be represented by a single UTF-16 Code Unit #53

Sunshine40 · 2024-06-05T08:25:43Z

Lines 40 to 42 in 29d97e4

    fn add_token(&mut self, doc_ref: &str, token: &str, term_freq: f64) {
        let mut iter = token.chars();
        if let Some(character) = iter.next() {

During index building, elasticlunr-rs walks the token (a &str) one Unicode scalar value at a time, so each character becomes a single key no matter how it is encoded.
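To make the Rust side concrete, here is a minimal sketch of the keys a `chars()`-based walk produces. `scalar_keys` is a hypothetical helper written for illustration; it is not part of elasticlunr-rs.

```rust
/// Hypothetical helper (not an elasticlunr-rs API): returns one key per
/// Unicode scalar value, mirroring what a `token.chars()` walk sees.
fn scalar_keys(token: &str) -> Vec<String> {
    token.chars().map(|c| c.to_string()).collect()
}
```

For the token "a😀" this yields two keys, "a" and "😀", because `chars()` treats the emoji as one scalar value.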

The JS library, by contrast, walks the token like this:

elasticlunr.InvertedIndex.prototype.addToken = function (token, tokenInfo, root) {
  var root = root || this.root,
      idx = 0;

  while (idx <= token.length - 1) {
    var key = token[idx];

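Here `token[idx]` indexes the JS string by UTF-16 code unit. One code unit is a whole character for English, most alphabetic scripts, and common Chinese characters, but emoji and rare Chinese characters lie outside the Basic Multilingual Plane and take two code units (a surrogate pair) each. For such a token the two libraries therefore split the same text into different keys.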
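The mismatch can be reproduced from Rust alone. The sketch below uses only standard library calls (`str::chars` and `str::encode_utf16`). The helper names `utf16_keys` and `walks_agree` are made up for this illustration, and the code shows the discrepancy rather than proposing a fix.

```rust
/// Hypothetical helper: the sequence of keys that `token[idx]` would yield
/// in JavaScript, expressed as raw UTF-16 code units.
fn utf16_keys(token: &str) -> Vec<u16> {
    token.encode_utf16().collect()
}

/// Hypothetical helper: true when a scalar-value walk (Rust `chars()`) and a
/// code-unit walk (JS indexing) descend the same number of levels.
fn walks_agree(token: &str) -> bool {
    token.chars().count() == token.encode_utf16().count()
}
```

`walks_agree("中文")` holds because both characters are in the Basic Multilingual Plane, while `walks_agree("😀")` and `walks_agree("𠮷")` do not: each of those is one scalar value but two code units. Note that a lone surrogate cannot be stored in a Rust `String`, which is why the keys here are `u16` values rather than strings.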

Related issue with mdBook.


mattico added the bug and help wanted labels on Jun 15, 2024
