elasticsearch highlight lose character when index_options=offsets #60168

Open
tenlee2012 opened this issue Jul 24, 2020 · 9 comments · May be fixed by #119301
Labels
>bug, good first issue (low hanging fruit), priority:normal (a label for assessing bug priority, used by ES engineers), :Search Relevance/Highlighting (how a query matched a document), Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch)

Comments

@tenlee2012

Elasticsearch version (bin/elasticsearch --version): 7.7.1

Plugins installed: []

JVM version (java -version): openjdk version "11.0.7" 2020-04-14 LTS

OS version (uname -a if on a Unix-like system): GNU/Linux

Description of the problem including expected versus actual behavior:
The field is mapped with index_options=offsets.
The document title is "can work?shit", but the highlight returned by the search is "<em>shit</em>" instead of "can work?<em>shit</em>".
If the search matches the last word, and the last word is immediately preceded by a punctuation symbol (here "?"), the highlighting is wrong when the field is mapped with index_options=offsets.

Steps to reproduce:

  1. Create the index
PUT demo_1
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "index_options": "offsets"
      }
    }
  }
}
  2. Index a document
PUT demo_1/_doc/1
{
    "title": "can work?shit"
}
  3. Search
POST demo_1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "shit"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

The search result is below:

{
    "took":2,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":{
            "value":1,
            "relation":"eq"
        },
        "max_score":0.2876821,
        "hits":[
            {
                "_index":"demo_1",
                "_type":"_doc",
                "_id":"1",
                "_score":0.2876821,
                "_source":{
                    "title":"can work?shit"
                },
                "highlight":{
                    "title":[
                        "<em>shit</em>"
                    ]
                }
            }
        ]
    }
}
  4. If the highlighter type is set to plain, highlighting works.
POST demo_1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "shit"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {
        "type": "plain"
      }
    }
  }
}

The result is below:

{
    "took":2,
    "timed_out":false,
    "_shards":{
        "total":1,
        "successful":1,
        "skipped":0,
        "failed":0
    },
    "hits":{
        "total":{
            "value":1,
            "relation":"eq"
        },
        "max_score":0.2876821,
        "hits":[
            {
                "_index":"demo_1",
                "_type":"_doc",
                "_id":"1",
                "_score":0.2876821,
                "_source":{
                    "title":"can work?shit"
                },
                "highlight":{
                    "title":[
                      "can work?<em>shit</em>"
                    ]
                }
            }
        ]
    }
}
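
To make the character offsets in this report concrete, here is a minimal sketch that computes token offsets for the example title. It is only a rough stand-in for the standard analyzer: the `token_offsets` helper is hypothetical and simply lowercases runs of word characters, which is an assumption that happens to be adequate for this ASCII input.

```python
import re


def token_offsets(text):
    """Return (token, start, end) triples for runs of word characters.

    Hypothetical helper: a crude approximation of the standard analyzer,
    good enough to show where each token sits in this particular string.
    """
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


title = "can work?shit"
tokens = token_offsets(title)

for token, start, end in tokens:
    # End offsets are exclusive, like the offsets stored in the index.
    print(f"{token!r}: [{start}, {end})")
```

Under this approximation the matched term "shit" covers characters 9 to 13, so a correct fragment has to start at offset 0 to include the preceding "can work?".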
@tenlee2012 tenlee2012 added >bug needs:triage Requires assignment of a team area label labels Jul 24, 2020
@jimczi jimczi added :Search Relevance/Highlighting How a query matched a document and removed needs:triage Requires assignment of a team area label labels Jul 27, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Highlighting)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jul 27, 2020
@mayya-sharipova mayya-sharipova added the good first issue low hanging fruit label Mar 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@DHRUV6029

Hi, can I work on this issue?

@mayya-sharipova
Contributor

@DHRUV6029 So far there is no PR submitted for this issue. You are welcome to submit yours.

@arshPratap

Hi @mayya-sharipova, I would like to work on this issue. Can it be assigned to me?

@mayya-sharipova
Contributor

@arshPratap You are welcome to submit a PR, we don't assign external contributors to an issue.

@thaotran27

Hi! I want to contribute to this issue.

So far, I have traced the problem to the formatting of the highlight field, in the format method of the CustomPassageFormatter class. For the example "can work?shit", passage.getStartOffset() returned 9 when it should have been 0.

Upon further inspection, the passage's start offset is set to off.startOffset() in the highlightOffsetsEnums method of the FieldHighlighter class, which is called by the highlightFieldForDoc method in the same class. However, I do not know how or where the startOffset of the OffsetsEnum is set, or whether the reader affects the OffsetsEnum.

I hope to receive more guidance on this issue!
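
The effect of the passage start offset described above can be illustrated with a small sketch. This is not the CustomPassageFormatter code: `format_passage` is a hypothetical helper that only mimics the general idea of wrapping matches inside a passage, and the offsets 9 and 0 are the values reported in this comment.

```python
def format_passage(text, passage_start, passage_end, matches,
                   pre="<em>", post="</em>"):
    """Wrap each (start, end) match inside text[passage_start:passage_end].

    Hypothetical stand-in for a passage formatter, for illustration only.
    """
    out = []
    pos = passage_start
    for start, end in sorted(matches):
        if start < pos or end > passage_end:
            continue  # ignore matches that fall outside the passage
        out.append(text[pos:start])                 # text before the match
        out.append(pre + text[start:end] + post)    # the highlighted match
        pos = end
    out.append(text[pos:passage_end])               # text after the last match
    return "".join(out)


title = "can work?shit"
match = [(9, 13)]  # character offsets of the matched term

# Passage start as reported in this comment (9) versus the expected start (0).
reported = format_passage(title, 9, 13, match)
expected = format_passage(title, 0, 13, match)
print(reported)
print(expected)
```

With a passage start of 9 the formatter has nothing before the match to emit, which reproduces the truncated fragment from the report. The sketch says nothing about why the start offset ends up at 9; that part is still the open question.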

@javanna javanna added Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@javanna javanna added the priority:normal A label for assessing bug priority to be used by ES engineers label Jul 18, 2024
@sidiki97

@mayya-sharipova @javanna
I opened a pull request for this issue: "Update preceding method" (#119301).
Please review it and let me know if any changes are needed. Thanks.
