Improve search in publications and datasets overview #1604

nicolasfranck · 2024-06-14T14:07:17Z

nicolasfranck
Jun 14, 2024
Maintainer

Current implementation on /publication

The current search implementation on page /publications works as follows.

The query string is fed into an Elasticsearch query of type simple_query_string, with the following attributes set:

fields:

[
  "id^100",
  "identifier^50",
  "isxn^50",
  "title^40",
  "organization_id^50",
  "contributor.phrase_ngram^0.05",
  "contributor.ngram^0.01",
  "all"
]

flags: PHRASE

If we ignore a few of these fields (and use only identifier and all, for example) and then type a query like my : long query (note the colon), you get the following query within Elasticsearch:

POST http://localhost:9200/biblio_datasets/_validate/query?pretty&explain
{
   "query" : {
      "simple_query_string" : {
         "flags" : "PHRASE",
         "lenient" : "true",
         "fields" : [
            "identifier^50",
            "all"
         ],
         "auto_generate_synonyms_phrase_query" : "true",
         "minimum_should_match" : "100%",
         "analyze_wildcard" : "false",
         "query" : "my : long query",
         "default_operator" : "AND"
      }
   }
}

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "biblio_datasets_20240614104525",
      "valid" : true,
      "explanation" : "((+all:my +all:long +all:query) | (identifier:my : long query)^50.0)~1.0"
    }
  ]
}

Syntax:

|: "OR" operator
( and ): precedence operators, used to group elements
+: marks a subquery as mandatory. Without it, a subquery is optional
<field>:<value>: search in field <field> for value <value>, e.g. identifier:my : long query searches for the value my : long query in field identifier

So the full interpretation by Elasticsearch is:

Either subquery one must match: (+all:my +all:long +all:query)
Or subquery two must match: (identifier:my : long query)

So the whole query string is fed to every field. This is due to the missing flag WHITESPACE (see further).

Every field then splits the given string into tokens on its own, and so constructs its own subquery,
according to its own rules:

field all splits on whitespace, filters out punctuation and accents, and so generates a list of mandatory subqueries: +all:my +all:long +all:query
field identifier does NOT do any splitting, but merely converts the input to lowercase. So the whole query string becomes one lowercased token that is searched for: my : long query

This way of querying does not always work in the current production environment. For example, type a DOI followed by a space, e.g. "mydoi/123 " (note the trailing space):

field all does NOT contain a copy of the values of field identifier (where the DOI is stored), so no match is possible there
field identifier does not split on whitespace and so keeps the trailing space, leading to zero matches there too.

Same goes for other values that are not copied into the field all:

id
contributor.phrase_ngram

For these fields the search will only return results as long as the values are trimmed.
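
The mismatch can be reproduced outside Elasticsearch with a minimal sketch. The two functions below are rough stand-ins for the analyzers described above, not the real ones: the lowercasing in analyze_all and the sample document are assumptions made for illustration.

```python
import re
import unicodedata


def analyze_all(text):
    """Stand-in for field `all`: strip accents, drop punctuation, split on whitespace.

    Lowercasing is an assumption; the discussion only mentions punctuation and accents.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", no_accents)


def analyze_identifier(text):
    """Stand-in for field `identifier`: a single lowercased token, whitespace kept."""
    return [text.lower()]


# Hypothetical document: the DOI lives in `identifier` and is NOT copied into `all`.
doc = {"all": "A hypothetical title", "identifier": ["mydoi/123"]}


def matches(doc, query):
    """Mimic (all-subquery | identifier-subquery) with default_operator AND."""
    all_terms = set(analyze_all(doc["all"]))
    id_terms = {t for value in doc["identifier"] for t in analyze_identifier(value)}
    query_all = analyze_all(query)
    all_hit = bool(query_all) and all(t in all_terms for t in query_all)
    id_hit = analyze_identifier(query)[0] in id_terms
    return all_hit or id_hit
```

With this sketch, matches(doc, "mydoi/123") is true, while matches(doc, "mydoi/123 ") is false: the trailing space survives in the identifier token, and the field all never saw the DOI.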

Reasons for current implementation

One fix for the matching problem described above would be to add the flag WHITESPACE,
which changes the interpretation:

POST http://localhost:9200/biblio_datasets/_validate/query?pretty&explain
{
   "query" : {
      "simple_query_string" : {
         "default_operator" : "AND",
         "analyze_wildcard" : "false",
         "flags" : "PHRASE|WHITESPACE",
         "minimum_should_match" : "100%",
         "fields" : [
            "identifier^50",
            "all"
         ],
         "auto_generate_synonyms_phrase_query" : "true",
         "query" : "my : long query",
         "lenient" : "true"
      }
   }
}

{
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "valid" : true,
  "explanations" : [
    {
      "index" : "biblio_datasets_20240614104525",
      "valid" : true,
      "explanation" : "+(all:my | (identifier:my)^50.0)~1.0 +(identifier::)^50.0 +(all:long | (identifier:long)^50.0)~1.0 +(all:query | (identifier:query)^50.0)~1.0"
    }
  ]
}

Now the full query is split into tokens by the query analyzer itself before being fed into every separate field analyzer (which now no longer receives the full sentence),
leading to these mandatory sub queries (boosts removed for readability):

+(all:my | identifier:my)
+(identifier::) -> PROBLEMATIC
+(all:long | identifier:long)
+(all:query | identifier:query)

The second mandatory subquery is a problem though: field identifier must contain : exactly. Note that this subquery has no alternative on the field all, because the tokenizer of that field removes all punctuation and so produces no token for ":". A match on the field all would have rescued this subquery, but it is missing.

That is why the flag WHITESPACE was removed in #600.
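
How the mandatory subqueries above come about can be sketched in a few lines. This is only an approximation of what the query parser does with WHITESPACE enabled: the analyzers are simplified stand-ins, boosts are left out, and tokens that a field would break into several terms are not modelled faithfully.

```python
import re


def analyze_all(token):
    """Stand-in for field `all`: punctuation is dropped, so ":" yields no term."""
    return re.findall(r"\w+", token.lower())


def analyze_identifier(token):
    """Stand-in for field `identifier`: the token is kept whole, lowercased."""
    return [token.lower()] if token else []


# Field order mirrors the order used in the explanation output above.
ANALYZERS = {"all": analyze_all, "identifier": analyze_identifier}


def whitespace_clauses(query):
    """Build one mandatory clause per whitespace-separated token.

    Each clause ORs together the fields that still produce a term for the token.
    """
    clauses = []
    for token in query.split():
        alternatives = [
            f"{field}:{term}"
            for field, analyze in ANALYZERS.items()
            for term in analyze(token)
        ]
        clauses.append("+(" + " | ".join(alternatives) + ")")
    return clauses


for clause in whitespace_clauses("my : long query"):
    print(clause)
```

The second clause printed is +(identifier::), because only the identifier stand-in produces a term for the lone colon.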

What can be done:

Let ES copy the value of identifier into the field all
Only search on field all

Some conclusions:

in the old implementation the order of the tokens does not matter, as long as all of these values are contained in the field all, which performs its own whitespace splitting.
in the old implementation punctuation and whitespace characters are only ignored when the match happens in the field all.
one should never search over multiple fields that have conflicting opinions about tokenization (which characters to keep, how to split into tokens).
any relevance boosting is lost in the user interface, which ignores the calculated score and always sorts on dates or years. So stricter matches are not necessarily ranked higher than less strict matches.

e.g. query "my long title"
docs:

   1. title: "my long title 2" 
   2. title: "my long title"

One would expect the second document to be on top, but the UI sorts on field last-updated,
so ordering by relevance cannot be guaranteed.
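
A small illustration of the effect, with made-up scores and dates (both are hypothetical; only the two titles come from the example above, and descending order on last-updated is assumed):

```python
from datetime import date

# Hypothetical hits: the exact title match gets the higher score,
# but the other document was updated more recently.
hits = [
    {"title": "my long title 2", "score": 1.0, "last_updated": date(2024, 6, 1)},
    {"title": "my long title", "score": 2.0, "last_updated": date(2024, 1, 1)},
]

# What a relevance-based ordering would show first.
by_relevance = sorted(hits, key=lambda hit: hit["score"], reverse=True)

# What a UI sorting on last-updated (newest first) shows first.
by_date = sorted(hits, key=lambda hit: hit["last_updated"], reverse=True)

print(by_relevance[0]["title"])
print(by_date[0]["title"])
```

Sorting on the date puts "my long title 2" first regardless of the score, so any boost in the query has no visible effect.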

Solution suggestion for /publication

change attribute simple_query_string.fields to ["title^40", "all", "contributor.ngram"]
change ES mapping:
- copy field identifier to field all
- copy field organization_id to field all
- copy field isxn (containing all kinds of ISSN and ISBN) to field all
possibly remove the extra relevance boosting query (not shown here), which serves no purpose
possibly drop the separate query on field title too (I contradict myself here): if relevance serves no purpose, copy its value to field all and remove the field from the list.
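
As a rough sketch of what this suggestion could look like, the snippet below builds a mapping fragment using the Elasticsearch copy_to mapping parameter and the adjusted query body. The field types (keyword, text) are assumptions, since the real mapping is not shown in this discussion; the other query attributes are taken over unchanged from the requests above.

```python
import json

# Mapping fragment (properties only). Field types are assumed for illustration;
# only the copy_to targets follow from the suggestion above.
mapping_properties = {
    "all": {"type": "text"},
    "identifier": {"type": "keyword", "copy_to": "all"},
    "organization_id": {"type": "keyword", "copy_to": "all"},
    "isxn": {"type": "keyword", "copy_to": "all"},
}


def build_query(user_input):
    """Build the simple_query_string body with the reduced field list."""
    return {
        "query": {
            "simple_query_string": {
                "query": user_input,
                "fields": ["title^40", "all", "contributor.ngram"],
                "flags": "PHRASE",
                "default_operator": "AND",
                "minimum_should_match": "100%",
                "lenient": "true",
                "analyze_wildcard": "false",
                "auto_generate_synonyms_phrase_query": "true",
            }
        }
    }


print(json.dumps(build_query("my : long query"), indent=2))
```

Because identifier, organization_id and isxn are no longer queried directly, their stricter tokenization can no longer conflict with that of the field all.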

nicolasfranck · 2024-06-14T14:37:30Z

nicolasfranck
Jun 14, 2024
Maintainer Author

See https://github.com/ugent-library/biblio-backoffice/tree/es6_search_improvements for change suggestions
