Improve search in publications and datasets overview #1604
nicolasfranck
started this conversation in
Ideas
Replies: 1 comment
-
See https://github.com/ugent-library/biblio-backoffice/tree/es6_search_improvements for change suggestions.
Current implementation on /publication
The current search implementation on page /publications uses a query with the following settings:
fields
flags: PHRASE
If we ignore a few of these fields (and use only identifier and all, for example) and then type a query like "my : long query" (note the colon), ElasticSearch builds a query that uses the following syntax:
|: "OR" operator
( and ): precedence operators, used to group elements
+: marks a subquery as mandatory. Without it a subquery is optional
+<field>:<value>: searches in field <field> for value <value>, e.g. identifier:my : long query searches for value "my : long query" in field identifier
So the full interpretation by ElasticSearch is:
(+all:my +all:long +all:query)
(identifier:my : long query)
So the whole query string is fed to every field. This is due to the missing flag WHITESPACE (see further).
Each field then splits the given string into tokens on its own, and so constructs its own subquery according to its own rules:
all splits the string on whitespace, filters out punctuation and accents, and so generates a list of mandatory subqueries: +all:my +all:long +all:query
identifier does NOT do any splitting, but merely converts all tokens to lowercase. So the whole query string becomes one lowercased token that is searched for: my : long query
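The per-field behaviour described above can be imitated in a few lines of Python. This is only a rough sketch: analyze_all and analyze_identifier are hypothetical stand-ins written for illustration, not the actual analyzers from the index mapping, and they only model the rules named in the text (whitespace splitting, punctuation and accent removal for all; lowercasing without splitting for identifier).

```python
import re
import unicodedata


def analyze_all(text: str) -> list[str]:
    """Hypothetical stand-in for the analyzer of field "all"."""
    # Strip accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Split on whitespace, remove punctuation, drop tokens that end up empty.
    tokens = (re.sub(r"[^\w]", "", part) for part in without_accents.split())
    return [t for t in tokens if t]


def analyze_identifier(text: str) -> list[str]:
    """Hypothetical stand-in for field "identifier": no splitting, only lowercasing."""
    return [text.lower()]


query = "my : long query"
# Every field receives the whole query string and tokenizes it by its own rules.
print(" ".join(f"+all:{t}" for t in analyze_all(query)))
print(" ".join(f"identifier:{t}" for t in analyze_identifier(query)))
```

The lone colon disappears for all because nothing is left of it after punctuation removal, while identifier keeps the full string, colon and spaces included, as a single term.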
This way of querying does not always work in the current production environment. For example, type a DOI with a space after it, such as "mydoi/123 ":
all does NOT contain a copy of the field values from identifier (where the DOI is stored), so no match is possible there
identifier does not split on whitespace, and so keeps the trailing whitespace, leading to zero matches there too
The same goes for other values that are not copied into the field all:
id
contributor.phrase_ngram
For these fields the search only returns results as long as the values are trimmed.
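The trailing-space case can be reproduced with a tiny model of the identifier field. The sketch below assumes, as described above, that the field only lowercases its input and compares the resulting single token exactly; identifier_term and identifier_matches are made-up helper names, not part of the code base.

```python
def identifier_term(value: str) -> str:
    """Hypothetical stand-in for the identifier analyzer: lowercase only, no splitting or trimming."""
    return value.lower()


def identifier_matches(indexed_value: str, query: str) -> bool:
    # Exact comparison of the single token produced on each side.
    return identifier_term(indexed_value) == identifier_term(query)


print(identifier_matches("mydoi/123", "mydoi/123 "))          # trailing space in the query
print(identifier_matches("mydoi/123", "mydoi/123 ".strip()))  # trimmed query
```

The first call prints False and the second prints True, which is the "only works when trimmed" behaviour in miniature.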
Reasons for current implementation
One solution for the current matching algorithm would be to add the flag WHITESPACE, which changes the interpretation.
Now the full query is split into tokens by the query analyzer itself before being fed into every separate field analyzer (which no longer receives the full sentence), leading to these mandatory subqueries (boosts removed for readability):
+(all:my | identifier:my)
+(identifier::) -> PROBLEMATIC
+(all:long | identifier:long)
+(all:query | identifier:query)
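To see where the problematic subquery comes from, the splitting order with WHITESPACE can be mimicked as follows. This is a simplified simulation, not ElasticSearch code: the two per-token analyzers and build_subqueries are hypothetical helpers that follow only the rules stated in this post (all drops punctuation, identifier lowercases).

```python
import re


def all_terms(token: str) -> list[str]:
    # Stand-in for field "all": punctuation is removed, empty results are dropped.
    cleaned = re.sub(r"[^\w]", "", token)
    return [cleaned] if cleaned else []


def identifier_terms(token: str) -> list[str]:
    # Stand-in for field "identifier": the token is kept as is, lowercased.
    return [token.lower()]


FIELD_ANALYZERS = {"all": all_terms, "identifier": identifier_terms}


def build_subqueries(query: str) -> list[str]:
    subqueries = []
    # With WHITESPACE the query is split first; each token then goes to every field.
    for token in query.split():
        alternatives = [
            f"{field}:{term}"
            for field, analyze in FIELD_ANALYZERS.items()
            for term in analyze(token)
        ]
        subqueries.append("+(" + " | ".join(alternatives) + ")")
    return subqueries


for subquery in build_subqueries("my : long query"):
    print(subquery)
```

For the token ":" only identifier yields a term, so that mandatory group has no alternative on all to fall back on.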
The second mandatory subquery is a problem though (field identifier must contain ":" exactly). Note that in this subquery an optional query on the field all is missing, because the tokenizer of that field removed all punctuation. So in this specific subquery a match on the field all would have helped, but is now missing.
That is why the flag WHITESPACE was removed in #600.
What can be done:
Copy identifier into the field all
all
Some conclusions:
all that performs its own whitespace-splitting.
all
e.g. query "my long title"
docs:
One would expect the second document to be on top, but the UI sorts on the field last-updated, so ordering by relevance cannot be guaranteed.
Solution suggestion for /publication
Set simple_query_string.fields to ["title^40", "all", "contributor.ngram"]
Copy identifier to field all
Copy organization_id to field all
Copy isxn (containing all kinds of ISSN and ISBN) to field all
title too (I contradict myself here): if relevance serves no purpose, copy the value to field all, and remove the field from the list.
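As a rough illustration of the suggestions above, the request body and a mapping fragment could look like the Python dictionaries below. The fields list is taken from this post and copy_to is the standard ElasticSearch mapping parameter for copying values into another field. Everything else is an assumption: the field types are placeholders, build_search_body is a made-up helper, and the real biblio-backoffice mapping may differ.

```python
import json


def build_search_body(user_query: str) -> dict:
    """Sketch of a search body using the suggested field list (helper name is hypothetical)."""
    return {
        "query": {
            "simple_query_string": {
                "query": user_query,
                "fields": ["title^40", "all", "contributor.ngram"],
            }
        }
    }


# Mapping fragment: values of these fields are also indexed into "all".
# The "keyword" and "text" types are placeholders, not the actual mapping.
MAPPING_FRAGMENT = {
    "properties": {
        "identifier": {"type": "keyword", "copy_to": "all"},
        "organization_id": {"type": "keyword", "copy_to": "all"},
        "isxn": {"type": "keyword", "copy_to": "all"},
        "all": {"type": "text"},
    }
}

print(json.dumps(build_search_body("my long title"), indent=2))
```

With identifier copied into all, a DOI typed with a trailing space would be tokenized by the analyzer of all, which splits on whitespace, instead of depending on an exact match in identifier.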