title | date | area | tags | |||
---|---|---|---|---|---|---|
New language inheritance mechanism for opensearch |
2023-04-11 |
Core,Elasticsearch |
|
Currently, when using Elasticsearch for searching on storefront, we are creating multiple indexes of each language. This would be fine till now however there are a few problems with it:
- We need to manage multiple indexes, if the shop's using multilingual, we need to create several indexes for each language, this is a big problem on cloud especially
- "Indices and shards are therefore not free from a cluster perspective, as there is some level of resource overhead for each index and shard."
- Everytime a record is updated, we need to update that record in every language indexes
- There's currently no fallback mechanism when searching, therefor duplicating default language data for each index is needed, but not every field is translatable, this take more storage for each index
We introduce a new feature flag ES_MULTILINGUAL_INDEX
to allow people to opt in to the new multilingual ES index immediately.
We changed the approach to Multilingual fields strategy following these criteria
- Each searchable entity now have only one index for all languages (e.g sw_product)
- Each translated field will be mapped as an
object field
, each language_id will be a key in the object - When searching for these fields, use multi-match search with <translated_field>.<context_lang_id>, <translated_field>.<parent_current_lang_id> and <translated_field>.<default_lang_id> as fallback, this way we have a fallback mechanism without needing duplicate data
- Same logic applied when sorting with the help of a painless script (see 3.Sorting below)
- When a new language is added or a record is update, we do a partial update instead of replacing the whole document, this will reduce the request update payload and thus improve indexing performance overall
Example:
OLD structure
// PUT /sw_product
{
"mappings": {
"properties": {
"productNumber": {
"type": "keyword"
},
"name": {
"type": "keyword",
"fields": {
"text": {
"type": "text"
},
"ngram": {
"type": "text",
"analyzer": "sw_ngram_analyzer"
}
}
}
}
}
}
NEW structure
// PUT /sw_product/_mapping
{
"mappings": {
"properties": {
"productNumber": {
"type": "keyword"
},
"name": {
"properties": {
"en": {
"type": "keyword",
"fields": {
"text": {
"type": "text",
"analyzer": "sw_english_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "sw_ngram_analyzer"
}
}
},
"de": {
"type": "keyword",
"fields": {
"text": {
"type": "text",
"analyzer": "sw_german_analyzer"
},
"ngram": {
"type": "text",
"analyzer": "sw_ngram_analyzer"
}
}
}
}
}
}
}
}
Assume we're searching products in german
// GET /sw_product/_search
{
"query": {
"multi_match": {
"query": "some keyword",
"fields": [
"title.de.search", // context languge
"title.en.search" // fallback language
],
"type": "best_fields"
}
}
}
We add new painless scripts in Framework/Indexing/Scripts/translated_field_sorting.groovy
and Framework/Indexing/Scripts/numeric_translated_field_sorting.groovy
, this script then will be used when sorting
Example: Sort products by name in DESC
// GET /sw_product/_search
{
"query": {
...
},
"sort": [
{
"_script": {
"type": "string",
"script": {
"id": "translated_field_sorting",
"params": {
"field": "name",
"languages": [
"119317f1d1d1417c9e6fb0059c31a448", // context language
"2fbb5fe2e29a4d70aa5854ce7ce3e20b" // fallback language
]
}
},
"order": "DESC"
}
}
]
}
- When a new language is created, we perform this request to update mapping includes new added language
// PUT /sw_product/_mapping
{
"properties": {
"name": {
"properties": {
"<new_language_id>": {
"type": "keyword",
"fields": {
"text": {
"type": "text",
"analyzer": "<new_language_stop_words_analyzer>"
},
"ngram": {
"type": "text",
"analyzer": "sw_ngram_analyzer"
}
}
}
}
}
}
}
- From the next major version, old language based indexes will not be used any longer thus could be removed on es cluster
- When the feature is activated, the shop must reindex using command
bin/console es:index
in the next update