Too many incorrect predictions with Elasticsearch category predictor #918

raphael0202 · 2022-09-27T16:19:05Z

Of the last 25 categories predicted with Elasticsearch, only 11 are accurate. This is due to the matching algorithm, which allows over-matching.

Examples of incorrect matches:

- Pepperoni matches category Pepperoni pizzas
- Baked matches category Baked alaska
- Sauce piquante matches category Langues de bœuf sauce piquante

For reference, here is how elasticsearch category matching is performed:

All category names are stored in the Elasticsearch category index. We currently support French, English, Spanish and German. The following processing is performed on all category names of supported languages:
- ascii folding (accent removal)
- lowercasing
- elision removal (removing d' and l', only for French)
- language-specific stopword removal
- stemming
The product name is used as the query, and we perform a match_phrase search, which means every word in the query must be found, in the same order, in the category name. Elasticsearch applies the same processing to the query before matching. Note that Elasticsearch keeps an empty position where a stop word was removed. For example, barre/1 de/2 chocolat/3 becomes barre/1 chocolat/3 after stop word removal, so the query "barre chocolat" (barre/1 chocolat/2) wouldn't return the document. To prevent this, a custom version of Elasticsearch with a "gapless" plugin is used.
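To make the position issue concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not Robotoff or Elasticsearch code: `analyze`, `phrase_match`, the `gapless` flag and the tiny stop word list are hypothetical names chosen for illustration, and stemming is left out to keep it short.

```python
import re
import unicodedata

# Tiny illustrative stop word list (assumption); real analyzers use full lists.
FRENCH_STOPWORDS = {"de", "du", "la", "le", "les", "des", "au", "aux", "et"}


def fold_ascii(text: str) -> str:
    """Strip accents, a rough equivalent of ASCII folding."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def analyze(text: str, gapless: bool) -> list[tuple[str, int]]:
    """Return (token, position) pairs after folding, lowercasing,
    elision removal and stop word removal. Stemming is omitted."""
    text = fold_ascii(text).lower()
    text = re.sub(r"\b[dl]'", "", text)  # elision removal: d', l'
    tokens = re.findall(r"[a-z0-9]+", text)
    analyzed = []
    position = 0
    for token in tokens:
        position += 1
        if token in FRENCH_STOPWORDS:
            if gapless:
                position -= 1  # do not leave a hole where the stop word was
            continue
        analyzed.append((token, position))
    return analyzed


def phrase_match(query: list[tuple[str, int]], document: list[tuple[str, int]]) -> bool:
    """True if the query tokens occur in the document with the same
    relative positions, which is what a phrase query requires."""
    if not query:
        return False
    doc_positions: dict[str, set[int]] = {}
    for token, pos in document:
        doc_positions.setdefault(token, set()).add(pos)
    first_token, first_pos = query[0]
    for start in doc_positions.get(first_token, ()):
        offset = start - first_pos
        if all(pos + offset in doc_positions.get(tok, ()) for tok, pos in query):
            return True
    return False


# With position gaps, "barre chocolat" does not match "barre de chocolat".
with_gaps = phrase_match(analyze("barre chocolat", False), analyze("barre de chocolat", False))
# Without gaps, it does.
without_gaps = phrase_match(analyze("barre chocolat", True), analyze("barre de chocolat", True))
# Over-matching: the product name "Pepperoni" is a phrase inside "Pepperoni pizzas".
over_match = phrase_match(analyze("Pepperoni", True), analyze("Pepperoni pizzas", True))
```

In this sketch `with_gaps` is `False`, `without_gaps` is `True`, and `over_match` is `True`: the last case is the kind of false positive listed at the top of this issue.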

The issue is that the current matching strategy allows submatches (see examples above), which triggers too many false positives.

Possible strategies:

1. Maybe there is a way to deal with this inside Elasticsearch, but I'm not aware of it (the keyword type, which is used for full matches, doesn't seem to be usable with our processing here).
2. Add checks after the Elasticsearch query to see if we indeed have a full match.
3. Don't rely on Elasticsearch anymore for category name matching, and implement it inside Robotoff instead.
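As an illustration of the second option, a post-query check could look roughly like the sketch below. The `normalize` helper is a hypothetical stand-in for the real analysis chain (it only folds accents, lowercases and splits on whitespace), and `candidates` stands for the category names returned by the search; neither is existing Robotoff code.

```python
import unicodedata


def normalize(text: str) -> tuple[str, ...]:
    """Hypothetical stand-in for the analyzer: fold accents, lowercase,
    split on whitespace. The real chain also removes stop words and stems."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return tuple(folded.lower().split())


def keep_full_matches(product_name: str, candidates: list[str]) -> list[str]:
    """Keep only the candidate category names whose tokens are exactly
    the tokens of the product name (no submatch allowed)."""
    query_tokens = normalize(product_name)
    return [name for name in candidates if normalize(name) == query_tokens]


# "Pepperoni" no longer matches "Pepperoni pizzas", only the exact name survives.
filtered = keep_full_matches("Pepperoni", ["Pepperoni pizzas", "Pepperoni"])
```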

I'm personally in favor of the 3rd option, as:

- Elasticsearch is definitely overkill for this kind of matching: we're not matching documents, only a few thousand short names. And the custom version of Elasticsearch we use is major technical debt.
- We would have more flexibility in the matching process.
- We can drop the Elasticsearch requirement in the longer run (relying on Elasticsearch for spellcheck is not ideal either, as spellcheck can also be developed in Python and be more flexible).


alexgarel · 2022-09-27T17:34:43Z

Dropping Elasticsearch seems OK to me.

stephanegigandet · 2022-09-28T10:25:12Z

> the product name is used as the query

That sounds really strange indeed. We should look for the category names inside the product name, not the reverse.

Just a regexp with all category names and synonyms should work, I think (with preprocessing applied first). This is the same approach as used in #890.
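A rough Python sketch of that idea follows, assuming a simplified preprocessing step (accent folding, lowercasing, tokenizing) and a made-up mini-taxonomy. The category identifiers, `preprocess`, `build_matcher` and `match_categories` are illustrative names only, not the code from #890 or from the Perl backend.

```python
import re
import unicodedata


def preprocess(text: str) -> str:
    """Simplified preprocessing (assumption): fold accents, lowercase,
    keep alphanumeric tokens joined by single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.findall(r"[a-z0-9]+", folded.lower()))


# Hypothetical mini-taxonomy: category id -> names and synonyms.
CATEGORIES = {
    "en:pepperoni-pizzas": ["Pepperoni pizzas", "Pepperoni pizza"],
    "en:baked-alaska": ["Baked Alaska"],
    "en:dark-chocolates": ["Dark chocolates", "Dark chocolate"],
}


def build_matcher(categories: dict[str, list[str]]) -> tuple[re.Pattern, dict[str, str]]:
    """Compile one regex that matches any preprocessed category name."""
    name_to_id = {}
    for category_id, names in categories.items():
        for name in names:
            name_to_id[preprocess(name)] = category_id
    # Longest names first, so a long name wins over a shorter one it starts with.
    alternatives = sorted(name_to_id, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, alternatives)) + r")\b")
    return pattern, name_to_id


def match_categories(product_name: str, pattern: re.Pattern, name_to_id: dict[str, str]) -> list[str]:
    """Return the ids of all categories whose name appears in the product name."""
    text = preprocess(product_name)
    return sorted({name_to_id[m.group(0)] for m in pattern.finditer(text)})


pattern, name_to_id = build_matcher(CATEGORIES)
# The category name is searched inside the product name, so a bare
# "Pepperoni" no longer matches "Pepperoni pizzas".
bare = match_categories("Pepperoni", pattern, name_to_id)
full = match_categories("Frozen pepperoni pizza", pattern, name_to_id)
```

Here `bare` is an empty list and `full` contains only the pizza category, since the whole category name must appear in the product name.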

In fact we could have a separate module to match entries from taxonomies. It's what we now have on the Perl backend:
https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Tags.pm#L4229

raphael0202 · 2022-09-28T15:08:50Z

Looking for the category name inside the product name may also return false positives, as some ingredients are also categories (pineapple, passion fruits, ...).
But I don't think this is that common. The simplest way to avoid it is to keep a list of ingredients for which we require a full match.
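One way this could look, as a hedged sketch: category names that are also common ingredients go into a full-match-only set, and every other name may match anywhere inside the product name. The data and the names `CATEGORY_NAMES`, `FULL_MATCH_ONLY`, `contains_phrase` and `predict` are all hypothetical.

```python
# Hypothetical data: preprocessed category name -> category id.
CATEGORY_NAMES = {
    "pineapple": "en:pineapples",
    "passion fruit": "en:passion-fruits",
    "yogurt": "en:yogurts",
}

# Names that are also common ingredients: only accepted on a full match.
FULL_MATCH_ONLY = {"pineapple", "passion fruit"}


def contains_phrase(text_tokens: list[str], name_tokens: list[str]) -> bool:
    """True if name_tokens appears as a contiguous run inside text_tokens."""
    n = len(name_tokens)
    return any(
        text_tokens[i:i + n] == name_tokens
        for i in range(len(text_tokens) - n + 1)
    )


def predict(product_name: str) -> list[str]:
    """Return matching category ids, requiring a full match for ingredient-like names."""
    tokens = product_name.lower().split()
    matches = []
    for name, category_id in CATEGORY_NAMES.items():
        name_tokens = name.split()
        if name in FULL_MATCH_ONLY:
            if tokens == name_tokens:
                matches.append(category_id)
        elif contains_phrase(tokens, name_tokens):
            matches.append(category_id)
    return matches


# "Pineapple yogurt" is a yogurt, not a pineapple.
flavoured = predict("Pineapple yogurt")
# A product simply named "Pineapple" still gets the pineapple category.
plain = predict("Pineapple")
```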

raphael0202 · 2022-11-22T10:22:03Z

We're not using Elasticsearch anymore for category prediction, closing this issue.

raphael0202 added the 🐛 bug, category prediction and elasticsearch labels Sep 27, 2022

teolemon added this to 🤖 Artificial Intelligence @ Open Food Facts Sep 27, 2022

teolemon moved this to Todo in 🤖 Artificial Intelligence @ Open Food Facts Sep 27, 2022

raphael0202 mentioned this issue Sep 30, 2022

feat: update category matching algorithm #924

Closed

raphael0202 mentioned this issue Oct 12, 2022

feat: update category matching algorithm #952

Merged

raphael0202 closed this as completed Nov 22, 2022

Repository owner moved this from Todo to Done in 🤖 Artificial Intelligence @ Open Food Facts Nov 22, 2022
