Too many incorrect predictions with Elasticsearch category predictor #918

Closed
raphael0202 opened this issue Sep 27, 2022 · 4 comments

Comments

@raphael0202
Collaborator

Of the last 25 categories predicted by Elasticsearch, only 11 are accurate. This is due to the matching algorithm, which allows over-matching.

Examples of incorrect matches:

  • Pepperoni matches category Pepperoni pizzas
  • Baked matches category Baked alaska
  • Sauce piquante matches category Langues de bœuf sauce piquante

For reference, here is how elasticsearch category matching is performed:

  • all category names are stored in the Elasticsearch category index. We currently support French, English, Spanish and German. The following processing is performed on all category names in supported languages:
    • ascii folding (accent removal)
    • lowercasing
    • elision removal (removing d' and l', only for French)
    • language-specific stopword removal
    • stemming
  • the product name is used as the query, and we're performing a match_phrase search, which means every word in the query must be found in the same order in the category name; the category name may still contain extra words before or after the phrase. The same processing is performed by Elasticsearch on the query before matching. Please note that Elasticsearch keeps an empty position index when removing stop words. For example, barre/1 de/2 chocolat/3 becomes barre/1 chocolat/3 after stop word removal, and a query "barre chocolat" (barre/1 chocolat/2) wouldn't return the document. To prevent this from happening, a custom version of Elasticsearch with a "gapless" plugin is used.
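
To make the position issue and the over-matching concrete, here is a minimal Python sketch that imitates the analysis chain and a phrase match. It is not the Elasticsearch implementation: `analyze`, `phrase_match`, the `gapless` flag and the tiny stop word list are all hypothetical names chosen for illustration, and stemming is left out.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop word list (illustration only).
FRENCH_STOPWORDS = {"de", "du", "la", "le", "les", "des", "au", "aux"}
# Elision removal: strip a leading d' or l' (French only).
ELISION_RE = re.compile(r"\b[dl]'")


def fold_ascii(text):
    """Remove accents by dropping combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def analyze(text, gapless):
    """Return (token, position) pairs. Stemming is omitted in this sketch."""
    text = ELISION_RE.sub("", fold_ascii(text).lower())
    tokens = []
    position = 0
    for word in re.findall(r"\w+", text):
        position += 1
        if word in FRENCH_STOPWORDS:
            if gapless:
                position -= 1  # do not leave an empty position behind
            continue
        tokens.append((word, position))
    return tokens


def phrase_match(query_tokens, doc_tokens):
    """True if the query tokens occur in the document with the same offsets."""
    if not query_tokens:
        return False
    doc_by_position = {pos: word for word, pos in doc_tokens}
    first_word, first_pos = query_tokens[0]
    for word, start in doc_tokens:
        if word != first_word:
            continue
        if all(
            doc_by_position.get(start + pos - first_pos) == w
            for w, pos in query_tokens
        ):
            return True
    return False


query = analyze("barre chocolat", gapless=True)
# With position gaps kept, the phrase does not match.
print(phrase_match(query, analyze("barre de chocolat", gapless=False)))
# With gapless positions, it matches.
print(phrase_match(query, analyze("barre de chocolat", gapless=True)))
# Over-matching: a one-word query matches any category containing it.
print(phrase_match(analyze("Pepperoni", True), analyze("Pepperoni pizzas", True)))
```

The last call shows why a phrase match alone is too permissive here: nothing requires the category name to be fully covered by the query.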

The issue comes from the fact that the current matching strategy allows submatches (see examples above), which trigger too many false positives.

Possible strategy:

  • maybe there is a way to deal with this inside Elasticsearch, but I'm not aware of it (the keyword type which is used for full match doesn't seem to be usable with our processing here)
  • add checks after Elasticsearch query to see if we indeed have a full match
  • don't rely on Elasticsearch anymore for category name matching, add it inside Robotoff instead
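
As an illustration of the second option, a post-query check could look like the sketch below. It assumes the candidate category names returned by the query are available as plain strings; `normalize` and `keep_full_matches` are hypothetical helpers, and the normalization is a simplified stand-in (no stop word removal or stemming) for the real processing.

```python
import re
import unicodedata


def normalize(text):
    """Simplified stand-in for the index-time processing (assumption)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.findall(r"\w+", folded))


def keep_full_matches(product_name, candidate_categories):
    """Keep only candidates whose normalized name equals the normalized query."""
    query = normalize(product_name)
    return [c for c in candidate_categories if normalize(c) == query]


# Hypothetical candidates, as a phrase query might return them.
candidates = ["Langues de bœuf sauce piquante", "Sauce piquante"]
print(keep_full_matches("sauce piquante", candidates))
```

This keeps Elasticsearch as a candidate generator and only discards partial matches afterwards, so it does not remove the dependency on the custom Elasticsearch build.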

I'm personally in favor of the 3rd option, as:

  • Elasticsearch is definitely overkill for this kind of matching: we're not matching documents, only a few thousand short names. And the custom version of Elasticsearch used is a major technical debt.
  • We would have more flexibility in the matching process
  • We can drop the Elasticsearch requirement in the longer run (relying on Elasticsearch for spellcheck is not ideal either, as spellcheck can also be developed in Python and be more flexible)
@alexgarel
Member

This seems ok to me to drop Elasticsearch.

@stephanegigandet

  • the product name is used as the query

That sounds really strange indeed, we should look for the category names inside the product name, not the reverse.

Just a regexp with all category names and synonyms should work I think (with preprocessing applied first). Same approach as used in #890

In fact we could have a separate module to match entries from taxonomies. It's what we now have on the Perl backend:
https://github.com/openfoodfacts/openfoodfacts-server/blob/main/lib/ProductOpener/Tags.pm#L4229
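
A rough Python sketch of the regexp idea, searching for category names inside the product name rather than the reverse. This is not the Perl backend's logic nor the approach from #890; `normalize`, `build_category_regex` and `find_categories` are hypothetical names, the category list is made up, and the preprocessing is reduced to lowercasing and accent removal (no stemming, so "pizza" and "pizzas" are treated as different words).

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents, collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(re.findall(r"\w+", folded))


def build_category_regex(names):
    """Compile one alternation over all normalized names and synonyms.

    `names` is assumed to be non-empty.
    """
    # Longest names first, so "pepperoni pizzas" wins over "pizzas".
    normalized = sorted({normalize(n) for n in names}, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in normalized)
    return re.compile(rf"\b(?:{alternation})\b")


def find_categories(product_name, pattern):
    """Return the normalized category names found inside the product name."""
    return pattern.findall(normalize(product_name))


pattern = build_category_regex(["Pepperoni pizzas", "Baked alaska", "Pizzas"])
print(find_categories("Frozen Pepperoni Pizzas 400g", pattern))
print(find_categories("Pepperoni", pattern))
```

With this direction of matching, a product simply named "Pepperoni" no longer matches "Pepperoni pizzas", because the whole category name has to appear in the product name.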

@raphael0202
Collaborator Author

Looking for the category name inside the product name may also return false positives, as some ingredients are also categories (pineapple, passion fruits, ...).
But I don't think this is that common. The simplest way to avoid this is to have a list of ingredients for which we require a full match.
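
One possible shape for that rule, as a hedged sketch: categories on a "full match only" list are accepted only when the whole product name equals the category name, while other categories may match as a sub-phrase. `FULL_MATCH_ONLY`, `normalize` and `match_categories` are hypothetical, and the category names in the example are invented for illustration.

```python
import re

# Hypothetical list of category names that are also common ingredients.
FULL_MATCH_ONLY = {"pineapple", "passion fruits"}


def normalize(text):
    """Minimal normalization for the sketch: lowercase and split on non-words."""
    return " ".join(re.findall(r"\w+", text.lower()))


def match_categories(product_name, category_names):
    """Return categories found in the product name, with stricter rules for
    categories that double as ingredients."""
    name = normalize(product_name)
    matches = []
    for category in category_names:
        normalized = normalize(category)
        if normalized in FULL_MATCH_ONLY:
            # Ingredient-like category: require the whole name to match.
            if name == normalized:
                matches.append(category)
        elif re.search(rf"\b{re.escape(normalized)}\b", name):
            matches.append(category)
    return matches


categories = ["Pineapple", "Yogurts"]
print(match_categories("Yogurts with pineapple", categories))
print(match_categories("Pineapple", categories))
```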

@raphael0202
Collaborator Author

raphael0202 commented Nov 22, 2022

We're not using Elasticsearch anymore for category prediction, closing this issue.

Repository owner moved this from Todo to Done in 🤖 Artificial Intelligence @ Open Food Facts Nov 22, 2022