Too many incorrect predictions with Elasticsearch category predictor #918
Comments
It seems OK to me to drop Elasticsearch.
That sounds really strange indeed: we should look for the category names inside the product name, not the reverse. A single regexp with all category names and synonyms should work, I think (with preprocessing applied first). This is the same approach as used in #890. In fact, we could have a separate module to match entries from taxonomies. It's what we now have on the Perl backend.
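To make the regexp idea concrete, here is a minimal Python sketch of matching category names inside a preprocessed product name. It is not the Perl backend code and not the approach from #890: the `CATEGORY_NAMES` excerpt, the `preprocess` steps and all function names are assumptions made up for illustration.

```python
import re
import unicodedata

# Hypothetical taxonomy excerpt: category ID -> names and synonyms.
CATEGORY_NAMES = {
    "en:chocolate-bars": ["chocolate bar", "chocolate bars"],
    "en:orange-juices": ["orange juice", "orange juices"],
    "en:juices": ["juice", "juices"],
}


def preprocess(text):
    """Lowercase, strip accents and collapse non-alphanumerics to single spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def build_pattern(category_names):
    """Compile one alternation over all names, longest first.

    Longest-first ordering makes "orange juice" win over the shorter "juice".
    """
    name_to_id = {}
    for category_id, names in category_names.items():
        for name in names:
            name_to_id[preprocess(name)] = category_id
    alternatives = sorted(name_to_id, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b"
    )
    return pattern, name_to_id


def match_categories(product_name, pattern, name_to_id):
    """Return the IDs of the categories whose name occurs in the product name."""
    cleaned = preprocess(product_name)
    return [name_to_id[m.group(0)] for m in pattern.finditer(cleaned)]


PATTERN, NAME_TO_ID = build_pattern(CATEGORY_NAMES)
```

Calling `match_categories("Organic Orange Juice 1L", PATTERN, NAME_TO_ID)` looks for whole category names in the product name, so a product name that is only a fragment of a category name does not match.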
Looking for the category name inside the product name may also return false positives, as some ingredients are also categories (pineapple, passion fruit, ...).
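A small Python sketch of that failure mode, under the assumption that category names are matched as whole words inside the lowercased product name. The category list and the product name are hypothetical.

```python
import re
from typing import List

# Hypothetical category names; real ones would come from the taxonomy.
CATEGORY_NAMES = ["pineapple", "passion fruit", "yogurt"]

PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in CATEGORY_NAMES) + r")\b"
)


def categories_in_name(product_name: str) -> List[str]:
    """Return every category name found as a whole word in the product name."""
    return PATTERN.findall(product_name.lower())


# "pineapple" is only a flavour here, yet it is reported as a category
# alongside the correct "yogurt" match.
matches = categories_in_name("Pineapple yogurt")
```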
We're not using Elasticsearch anymore for category prediction, closing this issue.
Of the last 25 Elasticsearch-predicted categories, only 11 are accurate. This is due to the matching algorithm, which allows over-matching.
Examples of incorrect matches:
For reference, here is how Elasticsearch category matching is performed:
category index. We currently support French, English, Spanish and German. The following processing is performed on all product names of supported languages:
match_phrase search, which means every word in the query must be found in the same order in the category name. The same processing is performed by Elasticsearch on the query before matching. Note that Elasticsearch keeps an empty position index when removing stop words. For example, barre/1 de/2 chocolat/3 becomes barre/1 chocolat/3 after stop word removal, and a query "barre chocolat" (barre/1 chocolat/2) wouldn't return the document. To prevent this, a custom version of Elasticsearch with a "gapless" plugin is used.
The issue is that the current matching strategy allows submatches (see examples above), which trigger too many false positives.
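The position-gap behaviour and the submatch problem can be illustrated with a small pure-Python simulation. This is not Elasticsearch code and does not use its API: `STOP_WORDS`, `analyze` and `phrase_match` are hypothetical stand-ins that only mimic position-based phrase matching as described above.

```python
# Illustrative subset of French stop words, not the real analyzer list.
STOP_WORDS = {"de", "au", "la", "le"}


def analyze(text, gapless):
    """Tokenize on whitespace, drop stop words and assign positions.

    With gapless=False a removed stop word still consumes a position.
    With gapless=True positions are renumbered consecutively.
    """
    tokens = []
    position = 0
    for word in text.lower().split():
        position += 1
        if word in STOP_WORDS:
            if gapless:
                position -= 1  # give the position back to the next token
            continue
        tokens.append((word, position))
    return tokens


def phrase_match(query_tokens, doc_tokens):
    """True if all query terms occur in the document at the same relative positions."""
    if not query_tokens:
        return False
    doc_positions = {}
    for word, pos in doc_tokens:
        doc_positions.setdefault(word, set()).add(pos)
    first_word, first_pos = query_tokens[0]
    for start in doc_positions.get(first_word, ()):
        offset = start - first_pos
        if all(
            pos + offset in doc_positions.get(word, ())
            for word, pos in query_tokens
        ):
            return True
    return False


# Position gaps: the category keeps the hole left by "de", so the query misses it.
with_gaps = phrase_match(
    analyze("barre chocolat", gapless=False),
    analyze("barre de chocolat", gapless=False),
)

# Gapless positions: the same query now matches.
without_gaps = phrase_match(
    analyze("barre chocolat", gapless=True),
    analyze("barre de chocolat", gapless=True),
)

# Submatch: a one-word product name matches a longer, more specific category.
submatch = phrase_match(
    analyze("chocolat", gapless=True),
    analyze("barre de chocolat", gapless=True),
)
```

In this simulation `with_gaps` is `False` and `without_gaps` is `True`, mirroring the barre/chocolat example. `submatch` is also `True`, which is the over-matching described here: every word of the query is present in the category name, even though the category is far more specific than the product name.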
Possible strategies:
I'm personally in favor of the 3rd option, as: