feat: update category matching algorithm (#952)
* feat: updating spacy dependency

We need a more recent version of spaCy for lemmatization.
Also add:
- spacy-lookups-data for lemmatizer lookup tables
- cachetools for TTLCache

* feat: add get_lemmatizing_nlp that returns pipeline with lemmatizer

* feat: add new category matching algorithm

* feat: switch from category ES matching to new matching algorithm

* feat: add matcher algorithm to /predict/category endpoint

* doc: document predict-category CLI command

* fix: improve category matching algorithm and APIs

after code review

* fix: fix mypy warning on OCR script

* fix: fix mkdocs building issue
raphael0202 authored Oct 18, 2022
1 parent f97cc8c commit d8a04c7
Showing 33 changed files with 1,229 additions and 1,744 deletions.
Binary file not shown.
Binary file not shown.
Binary file not shown.
100 changes: 93 additions & 7 deletions doc/references/api.yml
@@ -446,6 +446,17 @@ paths:
description: |
The score above which we consider the category to be detected
default: 0.5
predictors:
type: array
description:
List of predictors to use, possible values are `matcher`
(simple matching algorithm) and `neural` (neural network categorizer)
items:
type: string
enum:
- neural
- matcher
example: ["neural", "matcher"]
required:
- barcode
- type: object
@@ -456,17 +467,21 @@ paths:
product_name:
type: string
minLength: 1
example: Frozen dinner yeast rolls
example: roasted chicken
ingredients_tags:
type: array
items:
type: string
example:
- "en:fortified-wheat-flour"
- "en:cereal"
- "en:flour"
- "en:chicken"
- "en:salts"
required:
- product_name
lang:
type: string
minLength: 1
description: Language of the product name, required for matcher algorithm
example: en
deepest_only:
type: boolean
description: |
@@ -477,6 +492,17 @@ paths:
description: |
The score above which we consider the category to be detected
default: 0.5
predictors:
type: array
description:
List of predictors to use, possible values are `matcher`
(simple matching algorithm) and `neural` (neural network categorizer)
items:
type: string
enum:
- neural
- matcher
example: ["neural", "matcher"]
required:
- product
responses:
@@ -495,16 +521,76 @@ paths:
value_tag:
type: string
description: The predicted `value_tag`
example: en:breads
example: en:roast-chicken
confidence:
type: number
description: The confidence score of the model
example: 0.6
required:
- value_tag
- confidence
required:
- neural
matcher:
type: array
items:
type: object
properties:
value_tag:
type: string
description: The predicted `value_tag`
example: en:roast-chicken
debug:
type: object
description: Additional debug information
properties:
pattern:
type: string
description: The pattern that matched the product name
example: roast chicken
lang:
type: string
description: The language of the matched pattern
example: en
product_name:
type: string
description: The product name that matched the category name
example: roasted chicken
processed_product_name:
type: string
description: The product name after preprocessing
(stemming, stop word removal,...)
example: roast chicken
category_name:
type: string
description:
The (localized) category name that matched the
product name
example: Roast chicken
start_idx:
type: integer
description: The string match start position
example: 0
end_idx:
type: integer
description: The string match end position
example: 13
is_full_match:
type: boolean
description:
If true, the processed product name matched completely with
the processed category name
example: true
required:
- pattern
- lang
- product_name
- processed_product_name
- category_name
- start_idx
- end_idx
- is_full_match
required:
- value_tag
- debug

components:
schemas:
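
As a rough illustration of the schema changes above, the following Python sketch builds a request body that selects predictors and filters matcher predictions on `debug.is_full_match`. The helper names `build_category_request` and `full_match_tags` are hypothetical and not part of this commit. Placing `lang` and `predictors` next to `product` (rather than inside it) is an assumption, since the diff view does not preserve indentation. The sample data reuses the `example` values from the schema.

```python
from typing import Any, Dict, List, Optional

# Predictor names taken from the `enum` in the schema above.
ALLOWED_PREDICTORS = ("neural", "matcher")


def build_category_request(
    product_name: str,
    lang: str,
    predictors: List[str],
    ingredients_tags: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a body for the product-based request variant (hypothetical helper)."""
    if not product_name or not lang:
        # The schema sets `minLength: 1` on both fields.
        raise ValueError("product_name and lang must be non-empty")
    unknown = [p for p in predictors if p not in ALLOWED_PREDICTORS]
    if unknown:
        raise ValueError(f"unknown predictors: {unknown}")
    product: Dict[str, Any] = {"product_name": product_name}
    if ingredients_tags:
        product["ingredients_tags"] = ingredients_tags
    # Assumption: `lang` and `predictors` are siblings of `product`.
    return {"product": product, "lang": lang, "predictors": predictors}


def full_match_tags(matcher_predictions: List[Dict[str, Any]]) -> List[str]:
    """Return the value_tag of matcher items whose debug.is_full_match is true."""
    return [
        item["value_tag"]
        for item in matcher_predictions
        if item["debug"]["is_full_match"]
    ]


# One matcher item assembled from the `example` values in the schema.
SAMPLE_MATCHER = [
    {
        "value_tag": "en:roast-chicken",
        "debug": {
            "pattern": "roast chicken",
            "lang": "en",
            "product_name": "roasted chicken",
            "processed_product_name": "roast chicken",
            "category_name": "Roast chicken",
            "start_idx": 0,
            "end_idx": 13,
            "is_full_match": True,
        },
    }
]

if __name__ == "__main__":
    print(build_category_request("roasted chicken", "en", ["neural", "matcher"]))
    print(full_match_tags(SAMPLE_MATCHER))
```

Validating `predictors` on the client side mirrors the `enum` constraint, so a typo fails before the request is sent rather than surfacing as a server-side validation error.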