feat: Batch job - Spellcheck #1401
… processing operational
…ction with DuckDB and launch on GCP .
Rest to do:
There's no reason to configure the launch from an endpoint, so we put it in the CLI instead, for manual launch.
credentials/.gitkeep
Not needed anymore (as we're using an envvar)
.gitignore
@@ -43,3 +43,5 @@ site/
gh_pages/
doc/README.md
doc/references/cli.md

credentials
Not needed anymore (as we're using an envvar)
docker-compose.yml
@@ -4,6 +4,7 @@ x-robotoff-base-volumes:
- ./cache:/opt/robotoff/cache
- ./datasets:/opt/robotoff/datasets
- ./models:/opt/robotoff/models
- ./credentials:/opt/credentials
Not needed anymore (as we're using an envvar)
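For readers following the credentials change, here is a minimal sketch of resolving the credentials location from an environment variable instead of a mounted `credentials/` directory. The PR does not say which variable is used; `GOOGLE_APPLICATION_CREDENTIALS` (the standard variable read by Google's Application Default Credentials) is an assumption here, and `credentials_path` is a hypothetical helper, not Robotoff code.

```python
import os
from typing import Mapping


def credentials_path(
    env: Mapping[str, str] = os.environ,
    # Assumption: the variable name is not stated in the PR.
    var_name: str = "GOOGLE_APPLICATION_CREDENTIALS",
) -> str:
    """Return the credentials file path stored in the environment variable."""
    path = env.get(var_name, "")
    if not path:
        # Fail early with a clear message rather than at the first GCP call.
        raise RuntimeError(f"{var_name} is not set")
    return path
```

With this approach nothing has to be mounted or git-ignored; the variable can be passed to the container through the usual environment configuration.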
robotoff/insights/importer.py
predictions: list[Prediction],
product_id: ProductIdentifier,
) -> Iterator[ProductInsight]:
# Only one prediction
Possibly more than 1, no?
There's only one correction, no?
Or maybe I didn't understand what it means
There can be one correction per language, so more than 1 correction for each product
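To illustrate the point about one correction per language, here is a minimal sketch of a generator that yields one candidate per prediction rather than assuming a single one. The `Prediction` and `Insight` dataclasses and `generate_candidates` below are simplified, hypothetical stand-ins, not Robotoff's actual `Prediction`, `ProductInsight`, or importer method.

```python
from dataclasses import dataclass
from typing import Iterator


# Hypothetical, simplified stand-ins for illustration only.
@dataclass
class Prediction:
    barcode: str
    lang: str  # language of the ingredient list being corrected
    correction: str


@dataclass
class Insight:
    barcode: str
    lang: str
    correction: str


def generate_candidates(predictions: list[Prediction]) -> Iterator[Insight]:
    # A product can carry one correction per language, so iterate over
    # every prediction instead of reading only the first one.
    for prediction in predictions:
        yield Insight(prediction.barcode, prediction.lang, prediction.correction)


predictions = [
    Prediction("123", "fr", "farine de blé"),
    Prediction("123", "en", "wheat flour"),
]
insights = list(generate_candidates(predictions))
```

Here a single product with French and English ingredient lists produces two candidates, one per language.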
Simplify abstractions - Change data in insights instead of value - Other small changes
Black fails on
Quick note about error handling:
Enhance batch extraction with popularity_key - Add env variables to batch job - Add make deploy Spellcheck job to Artifact registry
I implemented the following changes:
We concluded that PREDICTOR-VERSION will be used to track batch jobs and allow predictions on new data to be imported. In the future, we'll find another way to detect already processed data, for example during the extraction stage, before the batch job runs.
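As a rough sketch of the version-tracking idea, and only under the assumption that a prediction is imported when its predictor version differs from the one already stored for that product: `select_new`, the `stored` mapping, and the version strings are hypothetical and do not reflect Robotoff's actual import logic.

```python
def select_new(
    stored: dict[str, str],
    incoming: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Keep the (barcode, predictor_version) pairs that should be imported.

    `stored` maps a barcode to the predictor version of its last imported
    prediction (hypothetical structure, for illustration only).
    """
    return [
        (barcode, version)
        for barcode, version in incoming
        # Import when the product is unknown or was processed by another
        # batch job version.
        if stored.get(barcode) != version
    ]


stored = {"123": "batch-v1"}
incoming = [("123", "batch-v1"), ("123", "batch-v2"), ("456", "batch-v1")]
to_import = select_new(stored, incoming)
```

Under this assumption, re-running the same batch version leaves existing predictions untouched, while a new version or a new product gets imported.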
Deepsource generates a lot of false positives. To be honest, I've thought about disabling it.
Improvement ideas:
Good job! 🎉
Objective of the PR:
A new model based on Mistral-7B LLM has been developed to tackle the OCR-parsed ingredients spellcheck problem (see #spellcheck). While this solution doesn't fit Open Food Facts' production infrastructure, we've created a GPU-powered batch inference pipeline on GCP to process thousands of products.
See demo: Ingredient Spellcheck
What
Part of