
feat: Batch job - Spellcheck #1401

Closed

Conversation

jeremyarancio
Contributor

@jeremyarancio jeremyarancio commented Aug 21, 2024

Objective of the PR:

A new model based on Mistral-7B LLM has been developed to tackle the OCR-parsed ingredients spellcheck problem (see #spellcheck). While this solution doesn't fit Open Food Facts' production infrastructure, we've created a GPU-powered batch inference pipeline on GCP to process thousands of products.

See demo: Ingredient Spellcheck

What

  • Set up Batch Job with GCP for Robotoff Tasks.
  • We start with Ingredients-Spellcheck

Part of

  • Spellcheck work (link)
  • Discussion Robotoff (link)

@github-actions github-actions bot added the utils label Aug 22, 2024
@jeremyarancio jeremyarancio changed the title Batch job - Spellcheck feat: Batch job - Spellcheck Aug 27, 2024
@jeremyarancio jeremyarancio marked this pull request as ready for review August 27, 2024 16:47
@jeremyarancio jeremyarancio requested a review from a team as a code owner August 27, 2024 16:48
@jeremyarancio
Contributor Author

Remaining to do:

  • Transfer the batch launch from the API endpoint to the CLI
  • Set up an API security key for the batch data import, triggered by the GCP batch job once it finishes
  • Improve InsightsImport to cover several use cases (see the Robotoff doc)

There's no reason to configure the launch from an endpoint, so we trigger it from the CLI instead of a manual launch.
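A CLI trigger could look roughly like this (a sketch using stdlib `argparse` for illustration; the `launch_batch_job` helper and the job-type name are assumptions, not Robotoff's actual code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI argument parser for manual batch job launches."""
    parser = argparse.ArgumentParser(description="Launch a Robotoff batch job")
    parser.add_argument("job_type", help="e.g. ingredients-spellcheck")
    return parser


def launch_batch_job(job_type: str) -> str:
    """Dispatch a batch job launch by type (actual GCP submission logic lives elsewhere)."""
    known_jobs = {"ingredients-spellcheck"}
    if job_type not in known_jobs:
        raise ValueError(f"unknown job type: {job_type}")
    return f"launched {job_type}"


# Usage: launch_batch_job(build_parser().parse_args().job_type)
```

Wiring this into a CLI command keeps the launch auditable and avoids exposing an unauthenticated HTTP trigger.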
Collaborator


Not needed anymore (as we're using an envvar)

.gitignore Outdated
@@ -43,3 +43,5 @@ site/
gh_pages/
doc/README.md
doc/references/cli.md

credentials
Collaborator


Not needed anymore (as we're using an envvar)

@@ -4,6 +4,7 @@ x-robotoff-base-volumes:
- ./cache:/opt/robotoff/cache
- ./datasets:/opt/robotoff/datasets
- ./models:/opt/robotoff/models
- ./credentials:/opt/credentials
Collaborator


Not needed anymore (as we're using an envvar)

Resolved review comments (outdated): robotoff/app/api.py, robotoff/batch/extraction.py
predictions: list[Prediction],
product_id: ProductIdentifier,
) -> Iterator[ProductInsight]:
# Only one prediction
Collaborator


Possibly more than 1, no?

Contributor Author


There's only one correction, no?
Or maybe I didn't understand what it means

Collaborator


There can be one correction per language, so there can be more than one correction per product.
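To illustrate the conclusion above, a minimal sketch of a generator that yields one insight per language correction (the `Prediction` and `ProductInsight` classes here are simplified stand-ins, not Robotoff's real models):

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Prediction:
    """Simplified stand-in for Robotoff's Prediction."""
    lang: str
    correction: str


@dataclass
class ProductInsight:
    """Simplified stand-in for Robotoff's ProductInsight."""
    lang: str
    data: dict


def generate_insights(predictions: list[Prediction]) -> Iterator[ProductInsight]:
    # One correction per language, so potentially several insights per product.
    for prediction in predictions:
        yield ProductInsight(lang=prediction.lang, data={"correction": prediction.correction})
```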

Simplify abstractions - Change data in insights instead of value - Other small changes
@jeremyarancio
Contributor Author

Black fails on os.getenv("BATCH_JOB_KEY"), which is expected since the key is not yet added to the GitHub secrets.
This job key is used to secure the import_batch endpoint; only the batch job Docker image owns it.
We will have to add the key to the GitHub secrets for the API, and also during the deployment of the image to GCP (CI/CD to make things easier?)

@jeremyarancio
Contributor Author

Quick note about error handling:
Falcon doesn't seem to handle custom exception classes (such as APITokenError(Exception)) and returns status 500 by default instead.
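For reference, Falcon does let you register handlers for custom exception classes via `App.add_error_handler`; the mapping itself could be as small as this framework-free sketch (`APITokenError` and the chosen status codes are assumptions for illustration):

```python
class APITokenError(Exception):
    """Raised when the batch job API token is missing or invalid."""


def exception_to_status(exc: Exception) -> int:
    """Map application exceptions to HTTP status codes.

    Without an explicit mapping registered on the framework,
    unhandled exceptions fall back to a generic 500.
    """
    if isinstance(exc, APITokenError):
        return 403
    return 500
```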

Enhance batch extraction with popularity_key - Add env variables to batch job - Add make deploy Spellcheck job to Artifact registry
@jeremyarancio
Contributor Author

I implemented the following changes:

  • Added make deploy-spellcheck to the Makefile
  • Modified the query to extract the batch based on popularity_key
  • Dug into Google credentials to remove the need for the credentials.json folder. We'll use a service account instead (it still needs to be generated for production with the correct roles: incoming)
  • Added an environment key to the batch job
  • Updated the Importer (there's still a bug to solve)
  • The overall process is tested, and everything looks fine (10,000 products processed in 20 minutes)
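The popularity-based extraction mentioned above could be expressed as a query along these lines (table and column names are hypothetical, for illustration only — not the actual Robotoff query):

```python
# Hypothetical extraction query: take the most popular products first,
# so the batch budget is spent on items users actually see.
EXTRACTION_QUERY = """
SELECT code, ingredients_text
FROM products
WHERE ingredients_text IS NOT NULL
ORDER BY popularity_key DESC
LIMIT %(batch_size)s
"""


def build_query_params(batch_size: int) -> dict:
    """Bind parameters for the extraction query."""
    return {"batch_size": batch_size}
```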

@jeremyarancio
Contributor Author

I cannot explain why the test fails with DeepSource here but not in other identical cases.


We concluded that PREDICTOR-VERSION will be used to track batch jobs and allow new data predictions to be imported. In the future, we'll find another way to detect already-processed data, for example before the batch job, during the extraction stage.
@raphael0202
Collaborator

DeepSource generates a lot of false positives. I've thought about disabling it, to be honest.

@jeremyarancio
Contributor Author

Improvement ideas:

  • Generate a unique ID for each piece of data processed in the bucket (reason: enables debugging)
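A sketch of the unique-ID idea, assuming a product barcode is available for each item (`make_batch_item_id` is a hypothetical helper):

```python
import uuid


def make_batch_item_id(barcode: str) -> str:
    """Attach a random unique suffix to each processed item so individual
    rows in the bucket can be traced back during debugging."""
    return f"{barcode}-{uuid.uuid4()}"
```

A UUID4 suffix keeps IDs collision-free across batch runs without any coordination between workers.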

@raphael0202
Collaborator

Good job! 🎉

@raphael0202 raphael0202 self-requested a review September 10, 2024 13:49