
feat: Batch job - Spellcheck #1401

Closed

Conversation

jeremyarancio
Contributor

@jeremyarancio jeremyarancio commented Aug 21, 2024

Objective of the PR:

A new model based on Mistral-7B LLM has been developed to tackle the OCR-parsed ingredients spellcheck problem (see #spellcheck). While this solution doesn't fit Open Food Facts' production infrastructure, we've created a GPU-powered batch inference pipeline on GCP to process thousands of products.

See demo: Ingredient Spellcheck

What

  • Set up Batch Job with GCP for Robotoff Tasks.
  • We start with Ingredients-Spellcheck

Part of

  • Spellcheck work (link)
  • Discussion Robotoff (link)

@github-actions github-actions bot added the utils label Aug 22, 2024
@jeremyarancio jeremyarancio changed the title Batch job - Spellcheck feat: Batch job - Spellcheck Aug 27, 2024
@jeremyarancio jeremyarancio marked this pull request as ready for review August 27, 2024 16:47
@jeremyarancio jeremyarancio requested a review from a team as a code owner August 27, 2024 16:48
@jeremyarancio
Contributor Author

Remaining to do:

  • Transfer the batch launch from the API endpoint to the CLI
  • Set up an API security key for the batch data import, triggered by the GCP batch job once it finishes
  • Improve InsightsImport to cover several use cases (see the Robotoff doc)

There's no reason to configure the launch from an endpoint, so we trigger it from the CLI instead of a manual launch.
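A CLI trigger could look roughly like this (a sketch using stdlib `argparse` for illustration; the `launch_batch_job` helper and the job-type name are assumptions, not Robotoff's actual code):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI argument parser for manual batch job launches."""
    parser = argparse.ArgumentParser(description="Launch a Robotoff batch job")
    parser.add_argument("job_type", help="e.g. ingredients-spellcheck")
    return parser


def launch_batch_job(job_type: str) -> str:
    """Dispatch a batch job launch by type (actual GCP submission logic lives elsewhere)."""
    known_jobs = {"ingredients-spellcheck"}
    if job_type not in known_jobs:
        raise ValueError(f"unknown job type: {job_type}")
    return f"launched {job_type}"


# Usage: launch_batch_job(build_parser().parse_args().job_type)
```

Wiring this into a CLI command keeps the launch auditable and avoids exposing an unauthenticated HTTP trigger.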
Collaborator


Not needed anymore (as we're using an envvar)

.gitignore Outdated
@@ -43,3 +43,5 @@ site/
gh_pages/
doc/README.md
doc/references/cli.md

credentials
Collaborator


Not needed anymore (as we're using an envvar)

@@ -4,6 +4,7 @@ x-robotoff-base-volumes:
- ./cache:/opt/robotoff/cache
- ./datasets:/opt/robotoff/datasets
- ./models:/opt/robotoff/models
- ./credentials:/opt/credentials
Collaborator


Not needed anymore (as we're using an envvar)

Resolved review comments (outdated): robotoff/app/api.py, robotoff/batch/extraction.py
predictions: list[Prediction],
product_id: ProductIdentifier,
) -> Iterator[ProductInsight]:
# Only one prediction
Collaborator


Possibly more than 1, no?

Contributor Author


There's only one correction, no?
Or maybe I didn't understand what it means

Collaborator


There can be one correction per language, so there can be more than one correction per product.
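To illustrate the conclusion above, a minimal sketch of a generator that yields one insight per language correction (the `Prediction` and `ProductInsight` classes here are simplified stand-ins, not Robotoff's real models):

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Prediction:
    """Simplified stand-in for Robotoff's Prediction."""
    lang: str
    correction: str


@dataclass
class ProductInsight:
    """Simplified stand-in for Robotoff's ProductInsight."""
    lang: str
    data: dict


def generate_insights(predictions: list[Prediction]) -> Iterator[ProductInsight]:
    # One correction per language, so potentially several insights per product.
    for prediction in predictions:
        yield ProductInsight(lang=prediction.lang, data={"correction": prediction.correction})
```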

Simplify abstractions - Change data in insights instead of value - Other small changes
@jeremyarancio
Contributor Author

Black fails on os.getenv("BATCH_JOB_KEY"), which is expected since the key is not yet added to the GitHub secrets.
This job key is used to secure the import_batch endpoint; only the batch job Docker image owns it.
We will have to add the key to the GitHub secrets for the API, and also during the deployment of the image to GCP (CI/CD to make things easier?)

@jeremyarancio
Contributor Author

Quick note about error handling:
Falcon doesn't seem to handle custom exception classes (such as APITokenError(Exception)) and returns status 500 by default instead.
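For reference, Falcon does let you register handlers for custom exception classes via `App.add_error_handler`; the mapping itself could be as small as this framework-free sketch (`APITokenError` and the chosen status codes are assumptions for illustration):

```python
class APITokenError(Exception):
    """Raised when the batch job API token is missing or invalid."""


def exception_to_status(exc: Exception) -> int:
    """Map application exceptions to HTTP status codes.

    Without an explicit mapping registered on the framework,
    unhandled exceptions fall back to a generic 500.
    """
    if isinstance(exc, APITokenError):
        return 403
    return 500
```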

Enhance batch extraction with popularity_key - Add env variables to batch job - Add make deploy Spellcheck job to Artifact registry
@jeremyarancio
Contributor Author

I implemented the following changes:

  • Added make deploy-spellcheck to the Makefile
  • Modified the query to extract the batch based on popularity_key
  • Dug into Google credentials to remove the need for the credentials.json folder. We'll use a service account instead (it still needs to be generated for production with the correct roles: incoming)
  • Added an environment key to the batch job
  • Updated the Importer (there's still a bug to solve)
  • The overall process is tested, and everything looks fine (10,000 products processed in 20 minutes)
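The popularity-based extraction mentioned above could be expressed as a query along these lines (table and column names are hypothetical, for illustration only — not the actual Robotoff query):

```python
# Hypothetical extraction query: take the most popular products first,
# so the batch budget is spent on items users actually see.
EXTRACTION_QUERY = """
SELECT code, ingredients_text
FROM products
WHERE ingredients_text IS NOT NULL
ORDER BY popularity_key DESC
LIMIT %(batch_size)s
"""


def build_query_params(batch_size: int) -> dict:
    """Bind parameters for the extraction query."""
    return {"batch_size": batch_size}
```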

@jeremyarancio
Contributor Author

I cannot explain why the test fails with DeepSource here but not in other identical cases.


We concluded that PREDICTOR-VERSION will be used to track batch jobs and allow new data predictions to be imported. In the future, we'll find another way to detect already-processed data, for example before the batch job, during the extraction stage.
@raphael0202
Collaborator

DeepSource generates a lot of false positives. I've thought about disabling it, to be honest.

@jeremyarancio
Contributor Author

Improvement ideas:

  • Generate a unique ID for each piece of data processed in the bucket (reason: enables debugging)
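A sketch of the unique-ID idea, assuming a product barcode is available for each item (`make_batch_item_id` is a hypothetical helper):

```python
import uuid


def make_batch_item_id(barcode: str) -> str:
    """Attach a random unique suffix to each processed item so individual
    rows in the bucket can be traced back during debugging."""
    return f"{barcode}-{uuid.uuid4()}"
```

A UUID4 suffix keeps IDs collision-free across batch runs without any coordination between workers.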

@raphael0202
Collaborator

Good job! 🎉

@raphael0202 raphael0202 self-requested a review September 10, 2024 13:49