feat: run OCR extraction on every new image #543

Merged on Nov 4, 2024 (2 commits)

Conversation

@raphael0202 (Contributor) commented on Oct 30, 2024

Fixes #320

@github-actions bot added the GitHub Actions label (Pull requests that update GitHub Actions code) on Oct 30, 2024
@raphael0202 merged commit 77ed50b into main on Nov 4, 2024
10 checks passed
@raphael0202 deleted the run-ocr-extraction branch on November 4, 2024 12:59
data["created_at"] = int(time.time())

with gzip.open(ocr_json_path, "wt") as f:
    f.write(json.dumps(data))
Member

So it stores the result in a jsonl.gz file next to the image?

Member

If we re-run on the same image, will it override?

@raphael0202 (Contributor, Author) commented on Nov 4, 2024

> So it stores the result in a jsonl.gz file next to the image?

Yes exactly!

Contributor Author

> If we re-run on the same image, will it override?

It depends on the value of `override`.
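A minimal sketch of the behaviour discussed in this review thread. Only the `created_at` stamp, the `gzip.open(..., "wt")` write and the existence of an `override` value come from the PR; the helper name `save_ocr_result`, its signature, and the file naming rule (`photo.jpg` becomes `photo.json.gz`) are assumptions made for illustration and may differ from the merged code.

```python
import gzip
import json
import time
from pathlib import Path
from typing import Optional


def save_ocr_result(image_path: Path, data: dict, override: bool = False) -> Optional[Path]:
    """Write `data` as gzipped JSON next to `image_path` (hypothetical helper).

    Returns the path written, or None when a result already exists
    and `override` is False.
    """
    # Assumed naming rule: same directory and stem as the image.
    ocr_json_path = image_path.with_suffix(".json.gz")

    # Skip the write when a previous result exists and override is off.
    if ocr_json_path.exists() and not override:
        return None

    # Stamp the payload without mutating the caller's dict.
    payload = {**data, "created_at": int(time.time())}

    with gzip.open(ocr_json_path, "wt") as f:
        f.write(json.dumps(payload))
    return ocr_json_path
```

With this shape, a second run on the same image is a no-op unless the caller passes `override=True`, in which case the earlier file is replaced.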

Member

Ah yes, good catch!

@albertaillet (Collaborator) commented

In the script openfoodfacts-server/blob/main/scripts/run_ocr.py, you also request these features:

"features": [
    {"type": "TEXT_DETECTION"},
    {"type": "LOGO_DETECTION"},
    {"type": "LABEL_DETECTION"},
    {"type": "SAFE_SEARCH_DETECTION"},
    {"type": "FACE_DETECTION"},
],

So that script requests more than just TEXT_DETECTION. Would this be of interest here as well?
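For context, a rough sketch of how such a feature list fits into a Google Cloud Vision `images:annotate` REST request body. The function name `build_annotate_payload` is hypothetical, authentication and the HTTP call itself are left out, and this thread does not show whether the PR builds its request this way.

```python
import base64

# REST endpoint for batch image annotation in Google Cloud Vision.
VISION_ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"

# The feature types quoted from run_ocr.py in the comment above.
FEATURE_TYPES = [
    "TEXT_DETECTION",
    "LOGO_DETECTION",
    "LABEL_DETECTION",
    "SAFE_SEARCH_DETECTION",
    "FACE_DETECTION",
]


def build_annotate_payload(image_bytes: bytes, feature_types=FEATURE_TYPES) -> dict:
    """Build the JSON body for one image (hypothetical helper)."""
    return {
        "requests": [
            {
                # The API expects the raw image bytes base64-encoded.
                "image": {"content": base64.b64encode(image_bytes).decode("utf-8")},
                "features": [{"type": feature_type} for feature_type in feature_types],
            }
        ]
    }
```

Passing `["TEXT_DETECTION"]` as `feature_types` would restrict the request to text only; each extra feature enlarges the stored JSON response.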

@raphodn (Member) commented on Nov 6, 2024

TODO: avoid running OCR extraction in the test runs, and delete the test image?
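One possible way to approach this TODO, sketched with the standard library only. `OcrClient` and `handle_upload` are hypothetical stand-ins, not names from this repository: the OCR call is replaced with a mock during the test, and the test image lives in a temporary directory that is removed afterwards.

```python
import tempfile
from pathlib import Path
from unittest import mock


class OcrClient:
    """Hypothetical stand-in for whatever calls the external OCR service."""

    def extract(self, image_path: Path) -> None:
        raise RuntimeError("would call the external OCR API")


def handle_upload(image_bytes: bytes, directory: str, ocr_client: OcrClient) -> Path:
    """Hypothetical upload handler: store the image, then trigger OCR."""
    image_path = Path(directory) / "proof.jpg"
    image_path.write_bytes(image_bytes)
    ocr_client.extract(image_path)
    return image_path


def run_upload_test():
    # The temporary directory, and the test image inside it, is deleted on exit.
    with tempfile.TemporaryDirectory() as tmp:
        # Replace the OCR call with a mock so no external request is made.
        with mock.patch.object(OcrClient, "extract", return_value=None) as fake_extract:
            image_path = handle_upload(b"fake image bytes", tmp, OcrClient())
            existed_during_test = image_path.exists()
            ocr_calls = fake_extract.call_count
    return ocr_calls, existed_during_test, image_path.exists()
```

The mock still records that OCR would have been triggered once, so the test can assert on the call without paying for it.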

@@ -75,6 +77,10 @@ def upload(self, request: Request) -> Response:
                status=status.HTTP_400_BAD_REQUEST,
            )
        file_path, mimetype, image_thumb_path = store_file(request.data.get("file"))
        async_task(
@raphodn (Member) commented on Nov 6, 2024

I think I'm going to refactor this bit and move it to the post_save signal instead, similar to what is done for locations (OSM) and products (OFF).

Member

done in #549

Labels: GitHub Actions (Pull requests that update GitHub Actions code), OCR
Projects: Status: Done
Development: Successfully merging this pull request may close this issue: Proof: send to Google Cloud Vision, and store the matching JSON output
3 participants