feat: Extract USDA packager codes with REGEX and flashtext #897

Merged
GabrielBeFr merged 19 commits into master from USDA_extraction
Sep 20, 2022

Conversation

GabrielBeFr
Contributor

What

  • This PR resolves the issue "Add support to flashtext USDA packager codes" (#466), allowing Robotoff to detect USDA packager codes.
  • I added a data file containing all the USDA codes that exist according to the United States Department of Agriculture. packager_code.py loads it into a flashtext keyword processor so that candidate USDA codes detected by regex can be matched against the existing codes.
  • I also added a few functions to packager_code.py; they are used by the process_func associated with the USDA codes in the regex dict.
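
The flow described in this list can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code merged in this PR: a plain frozenset stands in for the flashtext keyword processor, and the names KNOWN_CODES, EST_RE and known_est_codes, the candidate pattern, and the single sample code are all assumptions made for the example.

```python
import re
from typing import FrozenSet, List

# Hypothetical stand-in for the data file of official USDA codes.
KNOWN_CODES: FrozenSet[str] = frozenset({"EST. 31778"})

# Hypothetical candidate pattern: "EST", an optional dot, whitespace, then digits.
EST_RE = re.compile(r"EST\.?\s*(\d{1,5})")


def known_est_codes(text: str, known: FrozenSet[str] = KNOWN_CODES) -> List[str]:
    """Normalize each regex candidate and keep it only if it is a known code."""
    results = []
    for match in EST_RE.finditer(text):
        # Canonical spelling used for the lookup (assumed format).
        normalized = f"EST. {match.group(1)}"
        if normalized in known:
            results.append(normalized)
    return results
```

With this sample set, known_est_codes("EST \n 31778") yields ["EST. 31778"], while a candidate that looks plausible but is not in the set, such as "EST \n 9999", is dropped.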

Part of

@codecov

codecov bot commented Sep 14, 2022

Codecov Report

Merging #897 (cdab35f) into master (1e63ca2) will increase coverage by 8.06%.
The diff coverage is 73.99%.

@@            Coverage Diff             @@
##           master     #897      +/-   ##
==========================================
+ Coverage   44.73%   52.80%   +8.06%     
==========================================
  Files          96       92       -4     
  Lines        6981     7034      +53     
==========================================
+ Hits         3123     3714     +591     
+ Misses       3858     3320     -538     
Impacted Files                              Coverage Δ
robotoff/cli/batch.py                       0.00% <ø> (ø)
robotoff/cli/insights.py                    0.00% <0.00%> (ø)
robotoff/insights/dataclass.py              100.00% <ø> (ø)
robotoff/prediction/ocr/brand.py            68.62% <0.00%> (ø)
robotoff/prediction/ocr/expiration_date.py  25.71% <ø> (ø)
robotoff/prediction/ocr/label.py            72.30% <0.00%> (ø)
robotoff/prediction/ocr/product_weight.py   49.10% <0.00%> (+1.28%) ⬆️
robotoff/products.py                        42.75% <ø> (+2.26%) ⬆️
robotoff/workers/listener.py                0.00% <0.00%> (ø)
robotoff/metrics.py                         23.61% <10.00%> (-2.20%) ⬇️
... and 62 more

Member

@alexgarel left a comment

Good work, but there is still an edge case which is not handled correctly.

"eu_fr": OCRRegex(
re.compile(
r"fr (\d{2,3}|2[ab])[\-\s.](\d{3})[\-\s.](\d{3}) (ce|ec)(?![a-z0-9])"
def process_USDA_match_to_flashtext(match) -> Optional[Any]:
Member

It's a bit lazy to use Any here ;-)
Isn't it Optional[str]?

Contributor Author

OK, I wasn't sure. I must admit that I am not really used to types. Thanks!

robotoff/prediction/ocr/packager_code.py
return generate_keyword_processor(codes)


def extract_USDA_code(processor: KeywordProcessor, text: str) -> Optional[Any]:
Member

Given a string, return the USDA code it contains (found thanks to the processor), or None.
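
A function with that contract and an Optional[str] return type might look like the sketch below. It is only an illustration under stated assumptions: the name extract_known_code is hypothetical, and a naive substring check over an iterable of known codes replaces the flashtext processor, so it does not respect word boundaries.

```python
import re
from typing import Iterable, Optional


def extract_known_code(known_codes: Iterable[str], text: str) -> Optional[str]:
    """Return a known code found in text (first in sorted order), or None."""
    # Collapse runs of whitespace so OCR line breaks do not hide a code.
    cleaned = re.sub(r"\s+", " ", text)
    for code in sorted(known_codes):
        if code in cleaned:
            return code
    return None
```

Because the annotation is Optional[str] rather than Optional[Any], a type checker can remind callers to handle the None case before treating the result as a string.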

@@ -15,6 +15,9 @@
("FR-AB0-123", []),
("fr-098-123", []),
("Gluten code is FR-234-234 ", ["FR-234-234"]),
("EST \n 31778", ["EST. 31778"]),
("EST \n 9999", []),
("M31779+ P31779+ \tV31779", ["M31779 + P31779 + V31779"]),
Member

We should not have this; instead we should have:

Suggested change
- ("M31779+ P31779+ \tV31779", ["M31779 + P31779 + V31779"]),
+ ("M31779+ P31779+ \tV31779", ["M31779", "P31779", "V31779"]),

For this test to pass you will need to change your code.

For more information, see PackagerCodeAnnotator and how emb_code is a list of codes.
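
One way to get from a single combined match to a list of codes is to split the matched text on the "+" separators. The sketch below is an assumption about how that could be done, not the change that was eventually committed; split_combined_codes is a hypothetical helper name.

```python
import re
from typing import List


def split_combined_codes(matched_text: str) -> List[str]:
    """Split a string such as "M31779+ P31779+ V31779" into individual codes."""
    # The separator is a "+" with any amount of surrounding whitespace.
    parts = re.split(r"\s*\+\s*", matched_text.strip())
    # Drop empty pieces left by a leading or trailing "+".
    return [part for part in parts if part]
```

Applied to the test input above, split_combined_codes("M31779+ P31779+ \tV31779") gives ["M31779", "P31779", "V31779"], which is the shape the suggested test expects.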

robotoff/prediction/ocr/packager_code.py
),
# To match the USDA like "V34626" or "M34614 + P34614 + V34614"
OCRRegex(
re.compile(r"[A-Z]\d{1,5}[A-Z]?(\s*\+\s*[A-Z]\d{1,5}[A-Z]?){0,3}"),
Member

Do we really need to capture the whole " + "-separated group in one go, rather than just each code? I think that would simplify the task.

Contributor Author

Actually, I did this for the same reason I did what you saw as a mistake two comments above. I thought that some products had codes which were explicitly "M34614 + P34614 + V34614". That's what is written in the data @teolemon put in "Add support for the USDA packager codes" #3704, available as an XLSX.
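
To make the two options in this thread concrete, the sketch below compares the combined pattern quoted in the diff with a simpler per-code pattern. SINGLE_RE is an assumption derived by dropping the repeated group; it is not necessarily the pattern that was merged.

```python
import re

# Pattern quoted in the diff: one match spans a whole "A + B + C" group.
COMBINED_RE = re.compile(r"[A-Z]\d{1,5}[A-Z]?(\s*\+\s*[A-Z]\d{1,5}[A-Z]?){0,3}")
# Simpler alternative (assumed): each code is its own match.
SINGLE_RE = re.compile(r"[A-Z]\d{1,5}[A-Z]?")

text = "M34614 + P34614 + V34614"

combined = [m.group(0) for m in COMBINED_RE.finditer(text)]
single = SINGLE_RE.findall(text)

print(combined)  # ['M34614 + P34614 + V34614']
print(single)    # ['M34614', 'P34614', 'V34614']
```

With the per-code pattern each match can be checked against the known codes on its own, so no splitting step is needed afterwards.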

robotoff/prediction/ocr/packager_code.py
Member

@alexgarel left a comment

Thanks! LGTM.

@alexgarel
Member

@GabrielBeFr you have a flake8 issue to resolve before we merge.

@alexgarel
Member

@teolemon and @raphael0202 I have one question: for this kind of packager code matching, which is quite simple, don't we want to limit it to products with English text, or even to products in the US (or North America)?

@sonarcloud

sonarcloud bot commented Sep 19, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 6 (rating E)
Code Smells: 122 (rating A)

Coverage: 0.0%
Duplication: 0.1%

@GabrielBeFr GabrielBeFr merged commit 6923536 into master Sep 20, 2022
@GabrielBeFr GabrielBeFr deleted the USDA_extraction branch September 20, 2022 12:16
@alexgarel
Member

🎉 @GabrielBeFr :-)
