-
-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Extract USDA packager codes with REGEX and flashtext #897
Conversation
…botoff into USDA_extraction
Codecov Report
@@ Coverage Diff @@
## master #897 +/- ##
==========================================
+ Coverage 44.73% 52.80% +8.06%
==========================================
Files 96 92 -4
Lines 6981 7034 +53
==========================================
+ Hits 3123 3714 +591
+ Misses 3858 3320 -538
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work, still there is a edge case which is not handled correctly.
"eu_fr": OCRRegex( | ||
re.compile( | ||
r"fr (\d{2,3}|2[ab])[\-\s.](\d{3})[\-\s.](\d{3}) (ce|ec)(?![a-z0-9])" | ||
def process_USDA_match_to_flashtext(match) -> Optional[Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit lazy to use Any here ;-)
Isn't it Optional[str] ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I wasn't sure. I must admit that I am not really used to types. Thanks !
return generate_keyword_processor(codes) | ||
|
||
|
||
def extract_USDA_code(processor: KeywordProcessor, text: str) -> Optional[Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given a string return the USDA code it contains (thanks to the processor) or None
@@ -15,6 +15,9 @@ | |||
("FR-AB0-123", []), | |||
("fr-098-123", []), | |||
("Gluten code is FR-234-234 ", ["FR-234-234"]), | |||
("EST \n 31778", ["EST. 31778"]), | |||
("EST \n 9999", []), | |||
("M31779+ P31779+ \tV31779", ["M31779 + P31779 + V31779"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not have this, instead we should have:
("M31779+ P31779+ \tV31779", ["M31779 + P31779 + V31779"]), | |
("M31779+ P31779+ \tV31779", ["M31779", "P31779", "V31779"]), |
For this test to pass you will need to change your code.
For more information, see PackagerCodeAnnotator and how emb_code is a list of codes.
), | ||
# To match the USDA like "V34626" or "M34614 + P34614 + V34614" | ||
OCRRegex( | ||
re.compile(r"[A-Z]\d{1,5}[A-Z]?(\s*\+\s*[A-Z]\d{1,5}[A-Z]?){0,3}"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to capture the " + " in one go, and not just each code ? This would simplify the task I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I did this for the same reason I did what you saw as a mistake two comments upwards. I thought that some products had codes which were explicitly "M34614 + P34614 + V34614". That's what written in the data @teolemon put in the Add support for the USDA packager codes #3704, Available as an XLSX.
…botoff into USDA_extraction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks ! LGTM.
@GabrielBeFr you have a flake8 issue to resolve before we merge. |
@teolemon and @raphael0202 I have one question: on this kind of matching on a packager code, which is quite simple, don't we want to limit to either product with english text, or even products in the US ? (or North America) |
Kudos, SonarCloud Quality Gate passed! 0 Bugs |
🎉 @GabrielBeFr :-) |
What
Part of