fix: improve parsing of 'category (type 1, type 2..)' ingredients #10999

Merged
stephanegigandet merged 24 commits into main from category-types-ingredients
Dec 10, 2024

Conversation

stephanegigandet
Contributor

PR to better handle things like "vegetal oil (palm, rapeseed)":

  • instead of turning "vegetal oil (palm, rapeseed)" into "palm vegetal oil", "rapeseed vegetal oil", we now turn it into "vegetal oil (palm vegetal oil, rapeseed vegetal oil)", because keeping the parent ingredient gives better ingredient percent estimation
  • improved the definition of all the variations of "huile et stéarine végétales non hydrogénées (colza, palme)" to have better coverage
  • added support for percentages like "huiles végétales 54% (colza, palme)"
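
The rewriting described in the list above can be illustrated with a small Python sketch. This is not the Perl code from this PR: the regular expression, the function name `expand_category_types`, and the choice to keep the percentage on the parent ingredient are all assumptions made for illustration, and the sketch only handles English word order (French forms such as "huile de colza" would need language-specific rules).

```python
import re

# Hypothetical pattern: "<category> [<percent>%] (<type 1>, <type 2>, ...)"
CATEGORY_TYPES = re.compile(
    r"^\s*(?P<category>[^()%\d]+?)\s*"           # e.g. "vegetal oil"
    r"(?:(?P<percent>\d+(?:[.,]\d+)?)\s*%\s*)?"  # optional "54%"
    r"\((?P<types>[^()]+)\)\s*$"                 # "(palm, rapeseed)"
)


def expand_category_types(text: str) -> str:
    """Rewrite 'category (type 1, type 2)' so each type carries the category name."""
    match = CATEGORY_TYPES.match(text)
    if match is None:
        # Not a 'category (types)' ingredient: leave it untouched
        return text
    category = match.group("category")
    percent = match.group("percent")
    types = [t.strip() for t in match.group("types").split(",") if t.strip()]
    # English word order only: "palm" + "vegetal oil" gives "palm vegetal oil"
    children = ", ".join(f"{t} {category}" for t in types)
    # Assumption: the percentage stays attached to the parent ingredient
    parent = f"{category} {percent}%" if percent else category
    return f"{parent} ({children})"


print(expand_category_types("vegetal oil (palm, rapeseed)"))
print(expand_category_types("vegetal oil 54% (palm, rapeseed)"))
```

The parent "vegetal oil" is kept as the outer ingredient and the types become its sub-ingredients, which is the nesting the description says helps percent estimation.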

Work in progress, some tests will need to be updated.

@stephanegigandet stephanegigandet requested a review from a team as a code owner November 8, 2024 17:41
@stephanegigandet
Contributor Author

/update_tests_results

@github-actions github-actions bot added 🧪 tests 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis 🥗 Ingredients labels Nov 8, 2024
@github-actions github-actions bot added the GitHub Actions Pull requests that update Github_actions code label Nov 27, 2024
@github-actions github-actions bot added the 💥 Merge Conflicts 💥 Merge Conflicts label Nov 28, 2024
@github-actions github-actions bot added 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies categories and removed 💥 Merge Conflicts 💥 Merge Conflicts labels Nov 29, 2024
@github-actions github-actions bot added the 💥 Merge Conflicts 💥 Merge Conflicts label Dec 5, 2024
@github-actions github-actions bot removed the 💥 Merge Conflicts 💥 Merge Conflicts label Dec 5, 2024
@stephanegigandet stephanegigandet changed the title fix: improve parsing of 'category (type 1, type 2..)' ingredients WIP fix: improve parsing of 'category (type 1, type 2..)' ingredients Dec 6, 2024

sonarqubecloud bot commented Dec 9, 2024

@stephanegigandet stephanegigandet enabled auto-merge (squash) December 9, 2024 14:13
Member

@alexgarel alexgarel left a comment

I did a review, but the hard parts are a bit cryptic to me, so I will trust the tests!

@@ -174,6 +174,10 @@ my $separators_except_comma = qr/(;|:|$middle_dot|\[|\{|\(|\N{U+FF08}|( $dashes

my $separators = qr/($stops\s|$commas|$separators_except_comma)/i;

# Symbols to indicate labels like organic, fairtrade etc.
Member

Suggested change
# Symbols to indicate labels like organic, fairtrade etc.
# Symbols to indicate labels like organic, fairtrade etc.
# like in "pomodoro*, oignons*. (* indicates organic ingredients)"


my %percent_or_quantity_regexps = ();

sub init_percent_or_quantity_regexps($ingredients_lc) {
Member

Great that you separated that!

Comment on lines +4578 to +4583
if ($ingredients_lc eq "en") {
    $ingredient =~ s/(?:organic |fair trade )*//ig;
}
elsif ($ingredients_lc eq "fr") {
    $ingredient =~ s/(?: bio| biologique| équitable|s|\s|' . $symbols_regexp . ')//ig;
}
Member

(future improvement) Maybe we could use a "used_in_ingredients" property in the labels taxonomy to get them.

It's a bit of a pity that we don't handle this in German and other languages.

At least we could use the taxonomy entries for organic and fair trade?
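
One way to picture this suggestion is the Python sketch below, which builds the label-stripping pattern from per-language data instead of hard-coding it per language. Everything here is hypothetical: `LABEL_SYNONYMS` is a hand-written stand-in for labels-taxonomy entries flagged with a "used_in_ingredients" property (which does not exist yet), and the function names are invented for illustration. It is not the project's Perl implementation.

```python
import re

# Hypothetical stand-in for labels-taxonomy synonyms that would carry
# a "used_in_ingredients" property; the real taxonomy data may differ.
LABEL_SYNONYMS = {
    "en": ["organic", "fair trade"],
    "fr": ["bio", "biologique", "équitable"],
    "de": ["bio", "fairtrade"],
}


def build_label_regexp(lc):
    """Return a compiled pattern matching any label synonym for language lc, or None."""
    synonyms = LABEL_SYNONYMS.get(lc, [])
    if not synonyms:
        return None
    # Longest first, so a longer synonym is tried before a shorter one
    ordered = sorted(synonyms, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)


def strip_labels(ingredient, lc):
    """Remove label words from an ingredient name and normalize whitespace."""
    regexp = build_label_regexp(lc)
    if regexp is None:
        # No label data for this language: leave the ingredient unchanged
        return ingredient
    stripped = regexp.sub(" ", ingredient)
    return " ".join(stripped.split())


print(strip_labels("organic palm oil", "en"))
print(strip_labels("huile de colza biologique", "fr"))
```

With this shape, supporting another language means adding data rather than another `elsif` branch.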

@stephanegigandet stephanegigandet merged commit 42618ac into main Dec 10, 2024
15 checks passed
@stephanegigandet stephanegigandet deleted the category-types-ingredients branch December 10, 2024 11:12
stephanegigandet pushed a commit that referenced this pull request Dec 11, 2024
🤖 I have created a release *beep* *boop*
---


## [2.51.0](v2.50.0...v2.51.0) (2024-12-10)


### Features

* Add script to remove nearly empty products with quality issues ([#11058](#11058)) ([82726d5](82726d5))
* NOVA 4 attribute and knowledge panel improvements ([#11035](#11035)) ([9048011](9048011))


### Bug Fixes

* additives table + clean HTML to remove some validation errors ([#11093](#11093)) ([474f68d](474f68d))
* avoid crash if ingredients services called without ingredients_lc ([#11055](#11055)) ([1db3e94](1db3e94))
* data quality, false positive, nutrition sum with lower symbol ([#11076](#11076)) ([d389c87](d389c87))
* data quality, false positive, nutrition sum with lower symbol for milk below the table ([#11098](#11098)) ([7febb69](7febb69))
* display of usage in scripts/import_csv_file.pl ([#11091](#11091)) ([91881f8](91881f8))
* improve parsing of 'category (type 1, type 2..)' ingredients ([#10999](#10999)) ([42618ac](42618ac))
* letter A at end of string is not a stopword ([#11095](#11095)) ([6eaeb26](6eaeb26))
* Load products in mongodb ([#11072](#11072)) ([6787ba1](6787ba1))
* new images path ([#11096](#11096)) ([8658959](8658959))
* pro platform product writes to the public platform MongoDB database ([#11065](#11065)) ([f77eb82](f77eb82))
* product image move [#11067](#11067) ([#11092](#11092)) ([30257c1](30257c1))
* remove warning in ecobalyse matching of ingredients ([#11062](#11062)) ([c29fce9](c29fce9))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Labels
categories GitHub Actions Pull requests that update Github_actions code 🥗🔍 Ingredients analysis https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis Ingredients processing 🥗 Ingredients 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies 🧪 tests
Projects
Status: Done
3 participants