chore: analyze language identification models performance on short ingredient texts with precision-recall evaluation (#349) #365

Open
wants to merge 3 commits into base: develop
Conversation

korablique

This is the research conducted for issue #349.

01_extract_data.py: extracts all texts with their languages from the Hugging Face dataset.
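A minimal sketch of what this extraction step could look like; the dataset id, split, and field layout below are assumptions, not the script's actual code:

```python
# Hypothetical extraction step: pull (language, text) pairs out of the
# Open Food Facts dataset on Hugging Face. The dataset id, split and
# schema are assumptions -- adjust to whatever 01_extract_data.py uses.
from datasets import load_dataset

ds = load_dataset("openfoodfacts/product-database", split="food", streaming=True)

rows = []
for product in ds:
    # Assumed schema: ingredients_text is a list of {"lang", "text"} entries.
    for entry in product.get("ingredients_text") or []:
        lang, text = entry.get("lang"), entry.get("text")
        if lang and text:
            rows.append({"lang": lang, "text": text})
```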

02_select_short_texts_with_known_ingredients.py: filters texts up to 10 words long, runs ingredient analysis through the OFF API, keeps ingredient texts in which at least 80% of the ingredients are known, and adds short texts from the manually checked data.
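A sketch of the selection logic under stated assumptions: the OFF API call itself is not shown, and the response shape (an `ingredients` list with an `is_in_taxonomy` flag) is a guess; only the 10-word and 80% thresholds come from the description above:

```python
from typing import Iterable

def is_short(text: str, max_words: int = 10) -> bool:
    """True if the text has at most max_words whitespace-separated tokens."""
    return len(text.split()) <= max_words

def known_share(analysis: dict) -> float:
    """Fraction of recognized ingredients in an OFF analysis response.
    The response shape used here is an assumption, not the PR's code."""
    ingredients = analysis.get("ingredients") or []
    if not ingredients:
        return 0.0
    return sum(1 for i in ingredients if i.get("is_in_taxonomy")) / len(ingredients)

def select_short_known(rows: Iterable[dict], analyses: dict[str, dict]) -> list[dict]:
    """Keep rows whose text is short and whose analysis is >= 80% known."""
    return [
        r for r in rows
        if is_short(r["text"]) and known_share(analyses.get(r["text"], {})) >= 0.8
    ]
```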

What the manually checked data is:
I created a validation dataset from OFF texts (42 languages, 15-30 texts per language).
I took 30 random texts in each language and obtained language predictions using the DeepL API and two other models (language-detection-fine-tuned-on-xlm-roberta-base and multilingual-e5-language-detection). For languages they don't support, I used Google Translate and ChatGPT for verification. (As a result, after correcting the labels, some languages have fewer than 30 texts.)
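For reference, a sketch of how those two Hugging Face models could be queried for the manual check; the hub paths below are guesses at where the named checkpoints live:

```python
# Hypothetical manual-check helper: run both Hugging Face detectors on a
# text. The hub paths are assumptions about the checkpoints named above.
from transformers import pipeline

xlmr = pipeline(
    "text-classification",
    model="ivanlau/language-detection-fine-tuned-on-xlm-roberta-base",
)
e5 = pipeline(
    "text-classification",
    model="Mike0307/multilingual-e5-language-detection",
)

text = "sucre, farine de blé, œufs"
print(xlmr(text)[0]["label"], e5(text)[0]["label"])
```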

03_calculate_metrics.py: obtains predictions from the FastText and Lingua language detector models for texts up to 10 words long, and calculates precision, recall, and F1-score.
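A sketch of the metrics step, assuming FastText's `lid.176.bin` checkpoint and the `lingua-language-detector` package; the sample texts and labels are illustrative stand-ins, not the evaluation set:

```python
import fasttext
from lingua import LanguageDetectorBuilder
from sklearn.metrics import classification_report

ft_model = fasttext.load_model("lid.176.bin")  # model path is an assumption
lingua_detector = LanguageDetectorBuilder.from_all_languages().build()

def fasttext_lang(text: str) -> str:
    """ISO code predicted by FastText (labels look like '__label__en')."""
    labels, _probs = ft_model.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__")

def lingua_lang(text: str) -> str:
    """ISO 639-1 code predicted by Lingua, or 'unknown' if undetected."""
    lang = lingua_detector.detect_language_of(text)
    return lang.iso_code_639_1.name.lower() if lang else "unknown"

# Illustrative stand-ins for the short (<= 10 words) evaluation set.
texts = ["sucre, farine de blé, œufs", "sugar, wheat flour, eggs"]
y_true = ["fr", "en"]

print(classification_report(y_true, [fasttext_lang(t) for t in texts], zero_division=0))
print(classification_report(y_true, [lingua_lang(t) for t in texts], zero_division=0))
```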

Results are in the files 10_words_metrics.csv, fasttext_confusion_matrix.csv, and lingua_confusion_matrix.csv.

It turned out that both models demonstrate low precision and high recall for some languages (indicating that the threshold might be too high and should be adjusted).
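To illustrate the threshold point, one could sweep a minimum acceptance confidence on the FastText predictions and watch precision and recall trade off (continuing the sketch above, with `ft_model`, `texts`, and `y_true` as defined there; this experiment is not part of the PR):

```python
from sklearn.metrics import precision_recall_fscore_support

def fasttext_lang_thresholded(text: str, threshold: float) -> str:
    """Return FastText's label only if its confidence clears the threshold."""
    labels, probs = ft_model.predict(text.replace("\n", " "), k=1)
    label = labels[0].removeprefix("__label__")
    return label if probs[0] >= threshold else "unknown"

for threshold in (0.0, 0.3, 0.5, 0.7, 0.9):
    y_pred = [fasttext_lang_thresholded(t, threshold) for t in texts]
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```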

@baslia (Collaborator) left a comment


I think it would be great to have a separate file (like 04_inference.py) dedicated to inference; then it would be easy to wrap the code and deploy it in the future.
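One possible shape for such a file, purely hypothetical since it does not exist in this PR: a single entry point wrapping both detectors so a service could import it later.

```python
# 04_inference.py (hypothetical): one callable wrapping both detectors.
import fasttext
from lingua import LanguageDetectorBuilder

_FT_MODEL = fasttext.load_model("lid.176.bin")  # path is an assumption
_LINGUA = LanguageDetectorBuilder.from_all_languages().build()

def detect_language(text: str, backend: str = "fasttext") -> str:
    """Return an ISO 639-1 code for text, or 'unknown' if undetected."""
    text = text.replace("\n", " ")
    if backend == "fasttext":
        labels, _ = _FT_MODEL.predict(text, k=1)
        return labels[0].removeprefix("__label__")
    lang = _LINGUA.detect_language_of(text)
    return lang.iso_code_639_1.name.lower() if lang else "unknown"

if __name__ == "__main__":
    print(detect_language("sucre, farine de blé, œufs"))
```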

@korablique korablique changed the title analyze language identification models performance on short ingredient texts with precision-recall evaluation (#349) chore: analyze language identification models performance on short ingredient texts with precision-recall evaluation (#349) Dec 1, 2024
@korablique (Author)

@baslia Could you please help me with this error? https://github.com/openfoodfacts/openfoodfacts-ai/actions/runs/12106320879/job/33751933052?pr=365
I haven't worked with the labeler before and don't understand how to fix this issue.

@baslia baslia added the ✨ enhancement New feature or request label Dec 3, 2024
@baslia (Collaborator) commented Dec 3, 2024

> @baslia Could you please help me with this error? https://github.com/openfoodfacts/openfoodfacts-ai/actions/runs/12106320879/job/33751933052?pr=365 I haven't worked with the labeler before and don't understand how to fix this issue.

Hey, I thought this was just a check on the PR label attribute, so I attached the "enhancement" label.
That didn't fix it, but I don't think this check is a big deal.

@raphael0202 (Contributor)

Yes, indeed, it's a configuration issue in the repo; it's safe to ignore!
I haven't had time to review your work yet, but I will find some time to do it next week!
Thank you for pushing the code.
