Automatically derive filters based on a clean sample provided by the user. #148

Open
PinzhenChen opened this issue Jan 17, 2024 · 5 comments

@PinzhenChen

In practice, I would have large, noisy training data and a small sample of clean data that is representative of the downstream task (e.g. WMT validation sets).

It is still difficult for me to decide on the values for the filters. For example, should I choose a source_word_ratio of 0.4 or 0.5, especially if I do not speak both languages? There are many filters and values to search over. The process is largely empirical, and it is hard to attribute a change in the final system's BLEU/COMET to a specific value change.

If I provide a small clean dataset that is sufficiently representative of the test domain, can the tool automatically derive some rules/values for me? Maybe the tool should search for and return the filter values that are as "extreme" as possible yet do not cause the provided clean data to be filtered out.
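
To make the request concrete, here is a minimal sketch of the search in plain Python. It does not use any existing tool's API. The `word_ratio` score, the function names, and the `max_dropped` parameter are all hypothetical and only for illustration; in particular, `word_ratio` is not meant to match the definition of any existing filter. The idea is to score every clean pair, then pick the strictest minimum threshold that drops at most `max_dropped` of them.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]
Scorer = Callable[[str, str], float]


def word_ratio(src: str, tgt: str) -> float:
    """Hypothetical score: shorter word count / longer word count, in [0, 1]."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return 0.0
    return min(s, t) / max(s, t)


def strictest_min_threshold(scores: List[float], max_dropped: int = 0) -> float:
    """Largest threshold t such that at most `max_dropped` scores fall below t."""
    if not 0 <= max_dropped < len(scores):
        raise ValueError("max_dropped must be in [0, len(scores))")
    # After sorting, everything before index `max_dropped` may be dropped,
    # so the score at that index is the strictest allowed cut-off.
    return sorted(scores)[max_dropped]


def derive_thresholds(
    clean_pairs: List[Pair],
    scorers: Dict[str, Scorer],
    max_dropped: int = 0,
) -> Dict[str, float]:
    """Derive one minimum threshold per scorer from the clean sample."""
    return {
        name: strictest_min_threshold(
            [score(src, tgt) for src, tgt in clean_pairs], max_dropped
        )
        for name, score in scorers.items()
    }


clean_sample = [
    ("a b c d", "w x y z"),
    ("a b c", "w x y z"),
    ("a b", "w x y z"),
]
thresholds = derive_thresholds(clean_sample, {"word_ratio": word_ratio})
```

With `max_dropped=0` no clean pair is filtered out; allowing a few drops makes the derived values robust to the occasional outlier in the clean sample.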

@PinzhenChen PinzhenChen self-assigned this Jan 17, 2024
@marco-c

marco-c commented Jan 18, 2024

@PinzhenChen
Author

@marco-c Thanks for pointing me to this! The rule construction process described there is slightly different, but it would achieve a similar effect!

@marco-c

marco-c commented Jan 23, 2024

@PinzhenChen an idea we have been thinking about with @miau1 would be:

  1. Use https://helsinki-nlp.github.io/OpusFilter/automatic_configuration.html to generate an initial configuration
  2. Take a sample of sentence pairs from the clean and bad clusters
  3. Check whether each sampled pair is actually clean or bad
  4. Automatically adjust the rules so that the actually clean pairs move to the clean cluster and the actually bad pairs move to the bad cluster
  5. Repeat steps 2-4 until satisfied
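
The adjustment step in this loop could be sketched as follows for a single score-based rule. This is only an illustration in plain Python, not OpusFilter code: `count_errors`, `adjust_threshold`, and the `(score, is_clean)` data layout are hypothetical names, and it assumes a rule that keeps a pair when its score is at or above a threshold.

```python
from typing import List, Tuple

Labelled = Tuple[float, bool]  # (filter score, validated as actually clean?)


def count_errors(labelled: List[Labelled], threshold: float) -> int:
    """Count pairs whose keep/drop decision disagrees with the validated label."""
    return sum((score >= threshold) != is_clean for score, is_clean in labelled)


def adjust_threshold(labelled: List[Labelled], current: float) -> float:
    """Pick the candidate threshold with the fewest errors.

    Candidates are the observed scores plus the current value; ties are
    broken in favour of the candidate closest to the current threshold.
    """
    candidates = sorted({score for score, _ in labelled} | {current})
    return min(
        candidates,
        key=lambda t: (count_errors(labelled, t), abs(t - current)),
    )


# One round of the loop: a validated sample and the threshold from the
# initial configuration (values made up for illustration).
validated = [(0.9, True), (0.8, True), (0.6, True), (0.4, False), (0.3, False)]
initial = 0.7
updated = adjust_threshold(validated, initial)
```

Repeating the sample, validate, and adjust steps with fresh samples would then correspond to steps 2-4 above; a real implementation would need to handle several rules jointly rather than one threshold at a time.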

@PinzhenChen
Author

Oh, I wasn't aware that you were already working on this. It reads very similar to my initial idea.

@marco-c

marco-c commented Jan 23, 2024

Oh, we are not working on it yet, it was just an idea for now!
