Automatically derive filters based on a clean sample provided by the user. #148

PinzhenChen · 2024-01-17T15:45:50Z

In practice, I would have big, noisy training data and a sample of clean data that is representative of the downstream task (e.g. WMT validation sets).

It is still difficult for me to decide on the values for the filters. For example, should I choose a source_word_ratio of 0.4 or 0.5, especially if I do not speak both languages? There are many filters and values to search over. The choice is largely empirical, and it is also hard to attribute the final system's BLEU/COMET to a specific value change.

If I provide a small clean dataset that is sufficiently representative of the test domain, can the tool automatically derive some rules/values for me? Maybe the tool should search for and return the filter values that are "extreme" enough yet do not lead to the provided clean data being filtered out?
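A minimal sketch of that search for a single filter, assuming the filter keeps a pair when its score is at least some minimum threshold. The `word_ratio` score and the `max_dropped` parameter are hypothetical illustrations, not part of the tool:

```python
def word_ratio(src: str, tgt: str) -> float:
    """Hypothetical score: shorter word count divided by longer word count."""
    a, b = len(src.split()), len(tgt.split())
    if max(a, b) == 0:
        return 0.0
    return min(a, b) / max(a, b)


def tightest_min_threshold(scores, max_dropped=0):
    """Return the largest minimum threshold that drops at most
    `max_dropped` of the clean scores (a pair is kept if score >= threshold)."""
    if not scores:
        raise ValueError("need at least one clean score")
    ordered = sorted(scores)
    # Everything strictly below ordered[k] is dropped, which is at most k items.
    k = min(max_dropped, len(ordered) - 1)
    return ordered[k]


# Tiny made-up clean sample, for illustration only.
clean_sample = [
    ("a b c", "x y z"),
    ("a b", "x y z w"),
    ("a b c d", "x y z"),
]
clean_scores = [word_ratio(s, t) for s, t in clean_sample]
strict = tightest_min_threshold(clean_scores)  # keeps every clean pair
relaxed = tightest_min_threshold(clean_scores, max_dropped=1)  # tolerates one loss
```

Allowing a small `max_dropped` makes the derived value more "extreme" at the cost of losing a few clean pairs, which may be reasonable if the clean sample itself contains outliers.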

marco-c · 2024-01-18T16:58:43Z

@PinzhenChen WDYT of https://helsinki-nlp.github.io/OpusFilter/automatic_configuration.html?

PinzhenChen · 2024-01-23T00:37:58Z

@marco-c Thanks for pointing me to this! The rule construction process described there is slightly different, but it would achieve a similar effect!

marco-c · 2024-01-23T08:57:24Z

@PinzhenChen an idea we have been thinking about with @miau1 would be:

1. Use https://helsinki-nlp.github.io/OpusFilter/automatic_configuration.html to generate an initial configuration
2. Take a sample of clean / bad sentence pairs
3. Validate them to actually be clean or bad
4. Automatically adjust the rules to move the actually clean to the clean cluster and the actually bad to the bad cluster
5. Repeat 2 - 4 until satisfied
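
The loop above could be sketched roughly as follows. This is a simplification under stated assumptions: a single score threshold stands in for the clean/bad clusters, the initial threshold stands in for the generated configuration, and `sample_pairs`, `validate`, and `score` are hypothetical callbacks. It does not call OpusFilter.

```python
def adjust_threshold(labelled):
    """Pick the threshold that misclassifies the fewest labelled pairs.
    `labelled` is a list of (score, is_clean); a pair is kept if score >= threshold."""
    candidates = sorted({s for s, _ in labelled}) + [float("inf")]

    def errors(t):
        return sum((s >= t) != is_clean for s, is_clean in labelled)

    return min(candidates, key=errors)


def refine(threshold, sample_pairs, validate, score, rounds=3):
    """Repeat sample -> validate -> adjust until the threshold stops changing."""
    labelled = []
    for _ in range(rounds):
        batch = sample_pairs(threshold)  # step 2: sample clean / bad pairs
        if not batch:
            break
        # Step 3: a human (or oracle) confirms each pair is clean or bad.
        labelled += [(score(p), validate(p)) for p in batch]
        new_threshold = adjust_threshold(labelled)  # step 4: adjust the rule
        if new_threshold == threshold:
            break  # "satisfied": no further change
        threshold = new_threshold
    return threshold


# Made-up pool of (score, actually_clean) pairs, for illustration only.
pool = [(0.9, True), (0.8, True), (0.3, False), (0.2, False)]
result = refine(
    0.5,
    sample_pairs=lambda t: pool,
    validate=lambda p: p[1],
    score=lambda p: p[0],
)
```

With several filters, the same idea would need a joint search over all thresholds, which is where the attribution problem mentioned earlier in the thread comes back.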

PinzhenChen · 2024-01-23T12:36:06Z

Oh I wasn't aware that you are already working on this. It reads very similar to my initial idea.

marco-c · 2024-01-23T12:37:05Z

Oh, we are not working on it yet, it was just an idea for now!

PinzhenChen self-assigned this Jan 17, 2024

marco-c mentioned this issue Jan 23, 2024

Add community contribution guidelines mozilla/firefox-translations-training#387
