num_mismatch discards some useful entries #132

Open
gregtatum opened this issue Nov 15, 2023 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@gregtatum
Contributor

I'm checking out this rule, and found some entries that were discarded which seemed valid to me. Mostly punctuation seems to be getting in the way.

| # | English | Spanish |
|---|---------|---------|
| 1 | It is a concert version of an opera, which is part of the 2011/2012 opera season at the Teatro Real Theatre in Madrid. | Se trata de una ópera en versión de concierto que forma parte de la temporada 2011-2012 del Teatro Real de Madrid. |
| 2 | 32.7 x 22.5 cm | 32, 7 x 22,5 cm |
| 3 | In Spain there are over three million companies, of which around 2,000 are estimated to be start-ups. | En España, existen más de tres millones de empresas, de cuáles se estima que alrededor de 2000 son startups. |
| 4 | 10:00 -13:30 / 16:30-19:00 | 10.00 -13.30 h. / 16.30-19.00 h. |
| 5 | 5-12 May | Del 5 al 12 de mayo |

@XapaJIaMnu
Collaborator

I guess we should update the filter to maybe strip punctuation?
5-12 May is hard to match though

@marco-c

marco-c commented Nov 20, 2023

Is there a possibility of having language-specific rules, so that "NUM-NUM" can match "Del NUM al NUM"?

@jelmervdl
Collaborator

It is, but I'd rather avoid such rules where possible. I don't think there's a need to be so exact. This is more an issue with how the dash is interpreted (as a minus sign for the latter number) than anything.
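
To illustrate the dash problem described above, here is a minimal sketch. Both patterns are assumptions made up for this example and are not taken from num_filter.py: the first treats any dash directly before digits as a minus sign, the second only does so when the dash is not immediately preceded by a digit.

```python
import re

# Hypothetical patterns for illustration only (not the ones in num_filter.py).
NAIVE = re.compile(r"-?\d+")              # any dash before digits is a minus sign
RANGE_AWARE = re.compile(r"(?<!\d)-?\d+")  # a dash right after a digit is a range dash

print(NAIVE.findall("5-12 May"))            # ['5', '-12']
print(RANGE_AWARE.findall("5-12 May"))      # ['5', '12']
print(RANGE_AWARE.findall("Del 5 al 12"))   # ['5', '12']
print(RANGE_AWARE.findall("10:00 -13:30"))  # ['10', '00', '-13', '30']
```

With the second pattern, "5-12 May" and "Del 5 al 12 de mayo" yield the same numbers without a language-specific rule. A dash preceded by a space, as in "10:00 -13:30", is still read as a minus sign.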

@jelmervdl jelmervdl added the bug Something isn't working label Nov 23, 2023
@jelmervdl jelmervdl self-assigned this Nov 23, 2023
@jelmervdl
Collaborator

Running the lines above through num_filter.py --debug, I get the following ratios:
(Note that punctuation inside numbers, e.g. to mark thousands or decimals, is replaced by *.)

#n. len(overlap) / len(diff) : overlapping | differences
1. 2 / 0 : {'2011', '2012'} | set()
2. 1 / 3 : {'22*5'} | {'7', '32*7', '32'}
3. 0 / 2 : set() | {'2*000', '2000'}
4. 0 / 10 : set() | {'19*00', '16*30', '30', '10*00', '-13', '10', '16', '0', '19', '-13*30'}
5. 2 / 0 : {'5', '12'} | set()

Lines 1 and 5 should not be discarded according to this.
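
The overlap/difference comparison can be sketched roughly as follows. This is not the code of num_filter.py: the regex, the function names `extract_numbers` and `compare`, and the stripping of leading zeros from plain integers are all assumptions chosen so that the sketch gives the same counts as the listing above.

```python
import re

# Hypothetical pattern (an assumption, not the one in num_filter.py): an optional
# minus sign that is not directly preceded by a digit, then digits, then any
# number of '.'/',' separated digit groups.
NUMBER = re.compile(r"(?<!\d)-?\d+(?:[.,]\d+)*")

def extract_numbers(text):
    """Return the set of normalised number tokens found in text."""
    numbers = set()
    for token in NUMBER.findall(text):
        if token.lstrip("-").isdigit():
            # Plain integer: drop leading zeros ('00' becomes '0').
            numbers.add(str(int(token)))
        else:
            # Punctuation inside a number is replaced by '*'.
            numbers.add(re.sub(r"[.,]", "*", token))
    return numbers

def compare(source, target):
    """Return (overlap, differences) between the numbers on both sides."""
    src, trg = extract_numbers(source), extract_numbers(target)
    return src & trg, src ^ trg

pairs = [
    ("32.7 x 22.5 cm", "32, 7 x 22,5 cm"),
    ("around 2,000 are estimated", "alrededor de 2000 son"),
    ("5-12 May", "Del 5 al 12 de mayo"),
]
for source, target in pairs:
    overlap, diff = compare(source, target)
    print(f"{len(overlap)} / {len(diff)} : {overlap} | {diff}")
```

For these three pairs the counts are 1 / 3, 0 / 2 and 2 / 0, in line with entries 2, 3 and 5 of the debug output.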

Line 2 is confused by the space and I'm unsure how to deal with that. I could add a \s? to the regex, but I'd imagine that a line such as "2, 3, 4 pears" should not be interpreted as the single number 2*3*4.
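
The trade-off of allowing an optional space can be seen in a small sketch. The two patterns and the helper `extract` are hypothetical and only approximate what the filter's regex might look like.

```python
import re

# Hypothetical patterns for illustration; not the regex from num_filter.py.
STRICT = re.compile(r"\d+(?:[.,]\d+)*")      # no space allowed inside a number
LENIENT = re.compile(r"\d+(?:[.,]\s?\d+)*")  # optional space after the separator

def extract(pattern, text):
    """Find numbers and replace inner punctuation (plus optional space) by '*'."""
    return [re.sub(r"[.,]\s?", "*", match) for match in pattern.findall(text)]

print(extract(STRICT, "32, 7 x 22,5 cm"))   # ['32', '7', '22*5']
print(extract(LENIENT, "32, 7 x 22,5 cm"))  # ['32*7', '22*5']
print(extract(LENIENT, "2, 3, 4 pears"))    # ['2*3*4']
```

The lenient pattern repairs "32, 7" but also merges the enumeration "2, 3, 4" into one token, which is exactly the concern raised above.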

Line 3 fails because it doesn't know that 2,000 and 2000 are the same. Punctuation is replaced by * to account for the difference between '750' and '7.50'. Maybe this is unnecessary. Or maybe it should be smarter about normalising numbers, but that sounds tricky: 2,000 (thousands), 2,0 (decimal) and 2,000.00, but unlikely 2.00 (time? Or price?).
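
One possible normalisation heuristic, purely as a sketch and not something the filter is known to implement: treat a trailing group of exactly three digits as a thousands group and any other trailing group as decimals. The function name `normalise_number` is made up for this example.

```python
import re

def normalise_number(token):
    """Heuristic sketch: a final group of exactly three digits is read as
    thousands, any other final group as decimals. Separators are '.' or ','."""
    head, *tail = re.split(r"[.,]", token)
    if tail and len(tail[-1]) != 3:
        # Last group is the decimal part; earlier groups are thousands groups.
        return head + "".join(tail[:-1]) + "." + tail[-1]
    # No separator at all, or every group is read as a thousands group.
    return head + "".join(tail)

for token in ["2,000", "2000", "2,0", "2,000.00", "2.00", "22,5", "22.5"]:
    print(token, "->", normalise_number(token))
```

This maps 2,000 and 2000 to the same value and 22,5 and 22.5 to the same value, but it is wrong whenever a decimal part happens to have three digits (1,234 meaning 1.234), which shows why the problem is tricky.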

Line 4 suffers from the space before the dash again, but at least that's consistent on both sides. It also doesn't know about : in times; when I add that to the regex, it matches the numbers correctly.
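
A sketch of the effect of accepting : as a separator inside numbers. The function `extract_numbers` and its `separators` parameter are assumptions for this example; the actual regex in num_filter.py may differ.

```python
import re

def extract_numbers(text, separators=".,"):
    """Extract numbers, replacing any inner separator character by '*'.
    Hypothetical helper; `separators` lists the characters allowed inside a number."""
    sep = "[" + re.escape(separators) + "]"
    pattern = re.compile(r"(?<!\d)-?\d+(?:" + sep + r"\d+)*")
    return {re.sub(sep, "*", token) for token in pattern.findall(text)}

english = "10:00 -13:30 / 16:30-19:00"
spanish = "10.00 -13.30 h. / 16.30-19.00 h."

# Without ':' the English times fall apart into hours and minutes.
print(extract_numbers(english) & extract_numbers(spanish))  # set()

# With ':' both sides produce the same four tokens.
print(extract_numbers(english, ".,:") == extract_numbers(spanish, ".,:"))  # True
```

Because the space before the dash appears on both sides, the "-13*30" token is identical in both languages and does not cause a mismatch here.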

jelmervdl added a commit that referenced this issue Nov 23, 2023
@PinzhenChen

PinzhenChen commented Jan 17, 2024

Numeric-aware embeddings as an extension to LASER/LaBSE?

In addition, we can train emb(112 km) == emb(70 miles). Can create training data by syntactically augmenting existing data. Not sure whether this sense of parallelism helps NMT performance.
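
One way to read the augmentation idea above is as a rewrite of existing sentences into unit-converted variants. The sketch below is only an illustration of that reading: the function `km_to_miles` and the rounding to whole miles are assumptions, and nothing like it is part of the project as far as this thread shows.

```python
import re

KM_TO_MILES = 0.621371  # conversion factor, miles per kilometre

def km_to_miles(text):
    """Rewrite every '<number> km' in text as a rounded number of miles."""
    def convert(match):
        miles = round(float(match.group(1)) * KM_TO_MILES)
        return f"{miles} miles"
    return re.sub(r"(\d+(?:\.\d+)?)\s*km\b", convert, text)

original = "The town is 112 km away."
augmented = km_to_miles(original)
print(augmented)  # The town is 70 miles away.

# (original, augmented) could then serve as a pair that should embed closely.
```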
