num_mismatch discards some useful entries #132

gregtatum · 2023-11-15T14:55:35Z

I'm checking out this rule, and found some entries that were discarded which seemed valid to me. Mostly punctuation seems to be getting in the way.

English	Spanish
It is a concert version of an opera, which is part of the 2011/2012 opera season at the Teatro Real Theatre in Madrid.	Se trata de una ópera en versión de concierto que forma parte de la temporada 2011-2012 del Teatro Real de Madrid.
32.7 x 22.5 cm	32, 7 x 22,5 cm
In Spain there are over three million companies, of which around 2,000 are estimated to be start-ups.	En España, existen más de tres millones de empresas, de cuáles se estima que alrededor de 2000 son startups.
10:00 -13:30 / 16:30-19:00	10.00 -13.30 h. / 16.30-19.00 h.
5-12 May	Del 5 al 12 de mayo

XapaJIaMnu · 2023-11-15T15:56:39Z

I guess we should update the filter to maybe strip punctuation?
5-12 May is hard to match though

marco-c · 2023-11-20T16:38:13Z

Is there a possibility to have language-specific rules, such as "NUM-NUM" can be "Del NUM al NUM"?

jelmervdl · 2023-11-23T16:10:17Z

It is, but I'd rather avoid such rules where possible. I don't think there's a need to be so exact. This is more an issue with how the dash is interpreted (as a minus sign for the latter number) than anything.

jelmervdl · 2023-11-23T16:28:00Z

Running the lines above through num_filter.py --debug, I get the following ratios:
(Note that punctuation inside numbers, e.g. to mark thousands or decimals, is replaced by *)

#n. len(overlap) / len(diff) : overlapping | differences
1. 2 / 0 : {'2011', '2012'} | set()
2. 1 / 3 : {'22*5'} | {'7', '32*7', '32'}
3. 0 / 2 : set() | {'2*000', '2000'}
4. 0 / 10 : set() | {'19*00', '16*30', '30', '10*00', '-13', '10', '16', '0', '19', '-13*30'}
5. 2 / 0 : {'5', '12'} | set()

Lines 1 and 5 should not be discarded according to this.
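For readers without the script at hand, the comparison behind these ratios can be sketched roughly as below. This is a minimal illustration, not the code in num_filter.py: the pattern `NUM_RE` and the helpers `extract_numbers` and `compare` are hypothetical names, and the sketch does not reproduce every detail of the debug output (for example the `0` tokens on line 4).

```python
import re

# Hypothetical pattern (an assumption, not the one used by num_filter.py):
# an optional minus sign that does not directly follow a digit, then digits,
# then further digit groups joined by a single '.' or ','.
NUM_RE = re.compile(r"(?<!\d)-?\d+(?:[.,]\d+)*")


def extract_numbers(text):
    """Return the set of number tokens in text, with inner punctuation as '*'."""
    return {re.sub(r"[.,]", "*", token) for token in NUM_RE.findall(text)}


def compare(source, target):
    """Return (overlapping, differences) for the numbers on both sides."""
    src, trg = extract_numbers(source), extract_numbers(target)
    return src & trg, src ^ trg


overlap, diff = compare("32.7 x 22.5 cm", "32, 7 x 22,5 cm")
print(len(overlap), "/", len(diff))
```

With this pattern, the space in `32, 7` splits the number into two tokens, which is the mismatch reported for line 2.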

Line 2 is confused by the space, and I'm unsure how to deal with that. I could add a \s? to the regex, but I'd imagine that a line like "2, 3, 4 pears" should not be interpreted as the single number 2*3*4.

Line 3 fails because the filter doesn't know that 2,000 and 2000 are the same. Punctuation is replaced by * to account for the difference between '750 and 7.50'. Maybe this is unnecessary. Or maybe it should be smarter about normalising numbers, but that sounds tricky: 2,000 (thousands), 2,0 (decimal), and 2,000.00, but unlikely 2.00 (time? or price?).
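To make the trade-off concrete, here is one possible normalisation heuristic, sketched under loud assumptions: `normalise_number` is a hypothetical helper, not something in the filter, and its rules (the last of two different separators is the decimal mark; groups of exactly three digits are thousands) are guesses that will misread some inputs, such as a decimal written as 7.500.

```python
import re


def normalise_number(token):
    """Return a canonical form for a number token such as '2,000' or '22,5'.

    Heuristic sketch only; the rules below are assumptions.
    """
    separators = re.findall(r"[.,]", token)
    if not separators:
        return token
    if len(set(separators)) == 2:
        # Both '.' and ',' occur: assume the last one marks the decimals.
        decimal = separators[-1]
        thousands = "," if decimal == "." else "."
        return token.replace(thousands, "").replace(decimal, ".")
    groups = re.split(r"[.,]", token)
    if 1 <= len(groups[0]) <= 3 and all(len(g) == 3 for g in groups[1:]):
        # Only groups of three digits follow: assume thousands separators.
        return "".join(groups)
    if len(groups) == 2:
        # A single separator with a short tail: assume a decimal mark.
        return groups[0] + "." + groups[1]
    return token  # ambiguous (e.g. '1.2.3'), leave untouched


for example in ["2,000", "2000", "22,5", "2,000.00", "2.000,00"]:
    print(example, "->", normalise_number(example))
```

Under these rules 2,000 and 2000 compare equal, but the heuristic cannot tell a price or time like 2.00 from a decimal, which is the ambiguity raised above.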

Line 4 suffers from the space before the dash again, but at least that's consistent on both sides. It also doesn't know about : in times; when I add that to the regex, it matches the numbers correctly.
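The effect of allowing : inside numbers can be illustrated with a small sketch. The pattern below is a hypothetical stand-in, not the actual regex in num_filter.py, and `extract_times_aware` is a made-up name.

```python
import re

# Hypothetical pattern (an assumption): same idea as a plain number pattern,
# but ':' is accepted as an inner separator alongside '.' and ','.
NUM_WITH_TIME_RE = re.compile(r"(?<!\d)-?\d+(?:[.,:]\d+)*")


def extract_times_aware(text):
    """Return number tokens with inner '.', ',' or ':' replaced by '*'."""
    return {re.sub(r"[.,:]", "*", t) for t in NUM_WITH_TIME_RE.findall(text)}


english = extract_times_aware("10:00 -13:30 / 16:30-19:00")
spanish = extract_times_aware("10.00 -13.30 h. / 16.30-19.00 h.")
print(english & spanish, "|", english ^ spanish)
```

In this sketch both sides yield the same four tokens, including `-13*30` where the space before the dash makes it read as a minus sign on both sides.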


PinzhenChen · 2024-01-17T15:50:00Z

Numeric-aware embeddings as an extension to LASER/LaBSE?

In addition, we can train emb(112 km) == emb(70 miles). Can create training data by syntactically augmenting existing data. Not sure whether this sense of parallelism helps NMT performance.

jelmervdl added the bug label Nov 23, 2023

jelmervdl self-assigned this Nov 23, 2023

jelmervdl added a commit that referenced this issue Nov 23, 2023

Add time separator to num_mismatch

c13aed6

Re #132.
