-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dev.casing #274
Dev.casing #274
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this handles all the hard cases correctly. Leaving a bunch of comments now, then I'll want to apply preserve case to kwk BOAS->Umista and test to see what happens with my unit test cases for that, and then I'll probably have more feedback.
At the very least this code needs quite a bit more unit testing to capture more tricky cases. See inline comments.
) -> TransductionGraph: | ||
if case_equivalencies is None: | ||
case_equivalencies = {} | ||
reverse_case_equivalencies = {v: k for k, v in case_equivalencies.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's OK to assume this is a 1-1 mapping, as you do here.
g2p/tests/test_transducer.py
Outdated
# Test what happens in Heiltsuk. \u03BB should be capitalized as \u2144 | ||
self.assertEqual(transducer("TLaba").output_string, "\u2144aba") | ||
self.assertEqual(transducer("tlaba").output_string, "λaba") | ||
# I guess it's arguable what should happen here, but I'll just change case if any of the characters are differently cased | ||
self.assertEqual(transducer("Tlaba").output_string, "\u2144aba") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we do in the opposite scenario? kwk has complexities that I'd like to know are addressed. E.g., I think this is what we want:
context based:
ᴇTO -> ETO
ᴇto -> eto
1 char -> 2 char context based:
ʟat -> tłat
ʟAT -> TŁAT
2 char -> 3 char context based:
EᴇT -> EʼA̱T
Eᴇt -> E'a̱t
eᴇt -> e'a̱t
and I'll want to test what preserve_case does in all these cases when we turn it on for kwk BOAS->Umista.
any_out_lower = any(x.islower() for x in out_sub) | ||
# continue if character is un-caseable | ||
if ( | ||
out_sub not in case_equivalencies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're assuming case_equivalencies only lists complete rule outputs, rather than characters? What about ʟ -> tλ
which would rely on the standard "t":"T" but would have to declare "λ":"⅄"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes because we don't really have a notion of characters anywhere (re: the tokenization issue from before). here you would have to do {'tλ': 'T⅄'}
I suppose
# just in case, append the out_sub | ||
new_string += out_sub | ||
tg.output_string = new_string | ||
return tg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boy, it's hard to follow all this logic! We need more extensive unit testing just to make sure it works.
E.g., exercise these:
- Given rule
aa -> bb
,Aaxyz
should get mapped toBbxyz
- the hard stuff in kwk BOAS with those small caps being mapped to a context-dependent case
- Given rule
a -> bb
AXYZ
should get mapped toBBXYZ
butAxyz
toBbxyz
(although this is actually not fully reliable, e.g., in Spanish in some casesll
is capitalized as a unit, e.g.llama
might capitalize toLLama
though from my experience that's not a uniformly applied rule. And does kwk Umista considertł
as a single letter capitalized jointly, or as two letters with only the T being uppercased in capitalized words?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boy, it's hard to follow all this logic! We need more extensive unit testing just to make sure it works.
E.g., exercise these:
- Given rule
aa -> bb
,Aaxyz
should get mapped toBbxyz
Yes, this works.
- the hard stuff in kwk BOAS with those small caps being mapped to a context-dependent case
I don't fundamentally really understand what is going on here, and I can't reproduce the examples you have above in the rules even without changing anything so we need to talk more about this.
- Given rule
a -> bb
AXYZ
should get mapped toBBXYZ
butAxyz
toBbxyz
(although this is actually not fully reliable, e.g., in Spanish in some casesll
is capitalized as a unit, e.g.llama
might capitalize toLLama
though from my experience that's not a uniformly applied rule. And does kwk Umista considertł
as a single letter capitalized jointly, or as two letters with only the T being uppercased in capitalized words?)
I disagree that for a rule a -> bb Axyz should get turned to Bbxyz. We will handle this as BBxyz, otherwise someone would need to add a case equivalency for bb and Bb, in which case it should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we might need to ask Daisy what she thinks is right for kwk. And I guess we should also test with kwk and see what comes out with your current implementation.
So for kwk my assumption has been that you can infer whether a small cap L or a small cap E should be interpreted as upper case by the case of the next letter. I suppose we should validate that assumption with Daisy before investing too much development effort making it work...
g2p/tests/test_transducer.py
Outdated
self.assertEqual(transducer("'A").output_string, "B") | ||
self.assertEqual(transducer("E\u0301").output_string, "F") | ||
self.assertEqual(transducer("E\u0301").output_string, "F") | ||
# Test what happens in Heiltsuk. \u03BB should be capitalized as \u2144 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can help the reader of the code by inserting the rendered characters
# Test what happens in Heiltsuk. \u03BB should be capitalized as \u2144 | |
# Test what happens in Heiltsuk. \u03BB (λ) should be capitalized as \u2144 (⅄) |
I'm playing with case_equivalencies now, to enhance the kwk mapping. So my questions is, what's the correct syntax in config.yaml to list a bunch of equivalencies? |
@roedoejet But boy, rebasing this over the pydantic update was not trivial! Please review my work carefully. |
Hum, I can't request a review from you @roedoejet because it was your PR in the first place. Review it anyway and comment here. |
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #274 +/- ##
==========================================
- Coverage 92.78% 92.44% -0.34%
==========================================
Files 16 16
Lines 2231 2278 +47
Branches 498 514 +16
==========================================
+ Hits 2070 2106 +36
- Misses 91 98 +7
- Partials 70 74 +4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me - I can't approve it, since it's my own PR. My only remaining question is the one about interactions between case_sensitive
and preserve_case
@@ -187,6 +187,10 @@ def test_case_sensitive(self): | |||
self.assertEqual(transducer_case_sensitive("a").output_string, "a") | |||
self.assertEqual(transducer("A").output_string, "b") | |||
|
|||
def test_case_equivalencies(self): | |||
with self.assertRaises(exceptions.MalformedMapping): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good compromise
# Test what happens in Heiltsuk. \u03BB (λ) should be capitalized as \u2144 (⅄) | ||
self.assertEqual(transducer("TLaba").output_string, "\u2144aba") | ||
self.assertEqual(transducer("tlaba").output_string, "λaba") | ||
# I guess it's arguable what should happen here, but I'll just change case if any of the characters are differently cased |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these tests are pretty clear with what our implementation is
@@ -661,7 +661,13 @@ class _MappingModelDefinition(BaseModel): | |||
"""Deprecated: Please use rule_ordering='as_written' """ | |||
|
|||
case_sensitive: bool = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to see some tests about interactions between case_sensitive
and preserve_case
. Does case_sensitive=False
and preserve_case=True
work for example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum, good question. It actually only makes sense to have case_sensitive=False
when you set preserve_case=True
. Having both be true should raise an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The semantics of preserve case is that case does affect the meaning of the characters (i.e., the mapping is case insensitive) but should be preserved because it has non-phonetic meaning (e.g., proper name, beginning of the sentence).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I just added a model validator that raises when the two are set incompatibly.
Here's an attempt to address #270
I don't really understand the
kwk
mapping, so I just tried to address it in a general case. Curious to discuss with @joanise to see if it works sufficiently for what's needed forkwk
.