Dev.casing #274

Merged
merged 6 commits into from
Nov 15, 2023

Conversation

roedoejet
Owner

Here's an attempt to address #270

I don't really understand the kwk mapping, so I just tried to address it in a general case. Curious to discuss with @joanise to see if it works sufficiently for what's needed for kwk.

@roedoejet roedoejet requested a review from joanise July 18, 2023 23:06
Collaborator

@joanise joanise left a comment

I'm not sure this handles all the hard cases correctly. Leaving a bunch of comments now, then I'll want to apply preserve case to kwk BOAS->Umista and test to see what happens with my unit test cases for that, and then I'll probably have more feedback.

At the very least this code needs quite a bit more unit testing to capture more tricky cases. See inline comments.

) -> TransductionGraph:
    if case_equivalencies is None:
        case_equivalencies = {}
    reverse_case_equivalencies = {v: k for k, v in case_equivalencies.items()}
Collaborator

I guess it's OK to assume this is a 1-1 mapping, as you do here.
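
A minimal sketch of why the 1-1 assumption matters: inverting a dict with a comprehension silently keeps only the last key for any repeated value. The equivalency entries and the `is_one_to_one` helper below are made up for illustration and are not part of this PR.

```python
from typing import Dict

# Hypothetical equivalencies where two lowercase forms share one uppercase form.
case_equivalencies = {"\u03bb": "\u2144", "\u019b": "\u2144"}  # "λ": "⅄", "ƛ": "⅄"

# Same inversion as in the diff above: the later key overwrites the earlier one.
reverse_case_equivalencies = {v: k for k, v in case_equivalencies.items()}


def is_one_to_one(mapping: Dict[str, str]) -> bool:
    """Return True when no two keys share the same value."""
    return len(set(mapping.values())) == len(mapping)


print(reverse_case_equivalencies)         # only one entry survives
print(is_one_to_one(case_equivalencies))  # False
```

A check like this could guard the inversion if the 1-1 assumption ever stops holding.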

Comment on lines 230 to 243
# Test what happens in Heiltsuk. \u03BB should be capitalized as \u2144
self.assertEqual(transducer("TLaba").output_string, "\u2144aba")
self.assertEqual(transducer("tlaba").output_string, "λaba")
# I guess it's arguable what should happen here, but I'll just change case if any of the characters are differently cased
self.assertEqual(transducer("Tlaba").output_string, "\u2144aba")
Collaborator

@joanise joanise Jul 19, 2023

What do we do in the opposite scenario? kwk has complexities that I'd like to know are addressed. E.g., I think this is what we want:

context based:

ᴇTO -> ETO
ᴇto -> eto

1 char -> 2 char context based:

ʟat -> tłat
ʟAT -> TŁAT

2 char -> 3 char context based:

EᴇT -> EʼA̱T
Eᴇt -> E'a̱t
eᴇt -> e'a̱t

and I'll want to test what preserve_case does in all these cases when we turn it on for kwk BOAS->Umista.

any_out_lower = any(x.islower() for x in out_sub)
# continue if character is un-caseable
if (
    out_sub not in case_equivalencies
Collaborator

So you're assuming case_equivalencies only lists complete rule outputs, rather than characters? What about ʟ -> tλ which would rely on the standard "t":"T" but would have to declare "λ":"⅄"?

Owner Author

yes because we don't really have a notion of characters anywhere (re: the tokenization issue from before). here you would have to do {'tλ': 'T⅄'} I suppose
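
To make the difference concrete, here is a small sketch of a lookup keyed on the complete rule output, as described in this reply. `uppercase_output` is a hypothetical helper, not the PR's code, and the fallback to `str.upper()` is an assumption.

```python
def uppercase_output(out_sub, case_equivalencies):
    """Hypothetical helper: look up the whole rule output, else fall back to str.upper()."""
    return case_equivalencies.get(out_sub, out_sub.upper())


per_character = {"\u03bb": "\u2144"}    # "λ": "⅄" declared for the character alone
whole_output = {"t\u03bb": "T\u2144"}   # "tλ": "T⅄" declared for the complete output

# The per-character entry is never consulted, so the fallback yields Greek capital lambda.
print(uppercase_output("t\u03bb", per_character))
# The whole-output entry matches and yields the turned Y.
print(uppercase_output("t\u03bb", whole_output))
```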

# just in case, append the out_sub
new_string += out_sub
tg.output_string = new_string
return tg
Collaborator

Boy, it's hard to follow all this logic! We need more extensive unit testing just to make sure it works.

E.g., exercise these:

  • Given rule aa -> bb, Aaxyz should get mapped to Bbxyz
  • the hard stuff in kwk BOAS with those small caps being mapped to a context-dependent case
  • Given rule a -> bb AXYZ should get mapped to BBXYZ but Axyz to Bbxyz (although this is actually not fully reliable, e.g., in Spanish in some cases ll is capitalized as a unit, e.g. llama might capitalize to LLama though from my experience that's not a uniformly applied rule. And does kwk Umista consider as a single letter capitalized jointly, or as two letters with only the T being uppercased in capitalized words?)

Owner Author

> Boy, it's hard to follow all this logic! We need more extensive unit testing just to make sure it works.
>
> E.g., exercise these:
>
>   • Given rule aa -> bb, Aaxyz should get mapped to Bbxyz

Yes, this works.

>   • the hard stuff in kwk BOAS with those small caps being mapped to a context-dependent case

I don't really understand what is going on here, and I can't reproduce the examples you give above with the existing rules, even without changing anything, so we need to talk more about this.

>   • Given rule a -> bb AXYZ should get mapped to BBXYZ but Axyz to Bbxyz (although this is actually not fully reliable, e.g., in Spanish in some cases ll is capitalized as a unit, e.g. llama might capitalize to LLama though from my experience that's not a uniformly applied rule. And does kwk Umista consider as a single letter capitalized jointly, or as two letters with only the T being uppercased in capitalized words?)

I disagree that, for a rule a -> bb, Axyz should be turned into Bbxyz. We will handle this as BBxyz. To get Bbxyz, someone would need to add a case equivalency for bb and Bb, in which case it should work.
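
A rough sketch of the behaviour described in this reply, for a single rule whose output is longer or shorter than its input. `preserve_case` here is a hypothetical stand-in and not the PR's implementation. It covers only the all-or-nothing case and does not reproduce the per-character handling that maps Aaxyz to Bbxyz for aa -> bb.

```python
def preserve_case(in_sub, out_sub, case_equivalencies=None):
    """Hypothetical sketch: uppercase the whole output if any input character is uppercase."""
    equivalencies = case_equivalencies or {}
    if any(ch.isupper() for ch in in_sub):
        # A declared equivalency for the complete output wins over str.upper().
        return equivalencies.get(out_sub, out_sub.upper())
    return out_sub


# Rule a -> bb, applied to the first character of "Axyz"
print(preserve_case("A", "bb") + "xyz")                # BBxyz
print(preserve_case("A", "bb", {"bb": "Bb"}) + "xyz")  # Bbxyz
```

With `{"λ": "⅄"}` declared, the same sketch gives ⅄ for both "TL" and "Tl", which matches the Heiltsuk assertions in the tests.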

Collaborator

I guess we might need to ask Daisy what she thinks is right for kwk. And I guess we should also test with kwk and see what comes out with your current implementation.

So for kwk my assumption has been that you can infer whether a small cap L or a small cap E should be interpreted as upper case by the case of the next letter. I suppose we should validate that assumption with Daisy before investing too much development effort making it work...
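
The assumption above could be sketched roughly as follows. `small_cap_is_upper` is a hypothetical helper, the default of lowercase when no cased letter follows is my own choice, and the whole heuristic is still unvalidated for kwk.

```python
# Small capital E (U+1D07) and small capital L (U+029F), as used in kwk BOAS.
SMALL_CAPS = {"\u1d07", "\u029f"}


def small_cap_is_upper(text, index):
    """Guess the case of the small cap at text[index] from the next cased letter."""
    for ch in text[index + 1:]:
        if ch in SMALL_CAPS:
            continue  # another small cap carries no case information
        if ch.isupper():
            return True
        if ch.islower():
            return False
    return False  # assumption: default to lowercase when nothing cased follows


print(small_cap_is_upper("\u1d07TO", 0))  # small cap E followed by "T"
print(small_cap_is_upper("\u1d07to", 0))  # small cap E followed by "t"
```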

g2p/tests/test_transducer.py
self.assertEqual(transducer("'A").output_string, "B")
self.assertEqual(transducer("E\u0301").output_string, "F")
self.assertEqual(transducer("E\u0301").output_string, "F")
# Test what happens in Heiltsuk. \u03BB should be capitalized as \u2144
Collaborator

We can help the reader of the code by inserting the rendered characters

Suggested change
- # Test what happens in Heiltsuk. \u03BB should be capitalized as \u2144
+ # Test what happens in Heiltsuk. \u03BB (λ) should be capitalized as \u2144 (⅄)
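
For readers wondering why this case needs an explicit equivalency at all, a quick check with the standard library shows that Python's built-in case mapping does not connect these two characters:

```python
import unicodedata

lam = "\u03bb"       # λ
turned_y = "\u2144"  # ⅄

# str.upper() maps λ to the Greek capital, not to the turned Y used in Heiltsuk.
print(unicodedata.name(lam.upper()))
print(lam.upper() == turned_y)

# The turned Y has no lowercase mapping of its own, so lower() leaves it unchanged.
print(unicodedata.name(turned_y))
print(turned_y.lower() == turned_y)
```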

Collaborator

joanise commented Jul 20, 2023

I'm playing with case_equivalencies now, to enhance the kwk mapping.
For example, for rule ᴇa,a̱ʼa, I'd like to declare the equivalency "a̱ʼa": "A̱ʼA", so that TᴇAtᴇa could get mapped to TA̱ʼAta̱ʼa rather than Ta̱ʼAta̱ʼa but I can't figure out how to make it happen.
A challenge, here, is that for this kwk mapping we'll have to declare a couple dozen equivalencies to make all the vowel pairs that include work correctly. Easier than duplicating all the rules, of course, but still has to all get listed if we keep the current solution in this PR.

So my question is, what's the correct syntax in config.yaml to list a bunch of equivalencies?

Collaborator

joanise commented Nov 8, 2023

@roedoejet
Rebased and improved. While still not perfect, I think it's time to merge this as good enough, assuming CI passes. It will help most of the time, and the errors in case preservation for kwk will just mean that, once in a while, a character that should have been case-preserved is not. I think it's better to accept this partial solution than to hold out for the perfect solution I was hoping for.

But boy, rebasing this over the pydantic update was not trivial! Please review my work carefully.

Collaborator

joanise commented Nov 8, 2023

Hum, I can't request a review from you @roedoejet because it was your PR in the first place. Review it anyway and comment here.


codecov bot commented Nov 8, 2023

Codecov Report

Attention: 8 lines in your changes are missing coverage. Please review.

Comparison is base (f766a66) 92.78% compared to head (3811a9e) 92.44%.
Report is 1 commits behind head on main.

❗ Current head 3811a9e differs from pull request most recent head c31c66b. Consider uploading reports for the commit c31c66b to get more accurate results

| Files | Patch % | Lines |
|---|---|---|
| g2p/transducer/__init__.py | 78.94% | 5 Missing and 3 partials ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #274      +/-   ##
==========================================
- Coverage   92.78%   92.44%   -0.34%     
==========================================
  Files          16       16              
  Lines        2231     2278      +47     
  Branches      498      514      +16     
==========================================
+ Hits         2070     2106      +36     
- Misses         91       98       +7     
- Partials       70       74       +4     


Owner Author

@roedoejet roedoejet left a comment

Looks good to me - I can't approve it, since it's my own PR. My only remaining question is the one about interactions between case_sensitive and preserve_case

@@ -187,6 +187,10 @@ def test_case_sensitive(self):
        self.assertEqual(transducer_case_sensitive("a").output_string, "a")
        self.assertEqual(transducer("A").output_string, "b")

    def test_case_equivalencies(self):
        with self.assertRaises(exceptions.MalformedMapping):
Owner Author

this is a good compromise

# Test what happens in Heiltsuk. \u03BB (λ) should be capitalized as \u2144 (⅄)
self.assertEqual(transducer("TLaba").output_string, "\u2144aba")
self.assertEqual(transducer("tlaba").output_string, "λaba")
# I guess it's arguable what should happen here, but I'll just change case if any of the characters are differently cased
Owner Author

I think these tests are pretty clear with what our implementation is

@@ -661,7 +661,13 @@ class _MappingModelDefinition(BaseModel):
    """Deprecated: Please use rule_ordering='as_written' """

    case_sensitive: bool = True
Owner Author

I would like to see some tests about interactions between case_sensitive and preserve_case. Does case_sensitive=False and preserve_case=True work for example?

Collaborator

Hum, good question. It actually only makes sense to have case_sensitive=False when you set preserve_case=True. Having both be true should raise an error.

Collaborator

The semantics of preserve_case is that case does not affect the meaning of the characters (i.e., the mapping is case insensitive) but should be preserved because it has non-phonetic meaning (e.g., proper name, beginning of the sentence).

Collaborator

So I just added a model validator that raises when the two are set incompatibly.
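
As an illustration of the rule discussed in this thread, here is a standard-library stand-in for such a check. The real change uses a pydantic model validator on the mapping config. The `MappingOptions` class, the use of `ValueError`, and the error message below are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class MappingOptions:
    """Hypothetical stand-in for the two relevant fields of the mapping config."""

    case_sensitive: bool = True
    preserve_case: bool = False

    def __post_init__(self):
        # preserve_case only makes sense for a case-insensitive mapping.
        if self.preserve_case and self.case_sensitive:
            raise ValueError("preserve_case=True requires case_sensitive=False")


# Compatible settings construct normally.
options = MappingOptions(case_sensitive=False, preserve_case=True)

# Incompatible settings are rejected at construction time.
try:
    MappingOptions(case_sensitive=True, preserve_case=True)
except ValueError as err:
    print(f"rejected: {err}")
```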

@joanise joanise merged commit daf33b8 into main Nov 15, 2023
10 checks passed
@joanise joanise deleted the dev.casing branch November 15, 2023 21:54
2 participants