Dev.casing #274

Merged
merged 6 commits into from
Nov 15, 2023
Changes from 4 commits
2 changes: 2 additions & 0 deletions g2p/mappings/langs/kwk/config-g2p.yaml
@@ -56,6 +56,8 @@ mappings:
     out_lang: kwk-umista
     rule_ordering: apply-longest-first
     prevent_feeding: true
+    case_sensitive: false
+    preserve_case: true
     authors:
       - Fineen Davis
       - Olivia Chen
Binary file modified g2p/mappings/langs/langs.json.gz
Binary file not shown.
20 changes: 19 additions & 1 deletion g2p/mappings/utils.py
@@ -661,7 +661,13 @@ class _MappingModelDefinition(BaseModel):
     """Deprecated: Please use rule_ordering='as_written' """

     case_sensitive: bool = True
Owner Author

I would like to see some tests about interactions between case_sensitive and preserve_case. Does case_sensitive=False and preserve_case=True work for example?

Collaborator

Hmm, good question. When you set preserve_case=True, it only makes sense to have case_sensitive=False. Having both be true should raise an error.

Collaborator

The semantics of preserve_case is that case does not affect the meaning of the characters (i.e., the mapping is case insensitive) but should be preserved because it has non-phonetic meaning (e.g., proper name, beginning of the sentence).

Collaborator

So I just added a model validator that raises when the two are set incompatibly.
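
The check described in this thread could look roughly like the sketch below. It uses a plain dataclass instead of the project's pydantic model, and `CaseConfig` is a hypothetical name, so this is an illustration of the rule rather than the validator that was actually added.

```python
from dataclasses import dataclass


@dataclass
class CaseConfig:
    """Hypothetical stand-in for the two mapping options discussed here."""

    case_sensitive: bool = True
    preserve_case: bool = False

    def __post_init__(self) -> None:
        # preserve_case presupposes a case-insensitive mapping, so the
        # combination case_sensitive=True + preserve_case=True is rejected.
        if self.case_sensitive and self.preserve_case:
            raise ValueError("preserve_case=True requires case_sensitive=False")


# The combination used in the kwk config above is accepted.
ok = CaseConfig(case_sensitive=False, preserve_case=True)
```

Constructing `CaseConfig(case_sensitive=True, preserve_case=True)` raises `ValueError` in this sketch.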

-    """Lower all rules and conversion input"""
+    """When false, lowercase all rules and conversion input"""
+
+    case_equivalencies: dict = {}
+    """List of case equivalencies for preserve_case that are not already in the Unicode standard"""
+
+    preserve_case: bool = False
+    """Preserve source case in output"""

     escape_special: bool = False
     """Escape special characters in rules"""
@@ -755,6 +761,18 @@ def validate_norm_form(cls, v):
             v = "none"
         return v

+    @field_validator("case_equivalencies", mode="before")
+    @classmethod
+    def validate_case_equivalencies(cls, v):
+        if not v or v is None:
+            v = {}
+        for lower_case, upper_case in v.items():
+            if len(lower_case) != len(upper_case):
+                raise exceptions.MalformedMapping(
+                    f"Sorry, the case equivalency between {lower_case} and {upper_case} is not valid because it is not the same length, please write rules such that any case equivalent is of equal length."
+                )
+        return v
+
     # TODO[pydantic]: We couldn't refactor the `validator`, please replace it by `field_validator` manually.
     # Check https://docs.pydantic.dev/dev-v2/migration/#changes-to-validators for more information.
     @validator("rules_path", "abbreviations_path", "alignments_path", pre=True)
9 changes: 8 additions & 1 deletion g2p/tests/public/data/kwk.psv
@@ -11,10 +11,17 @@ kwk-boas|kwk-umista|g·āyaxalisē|gayax̱alisi
 kwk-boas|kwk-umista|x\u0323wēlaxᵋw\u1D07sdes|xwilax̱ʼwa̱sdis
 kwk-boas|kwk-umista|ăwŭnagwīsē ʟ̣ēg̣adēs|a̱wunagwisi dłig̱adis
 kwk-boas|kwk-umista|yîx ōmpas ōᵋmaxt!ālaʟēᵋyēxa|yix̱ umpas uʼmax̱t̓alatłiʼyix̱a
-kwk-boas|kwk-umista|tsāg̣ᴇmas g·ōkwas Ts!ᴇxᵋēdē|tsag̱a̱mas gukwas Ts!a̱x̱ʼidi
+kwk-boas|kwk-umista|tsāg̣ᴇmas g·ōkwas Ts!ᴇxᵋēdē|tsag̱a̱mas gukwas Tʼsa̱x̱ʼidi
 kwk-boas|kwk-umista|lāx̣wa ᵋnāx̣wax|laxwa ʼnaxwax̱
 kwk-boas|kwk-umista|g·ig̣ŭmaᵋyasa ᵋnᴇᵋmēmotasa|gig̱umaʼyasa ʼna̱ʼmimutasa
 kwk-boas|kwk-umista|yîxs sēsᴇyūʟaēs|yix̱s sisa̱yutłaʼis
 kwk-napa|kwk-ipa|gam̓ən|ɡaʔmən
 kwk-napa|kwk-ipa|c̓ay̓ux̌ʷ|tʼsaʔyuχʷ
 kwk-napa|kwk-ipa|wəq̓ʷɛʔs|wəqʼʷɛʔs
+
+# Artificial data to test capitalization of kwk BOAS->Umista
+kwk-boas|kwk-umista|TAtap!Aʟa|TAtap̓Atła
+# A real word, capitalized
+kwk-boas|kwk-umista|G·āyaxalisē|Gayax̱alisi
+# This case not activated because it doesn't actually currently work
+#kwk-boas|kwk-umista|TᴇAtᴇapʟ!Aʟa|TA̱ʼAta̱ʼaptʼłAtła
5 changes: 3 additions & 2 deletions g2p/tests/test_cli.py
@@ -143,6 +143,7 @@ def test_convert(self):
                         out_lang,
                         word_to_convert,
                         tok_option,
+                        reference_string,
                     )
                 error_count += 1

@@ -152,6 +153,7 @@ def test_convert(self):
                 out_lang,
                 word_to_convert,
                 tok_option,
+                reference_string,
             ) = first_failed_test
             output_string = self.runner.invoke(
                 convert,
@@ -160,8 +162,7 @@ def test_convert(self):
             self.assertEqual(
                 output_string,
                 reference_string.strip(),
-                f"{in_lang}->{out_lang} mapping error "
-                "for '{word_to_convert}'.\n"
+                f"{in_lang}->{out_lang} mapping error for '{word_to_convert}'.\n"
                 "Look for warnings in the log for any more mapping errors",
             )

4 changes: 4 additions & 0 deletions g2p/tests/test_mappings.py
@@ -187,6 +187,10 @@ def test_case_sensitive(self):
         self.assertEqual(transducer_case_sensitive("a").output_string, "a")
         self.assertEqual(transducer("A").output_string, "b")

+    def test_case_equivalencies(self):
+        with self.assertRaises(exceptions.MalformedMapping):
Owner Author

this is a good compromise

+            Mapping(rules=[{"in": "a", "out": "b"}], case_equivalencies={"a": "AA"})
+
     def test_escape_special(self):
         mapping = Mapping(rules=[{"in": r"\d", "out": "digit"}])
         mapping_escaped = Mapping(
37 changes: 37 additions & 0 deletions g2p/tests/test_transducer.py
@@ -3,6 +3,7 @@
 import os
 from unittest import TestCase, main

+from g2p.exceptions import MalformedMapping
 from g2p.mappings import Mapping
 from g2p.tests.public import PUBLIC_DIR
 from g2p.transducer import CompositeTransducer, Transducer, normalize_edges
@@ -218,6 +219,42 @@ def test_deletion(self):
         self.assertEqual(self.test_deletion_transducer_csv("a").output_string, "")
         self.assertEqual(self.test_deletion_transducer_json("a").output_string, "")

+    def test_case_preservation(self):
+        mapping = Mapping(
+            rules=[
+                {"in": "'a", "out": "b"},
+                {"in": "e\u0301", "out": "f"},
+                {"in": "tl", "out": "λ"},
+            ],
+            case_sensitive=False,
+            preserve_case=True,
+            norm_form="NFC",
+            case_equivalencies={"λ": "\u2144"},
+        )
+        transducer = Transducer(mapping)
+        self.assertEqual(transducer("'a").output_string, "b")
+        self.assertEqual(transducer("'A").output_string, "B")
+        self.assertEqual(transducer("E\u0301").output_string, "F")
+        self.assertEqual(transducer("e\u0301").output_string, "f")
+        # Test what happens in Heiltsuk. \u03BB (λ) should be capitalized as \u2144 (⅄)
+        self.assertEqual(transducer("TLaba").output_string, "\u2144aba")
+        self.assertEqual(transducer("tlaba").output_string, "λaba")
+        # I guess it's arguable what should happen here, but I'll just change case if any of the characters are differently cased
Owner Author

I think these tests are pretty clear about what our implementation does

+        self.assertEqual(transducer("Tlaba").output_string, "\u2144aba")
+        # case equivalencies that are not the same length cause indexing errors in the current implementation
+        with self.assertRaises(MalformedMapping):
+            Mapping(
+                rules=[
+                    {"in": "'a", "out": "b"},
+                    {"in": "e\u0301", "out": "f"},
+                    {"in": "tl", "out": "λ"},
+                ],
+                case_sensitive=False,
+                preserve_case=True,
+                norm_form="NFC",
+                case_equivalencies={"λ": "\u2144\u2144\u2144"},
+            )
+
     def test_normalize_edges(self):
         # Remove non-deletion edges with the same index as deletions
         bad_edges = [
59 changes: 57 additions & 2 deletions g2p/transducer/__init__.py
@@ -420,6 +420,7 @@
     def __init__(self, mapping: Mapping):
         self.mapping = mapping
         self.case_sensitive = mapping.case_sensitive
+        self.preserve_case = mapping.preserve_case
         self.norm_form = mapping.norm_form
         self.out_delimiter = mapping.out_delimiter
         self._index_match_pattern = re.compile(r"(?<={)\d+(?=})")
@@ -428,7 +429,7 @@
     def __repr__(self):
         return f"{self.__class__} between {self.mapping.in_lang} and {self.mapping.out_lang}"

-    def __call__(self, to_convert: str, index: bool = False, debugger: bool = False):
+    def __call__(self, to_convert: str):
         """The basic method to transduce an input. A proxy for self.apply_rules.

         Args:
@@ -439,7 +440,11 @@
             and output characters and their corresponding edges representing the indices
             of the transformation.
         """
-        return self.apply_rules(to_convert)
+        tg = self.apply_rules(to_convert)
+        if self.preserve_case:
+            return preserve_case(tg, self.mapping.case_equivalencies)
+        else:
+            return tg

     @staticmethod
     def _pua_to_index(string: str) -> int:
@@ -1257,3 +1262,53 @@
         else:
             return False
     return result
+
+
+def preserve_case(
+    tg: TransductionGraph, case_equivalencies: Dict[str, str] = None
+) -> TransductionGraph:
+    if case_equivalencies is None:
+        case_equivalencies = {}
[Codecov / codecov/patch] g2p/transducer/__init__.py#L1271: added line #L1271 was not covered by tests.
+    reverse_case_equivalencies = {v: k for k, v in case_equivalencies.items()}
Collaborator

I guess it's OK to assume this is a 1-1 mapping, as you do here.

+    all_lower_case_equivalencies = case_equivalencies.keys()
+    all_upper_case_equivalencies = case_equivalencies.values()
+    new_string = ""
+    for item in tg.substring_alignments():
+        in_sub = item[0]
+        out_sub = item[1]
+        any_in_upper = any(x.isupper() for x in in_sub)
+        any_in_lower = any(x.islower() for x in in_sub)
+        any_out_upper = any(x.isupper() for x in out_sub)
+        any_out_lower = any(x.islower() for x in out_sub)
+        # continue if character is un-caseable
+        if (
+            out_sub not in case_equivalencies
Collaborator

So you're assuming case_equivalencies only lists complete rule outputs, rather than characters? What about ʟ -> tλ which would rely on the standard "t":"T" but would have to declare "λ":"⅄"?

Owner Author

Yes, because we don't really have a notion of characters anywhere (re: the tokenization issue from before). Here you would have to do {'tλ': 'T⅄'}, I suppose.
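
To make the whole-output lookup concrete, here is a small sketch. `upper_output` is a hypothetical helper, not a g2p function; it only assumes that equivalency keys are complete rule outputs, as described in the reply above.

```python
from typing import Dict

# Hypothetical equivalency table keyed on a complete rule output:
# "tλ" (t + U+03BB) uppercases to "T⅄" (T + U+2144). Both sides have equal length.
case_equivalencies: Dict[str, str] = {"t\u03bb": "T\u2144"}


def upper_output(out_sub: str, equivalencies: Dict[str, str]) -> str:
    """Uppercase a whole rule output, preferring a declared equivalency."""
    if out_sub in equivalencies:
        return equivalencies[out_sub]
    # Otherwise fall back to Unicode's own case mapping.
    return out_sub.upper()


declared = upper_output("t\u03bb", case_equivalencies)
undeclared = upper_output("t\u03bb", {})
```

With the entry declared, the result is "T⅄"; without it, Python's `str.upper()` yields "TΛ" (Greek capital lambda, U+039B), which is why the equivalency has to be listed for the full output string.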

+            and not any_out_upper
+            and not any_out_lower
+        ):
+            new_string += out_sub
+            continue
+        # lower case using case equivalencies if they exist
+        if (
+            any_in_lower or in_sub in all_lower_case_equivalencies
+        ) and out_sub in all_upper_case_equivalencies:
+            new_string += reverse_case_equivalencies[out_sub]
+            continue
[Codecov / codecov/patch] g2p/transducer/__init__.py#L1295-L1296: added lines #L1295 - L1296 were not covered by tests.
+        # upper case using case equivalencies if they exist
+        elif (
+            any_in_upper or in_sub in all_upper_case_equivalencies
+        ) and out_sub in all_lower_case_equivalencies:
+            new_string += case_equivalencies[out_sub]
+            continue
+        # change to upper if required
+        if any_in_upper and any_out_lower:
+            new_string += out_sub.upper()
+            continue
+        # change to lower if required
+        if any_in_lower and any_out_upper:
+            new_string += out_sub.lower()
+            continue
[Codecov / codecov/patch] g2p/transducer/__init__.py#L1309-L1310: added lines #L1309 - L1310 were not covered by tests.
+        # just in case, append the out_sub
+        new_string += out_sub
+    tg.output_string = new_string
+    return tg
Collaborator

Boy, it's hard to follow all this logic! We need more extensive unit testing just to make sure it works.

E.g., exercise these:

  • Given rule aa -> bb, Aaxyz should get mapped to Bbxyz
  • the hard stuff in kwk BOAS with those small caps being mapped to a context-dependent case
  • Given rule a -> bb, AXYZ should get mapped to BBXYZ but Axyz to Bbxyz (although this is not fully reliable: in Spanish, ll is sometimes capitalized as a unit, e.g. llama might capitalize to LLama, though in my experience that's not a uniformly applied rule. And does kwk Umista consider as a single letter capitalized jointly, or as two letters with only the T being uppercased in capitalized words?)

Owner Author

> Boy, it's hard to follow all this logic! We need more extensive unit testing just to make sure it works.
>
> E.g., exercise these:
>
>   • Given rule aa -> bb, Aaxyz should get mapped to Bbxyz

Yes, this works.

>   • the hard stuff in kwk BOAS with those small caps being mapped to a context-dependent case

I don't really understand what is going on here, and I can't reproduce the examples you have above in the rules even without changing anything, so we need to talk more about this.

>   • Given rule a -> bb AXYZ should get mapped to BBXYZ but Axyz to Bbxyz (although this is actually not fully reliable, e.g., in Spanish in some cases ll is capitalized as a unit, e.g. llama might capitalize to LLama though from my experience that's not a uniformly applied rule. And does kwk Umista consider as a single letter capitalized jointly, or as two letters with only the T being uppercased in capitalized words?)

I disagree that, for a rule a -> bb, Axyz should be turned into Bbxyz. We will handle this as BBxyz; to get Bbxyz, someone would need to add a case equivalency between bb and Bb, in which case it should work.
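
The behaviour described in this reply can be sketched in isolation. The helper below is a simplified illustration, not the `preserve_case` function from the diff: `restore_case` is a hypothetical name, custom case equivalencies are ignored, and the alignment pairs are written by hand rather than taken from a transduction graph.

```python
from typing import List, Tuple


def restore_case(alignments: List[Tuple[str, str]]) -> str:
    """Re-case each output substring from its aligned input substring."""
    pieces = []
    for in_sub, out_sub in alignments:
        if any(ch.isupper() for ch in in_sub):
            # Any uppercase input character uppercases the whole output.
            pieces.append(out_sub.upper())
        elif any(ch.islower() for ch in in_sub):
            pieces.append(out_sub.lower())
        else:
            # Uncased input (digits, punctuation): leave the output alone.
            pieces.append(out_sub)
    return "".join(pieces)


# Rule a -> bb applied to "Axyz", with one alignment pair per input character.
result = restore_case([("A", "bb"), ("x", "x"), ("y", "y"), ("z", "z")])
print(result)  # BBxyz
```

Because the single input character "A" is aligned with the whole output "bb", there is nothing in the alignment that would single out the first "b", which is why the result is BBxyz rather than Bbxyz.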

Collaborator

I guess we might need to ask Daisy what she thinks is right for kwk. And I guess we should also test with kwk and see what comes out with your current implementation.

So for kwk my assumption has been that you can infer whether a small cap L or a small cap E should be interpreted as upper case by the case of the next letter. I suppose we should validate that assumption with Daisy before investing too much development effort making it work...
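
For discussion, the next-letter assumption could be sketched as follows. `small_cap_is_upper` is a hypothetical helper that is not part of g2p, and it presumes the heuristic itself is right, which still needs to be validated as noted above.

```python
# Small capitals appearing in the Boas examples: ʟ (U+029F) and ᴇ (U+1D07).
SMALL_CAPS = {"\u029f", "\u1d07"}


def small_cap_is_upper(text: str, i: int) -> bool:
    """Guess whether the small cap at text[i] stands for an uppercase letter.

    Looks at the next ordinary cased letter; uncased characters and other
    small caps are skipped.
    """
    for ch in text[i + 1 :]:
        if ch in SMALL_CAPS:
            continue
        if ch.isupper():
            return True
        if ch.islower():
            return False
    # No cased letter follows: default to lowercase.
    return False


# The artificial test word from kwk.psv: TᴇAtᴇapʟ!Aʟa
word = "T\u1d07At\u1d07ap\u029f!A\u029fa"
flags = [small_cap_is_upper(word, i) for i, ch in enumerate(word) if ch in SMALL_CAPS]
```

For this word the four small caps come out as upper, lower, upper, lower, since they are followed by "A", "a", "!A" and "a" respectively.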