Trying to process text in Hmong yields 500 Internal Server Error from web-api #200

Open
joanise opened this issue Mar 15, 2023 · 3 comments


@joanise
Member

joanise commented Mar 15, 2023

Context:
During our ICLDC workshop, a participant offered to help with Hmong g2p. Before contacting them, I wanted to see what und does with it, so I copy-pasted letters in the language from https://en.wikipedia.org/wiki/Hmong_language

Observations:
The Hmong letters are in Plane 1 of the Unicode standard, the Supplementary Multilingual Plane (Nyiakeng Puachue Hmong starts at U+1E100, and Pahawh Hmong at U+16B00).
Refs: searching for Hmong in https://unicode.org/charts/ yields two charts, one for each of these scripts.
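
To check which plane a pasted character belongs to, something like the following can help. It is only a small sketch: the helper name `describe` is made up for illustration, and the sample string uses the first code point of each block mentioned above plus a plain ASCII letter for contrast.

```python
def describe(text):
    """Return a (code point label, plane number) pair for each character."""
    rows = []
    for ch in text:
        cp = ord(ch)
        # Each Unicode plane spans 0x10000 code points, so the plane
        # number is the code point shifted right by 16 bits.
        rows.append((f"U+{cp:04X}", cp >> 16))
    return rows


# U+1E100 and U+16B00 are the block starts quoted above; "a" is plane 0.
sample = "\U0001E100\U00016B00a"
for label, plane in describe(sample):
    print(label, "plane", plane)
# U+1E100 plane 1
# U+16B00 plane 1
# U+0061 plane 0
```

Anything that reports plane 1 or above lies outside the Basic Multilingual Plane (plane 0).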

Problem inputs (in each case, I put the input in the text box and clicked next step; the result is from web-api):
yields 422 Could not find any words to align in the text.
𞄤𞄦𞄣‎𖬊𖬋 yields 500 Internal Server Error
𞄤𞄦𞄣‎𖬊𖬋 asdf yields a Possible Text Processing issue, with two given strings being mapped to empty output

There are multiple issues at play here:

  • The fonts we use don't support these characters
  • For the first example, which gave me 422, it looks like I just failed to copy and paste it correctly: my input was literally the diamond with a question mark (U+FFFD, the replacement character). I guess it illustrates that it's not that easy to grab these characters in the first place.
  • The second example contains only higher-plane characters, and we evidently don't accept that as input. A 500 is not good.
  • The third example gets slightly better results: the response is valid, except that the higher-plane characters disappear and don't get mapped to any sounds.

Desired behaviours, each of which could be its own issue:

  • Handle the font for this script, or maybe let the user specify an additional custom font.
  • Fix the 500 (we don't want that on any input the user could type).
  • Add support for higher-plane characters to our und mapping.
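
As a rough illustration of the second point, a pre-check along these lines could turn the unhandled failure into a controlled client error. This is a sketch under stated assumptions, not the web-api's actual code: `UnprocessableTextError` and `check_input` are hypothetical names, and "mappable" is approximated here as "alphanumeric and within plane 0", which is only a guess at what the und mapping can handle.

```python
class UnprocessableTextError(ValueError):
    """Hypothetical error that a web layer could translate into a 422."""


def check_input(text):
    """Return the higher-plane characters in text, or raise if nothing is mappable.

    Assumption: only alphanumeric characters in plane 0 (U+0000..U+FFFF)
    can currently be mapped to sounds.
    """
    mappable = [ch for ch in text if ord(ch) <= 0xFFFF and ch.isalnum()]
    unsupported = [ch for ch in text if ord(ch) > 0xFFFF]
    if not mappable:
        raise UnprocessableTextError(
            f"no mappable characters found ({len(unsupported)} unsupported)"
        )
    # The caller may want to warn about these rather than drop them silently.
    return unsupported


try:
    # Only higher-plane characters plus an invisible U+200E mark.
    check_input("\U0001E124\u200e\U00016B0A")
except UnprocessableTextError as err:
    print("would become a 422:", err)
```

With mixed input such as the third example, the function returns the list of characters that would be dropped, which could feed the existing "Possible Text Processing issue" warning.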
@joanise
Member Author

joanise commented Mar 15, 2023

I just did a quick test with text_unidecode, and it only has values for plane 0 in Unicode:

$ python -c 'import text_unidecode as tu; print(len(tu._replaces));'
65535

So we would need a specific g2p for this language. Maybe we could submit an extension to text_unidecode, but I'm really not sure we want to do that.
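
To illustrate why a table of that size makes higher-plane characters vanish, here is a stand-in sketch. It assumes a lookup table indexed by code point with 65535 entries, as the count above suggests; `TABLE` and `transliterate` are illustrative names, and this is not text_unidecode's actual code or data.

```python
# Stand-in table: 65535 entries, one per code point U+0001..U+FFFF.
# ASCII maps to itself; everything else maps to "" purely for illustration.
TABLE = [chr(cp) if cp < 128 else "" for cp in range(1, 0x10000)]


def transliterate(text):
    """Look each character up by code point; skip anything outside the table."""
    out = []
    for ch in text:
        index = ord(ch) - 1
        if 0 <= index < len(TABLE):
            out.append(TABLE[index])
        # Code points past the end of the table are silently dropped.
    return "".join(out)


print(len(TABLE))  # 65535
print(repr(transliterate("\U0001E124\U00016B0A asdf")))  # ' asdf'
```

Under that assumption, any character above U+FFFF has no slot in the table, which matches the empty output seen for the Hmong strings.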

@roedoejet
Collaborator

The font issue could be resolved with https://fonts.google.com/noto/specimen/Noto+Sans+Pahawh+Hmong, but I'm not sure we want to start bundling all these fonts with every readalong, and being selective will require some more thinking.

@joanise
Member Author

joanise commented Mar 15, 2023

Yeah, I don't think we want to ship with all fonts all the time. Hmong would be a use case requiring a custom font for a given RA, I think.
