
Test #8 (Merged)
merged 10 commits on Oct 31, 2019

Conversation

@roedoejet (Collaborator)

Hey all,

So this is hopefully the last big exorcism of the g2p functionality from ReadAlong-Studio. The only thing left to move over, really, is LexiconG2P, and I'm only half-certain I want to bring it over.

We've been documenting modules well with a standard that puts a shebang and an encoding declaration at the top:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The module docstring is then wrapped in pound signs, like:

#########################################################################
#
# end_to_end.py
#
# Takes an XML file (preferably using TEI conventions) and
# makes:
#
# 1. An XML file with added IDs for elements (if the elements didn't
#    already have ID attributes)
# 2. An FSG file where the transitions are those IDs.
# 3. A dictionary file giving a mapping between IDs and approximate
#    pronunciations in ARPABET
#
#
# The XML file needs to have xml:lang attributes; if tokenization is
# not performed (indicated with <w> elements) it will be attempted
# automatically.  Alignment can be done at any level of analysis; if
# there are, for example, morpheme tags (<m>), you can make that be
# the level of analysis with the option --unit m
#
# TODO: Add numpy standard docstrings to functions
##########################################################################

I think this is a good standard to keep. All of our method and function docstrings were formatted slightly differently, so I thought we could pick a standard. I put a lot of docstrings in using the numpy docstring format, which I think is a nice balance of readability and features (we can autodoc them in Sphinx). If anybody has strong feelings, we can change this, but just let me know! You can look at some of the commits here for examples or check it out here. There's a nice extension for VS Code here, but you have to change the setting (Preferences > Settings > Extensions) to use numpy docstrings (the default is docBlockr).
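
For reference, a function documented in the numpy style looks something like this (a hypothetical sketch, not a function from this codebase):

def make_fsg(xml_path, unit="w"):
    """Build an FSG file from an ID'd XML file.

    Parameters
    ----------
    xml_path : str
        Path to the input XML file.
    unit : str, optional
        Element name to use as the unit of analysis (default "w").

    Returns
    -------
    str
        The generated FSG file contents.
    """
    ...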

With the g2p functionality gutted, you'll notice all the lang files are gone. I'd like to try to get us to enter the lang files into g2p first. Languages get added as folders by ISO code here: https://github.com/roedoejet/g2p/tree/master/g2p/mappings/langs. Then, lookup tables can be added as xlsx, csv, or json files. Configurations for the lookup tables are specified in YAML, like so:

<<: &shared
  language_name: Atikamekw
mappings:
  - display_name: Atikamekw to IPA
    in_lang: atj
    out_lang: atj-ipa
    type: mapping
    authors:
      - David Huggins Daines
      - Patrick Littell
    mapping: atj_to_ipa.json
    <<: *shared
  - display_name: Atikamekw IPA to English IPA
    in_lang: atj-ipa
    out_lang: eng-ipa
    type: mapping
    mapping: atj_ipa_to_eng_ipa.json
    <<: *shared
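
The JSON mapping files referenced in the config are essentially lists of find/replace rules; roughly like this (illustrative values only, not the real contents of atj_to_ipa.json):

[
  { "in": "c", "out": "tʃ" },
  { "in": "o", "out": "u" }
]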

I copied over all of the lang files from ReadAlong-Studio and put them in as-is, but you can continue to add more, and the g2p module now also has a built-in IPA mapping method in its CLI.

So, let's say you just have a mapping from atj to atj-ipa, but no automatic mapping from atj-ipa to eng-ipa yet. After pip install g2p, you can just type g2p generate-mapping atj --ipa and it will put the generated lookup table and its corresponding config file in g2p/mappings/langs/generated. Currently you then have to update the cached version of the mappings manually by typing g2p update, but this will likely be rolled into the g2p generate-mapping command.
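
Once the mappings are in place, the whole chain can also be used programmatically; roughly like the following, assuming the g2p package exposes a make_g2p entry point (a sketch only, check the current g2p API):

from g2p import make_g2p

# Build a transducer that composes atj -> atj-ipa -> eng-ipa.
transducer = make_g2p("atj", "eng-ipa")
print(transducer("kwei").output_string)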

Hopefully this is clear, let me know if any of you have any questions, and I'll add @joanise as a reviewer once @dhdaines adds him to the repo.

@roedoejet (Collaborator, Author)

So I guess because I took some of the unit tests with me, coverage decreased by 4.7%, and that's showing up as a failed check. I think this is fine; once we merge this, we can coordinate our unit testing and get our coverage back up.

@littell (Collaborator) commented Sep 20, 2019

> The only thing left to move over, really, is LexiconG2P, and I'm only half-certain I want to bring it over.

Maybe it's time to scrap LexiconG2P and integrate a better third-party English g2p. I've already run into a case where the English name "Skyler" didn't appear in the lexicon. Or we can just treat every lexical item in there as a find/replace rule, sort them by length like usual, and add a few letter-level rules to sweep up the rest. (E.g. so "SKY" is the longest thing caught there, and then rules for "L" and "ER" pick up the rest.)
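
A minimal sketch of that longest-match-first idea, with a toy rule table (hypothetical rules, not the real lexicon):

# Longest-match-first find/replace g2p sketch.
RULES = {
    "SKY": ["S", "K", "AY1"],
    "ER": ["ER0"],
    "L": ["L"],
}

def lexicon_g2p(word):
    word = word.upper()
    phones = []
    i = 0
    while i < len(word):
        # Try the longest rule that matches at position i.
        for length in range(len(word) - i, 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            i += 1  # no rule matched; skip this character
    return phones

print(lexicon_g2p("Skyler"))  # ['S', 'K', 'AY1', 'L', 'ER0']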

Aside from this specific question, there's a bigger question of whether the g2p library should remain a single-paradigm library (only find/replace rules) or whether it could develop into a multi-paradigm library (FSTs, HMMs, NNs, etc.).

I lean towards the former; I prefer the idea that any g2p mapping could be opened and manipulated in "G2P Studio", and that they all keep a straightforward execution paradigm so that people can understand what they're looking at.

Either way, we should keep a loose coupling between the alignment and g2p libraries, so that people can integrate existing g2p solutions without refactoring the alignment library and can handle languages like English, Chinese, Japanese, etc. with more sophisticated solutions.

@roedoejet (Collaborator, Author)

> Maybe it's time to scrap LexiconG2P and integrate a better third-party English g2p. I've already run into a case where the English name "Skyler" didn't appear in the lexicon. Or we can just treat every lexical item in there as a find/replace rule, sort them by length like usual, and add a few letter-level rules to sweep up the rest. (E.g. so "SKY" is the longest thing caught there, and then rules for "L" and "ER" pick up the rest.)

I'm happy either way. I don't know any good, lightweight English g2p libraries off-hand. Any suggestions? We could also use some combination of the LexiconG2P and find/replace rules as a fallback for situations like Skyler. I'll take your lead on this!

> Aside from this specific question, there's a bigger question of whether the g2p library should remain a single-paradigm library (only find/replace rules) or whether it could develop into a multi-paradigm library (FSTs, HMMs, NNs, etc.).

Yes, I think this was part of my hesitation around putting in the LexiconG2P.

> I lean towards the former; I prefer the idea that any g2p mapping could be opened and manipulated in "G2P Studio", and that they all keep a straightforward execution paradigm so that people can understand what they're looking at.

Agreed

> Either way, we should keep a loose coupling between the alignment and g2p libraries, so that people can integrate existing g2p solutions without refactoring the alignment library and can handle languages like English, Chinese, Japanese, etc. with more sophisticated solutions.

Also agreed. I think I should go back in and make this coupling a little looser for exactly that reason. Thanks!

@dhdaines (Collaborator) commented Sep 20, 2019 via email

@roedoejet (Collaborator, Author) commented Sep 20, 2019

I just tried this (pip install g2p_en) and it requires some other kind-of-heavy Python packages like numpy and nltk (plus the punkt data), but maybe that's less heavy than the Flite solution?

>>> from g2p_en import G2p
>>> g2p = G2p()
>>> g2p('Skyler')
['S', 'K', 'AY1', 'L', 'ER0']

@littell (Collaborator) commented Sep 20, 2019

I lean towards anything we can pip install. And we already use numpy, at least, for the lang_id functionality. (I'm also planning on updating the "unidecode fallback" g2p to something more sophisticated, and I'll need numpy for that as well.)

Even so, the Flite install raises another good question: eventually, people will probably want to use g2p solutions that are more heavyweight. We should probably define a REST-y API so that one can serve compatible g2p systems inside containers, rather than end up with a monster package that installs everything, or a forked codebase. Just: "If you want to bring your own G2P, spin up a web service that accepts the following requests."

(Even for this g2p library. I can imagine scenarios where someone doesn't want their language-specific rules inside the library, but is fine with other people accessing them as a web service.)
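
Something like the following minimal Flask sketch; the /convert endpoint name and payload shape are invented here for illustration, not a settled spec:

from flask import Flask, jsonify, request

app = Flask(__name__)

def my_g2p(text, in_lang, out_lang):
    # Stub: plug in whatever g2p engine backs this service.
    return list(text.upper())

@app.route("/convert", methods=["POST"])
def convert():
    # Expects JSON like {"in_lang": "atj", "out_lang": "eng-ipa", "text": "kwei"}
    payload = request.get_json()
    phones = my_g2p(payload["text"], payload["in_lang"], payload["out_lang"])
    return jsonify({"text": payload["text"], "phones": phones})

if __name__ == "__main__":
    app.run()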

@roedoejet (Collaborator, Author) commented Sep 20, 2019

> I lean towards anything we can pip install. And we already use numpy, at least, for the lang_id functionality. (I'm also planning on updating the "unidecode fallback" g2p to something more sophisticated, and I'll need numpy for that as well.)

Right, it's not in our requirements.txt, so this is a good reminder to add it.

> Even so, the Flite install raises another good question: eventually, people will probably want to use g2p solutions that are more heavyweight. We should probably define a REST-y API so that one can serve compatible g2p systems inside containers, rather than end up with a monster package that installs everything, or a forked codebase. Just: "If you want to bring your own G2P, spin up a web service that accepts the following requests."

This should be easy enough. We could expose a similar API through G2P Studio and document it with Swagger; then people could use the g2p Swagger spec to bootstrap their own API in whatever language makes sense for their project.
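
Client-side, bringing your own g2p would then just mean POSTing to whichever service implements the spec; e.g. (hypothetical endpoint, matching the sketch above):

import requests

resp = requests.post(
    "http://localhost:5000/convert",
    json={"in_lang": "atj", "out_lang": "eng-ipa", "text": "kwei"},
)
print(resp.json()["phones"])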

@roedoejet (Collaborator, Author) commented Sep 20, 2019

So, for action items:
