
Test #8 (Merged)
merged 10 commits on Oct 31, 2019

Conversation

@roedoejet (Collaborator)

Hey all,

So this is hopefully the last big exorcism of the g2p functionality from ReadAlong-Studio. The only thing left to move over, really, is LexiconG2P, and I'm only half-certain I want to bring it over.

We've been documenting modules well with a standard that puts a shebang and an encoding declaration at the top:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

The module docstring is then wrapped in pound signs, like:

#########################################################################
#
# end_to_end.py
#
# Takes an XML file (preferably using TEI conventions) and
# makes:
#
# 1. An XML file with added IDs for elements (if the elements didn't
#    already have ID attributes)
# 2. An FSG file where the transitions are those IDs.
# 3. A dictionary file giving a mapping between IDs and approximate
#    pronunciations in ARPABET
#
#
# The XML file needs to have xml:lang attributes; if tokenization is
# not performed (indicated with <w> elements) it will be attempted
# automatically.  Alignment can be done at any level of analysis; if
# there are, for example, morpheme tags (<m>), you can make that be
# the level of analysis with the option --unit m
#
# TODO: Add numpy standard docstrings to functions
##########################################################################

I think this is a good standard to keep. All of our method and function docstrings were formatted slightly differently, so I thought we could pick a standard. I put a lot of docstrings in using the numpy docstring format, which I think is a nice balance of readability and features (we can autodoc them in Sphinx). If anybody has strong feelings, we can change this, but just let me know! You can look at some of the commits here for examples or check it out here. There's a nice extension for VS Code here, but you have to change the setting (Preferences > Settings > Extensions) to use numpy docstrings (the default is docBlockr).
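
For reference, a function documented in the numpy style looks something like this (a hypothetical sketch, not a function from this codebase):

def make_fsg(xml_path, unit="w"):
    """Build an FSG file from an ID'd XML file.

    Parameters
    ----------
    xml_path : str
        Path to the input XML file.
    unit : str, optional
        Element name to use as the unit of analysis (default "w").

    Returns
    -------
    str
        The generated FSG file contents.
    """
    ...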

With the g2p functionality gutted, you'll notice all the lang files are gone. I'd like to try to get us to enter the lang files into g2p first. Languages get added as folders by ISO code here: https://github.com/roedoejet/g2p/tree/master/g2p/mappings/langs. Then, lookup tables can be added as xlsx, csv, or json files. Configurations for the lookup tables are specified in YAML, like so:

<<: &shared
  language_name: Atikamekw
mappings:
  - display_name: Atikamekw to IPA
    in_lang: atj
    out_lang: atj-ipa
    type: mapping
    authors:
      - David Huggins Daines
      - Patrick Littell
    mapping: atj_to_ipa.json
    <<: *shared
  - display_name: Atikamekw IPA to English IPA
    in_lang: atj-ipa
    out_lang: eng-ipa
    type: mapping
    mapping: atj_ipa_to_eng_ipa.json
    <<: *shared
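
The JSON mapping files referenced in the config are essentially lists of find/replace rules; roughly like this (illustrative values only, not the real contents of atj_to_ipa.json):

[
  { "in": "c", "out": "tʃ" },
  { "in": "o", "out": "u" }
]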

I copied over all of the lang files from ReadAlong-Studio and put them in as-is, but you can continue to add more, and the g2p module now also has a built-in IPA mapping method in its CLI.

So, let's say you just have a mapping from atj to atj-ipa, but no automatic mapping from atj-ipa to eng-ipa yet. After pip install g2p, you can just type g2p generate-mapping atj --ipa and it will put the generated lookup table and its corresponding config file in g2p/mappings/langs/generated. Currently you then have to update the cached version of the mappings manually by typing g2p update, but this will likely be rolled into the g2p generate-mapping command.
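
Once the mappings are in place, the whole chain can also be used programmatically; roughly like the following, assuming the g2p package exposes a make_g2p entry point (a sketch only, check the current g2p API):

from g2p import make_g2p

# Build a transducer that composes atj -> atj-ipa -> eng-ipa.
transducer = make_g2p("atj", "eng-ipa")
print(transducer("kwei").output_string)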

Hopefully this is clear, let me know if any of you have any questions, and I'll add @joanise as a reviewer once @dhdaines adds him to the repo.

@roedoejet (Collaborator, Author)

So I guess because I took some of the unit tests with me, coverage decreased by 4.7%, and that's showing up as a failed check. I think this is fine; once we merge this, we can coordinate our unit testing and get our coverage back up.

@littell (Collaborator) commented Sep 20, 2019

> The only thing left to move over, really, is LexiconG2P, and I'm only half-certain I want to bring it over.

Maybe it's time to scrap LexiconG2P and integrate a better third-party English g2p. I've already run into a case where the English name "Skyler" didn't appear in the lexicon. Or we can just treat every lexical item in there as a find/replace rule, sort them by length like usual, and add a few letter-level rules to sweep up the rest. (E.g. so "SKY" is the longest thing caught there, and then rules for "L" and "ER" pick up the rest.)
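
A minimal sketch of that longest-match-first idea, with a toy rule table (hypothetical rules, not the real lexicon):

# Longest-match-first find/replace g2p sketch.
RULES = {
    "SKY": ["S", "K", "AY1"],
    "ER": ["ER0"],
    "L": ["L"],
}

def lexicon_g2p(word):
    word = word.upper()
    phones = []
    i = 0
    while i < len(word):
        # Try the longest rule that matches at position i.
        for length in range(len(word) - i, 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            i += 1  # no rule matched; skip this character
    return phones

print(lexicon_g2p("Skyler"))  # ['S', 'K', 'AY1', 'L', 'ER0']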

Aside from this specific question, there's a bigger question of whether the g2p library should remain a single-paradigm library (only find/replace rules) or whether it could develop into a multi-paradigm library (FSTs, HMMs, NNs, etc.).

I lean towards the former; I prefer the idea that any g2p mapping could be opened and manipulated in "G2P Studio", and that they all keep a straightforward execution paradigm so that people can understand what they're looking at.

Either way, we should keep a loose coupling between the alignment and g2p libraries, so that people can integrate existing g2p solutions without refactoring the alignment library and can handle languages like English, Chinese, Japanese, etc. with more sophisticated solutions.

@roedoejet (Collaborator, Author)

> Maybe it's time to scrap LexiconG2P and integrate a better third-party English g2p. I've already run into a case where the English name "Skyler" didn't appear in the lexicon. Or we can just treat every lexical item in there as a find/replace rule, sort them by length like usual, and add a few letter-level rules to sweep up the rest. (E.g. so "SKY" is the longest thing caught there, and then rules for "L" and "ER" pick up the rest.)

I'm happy either way. I don't know any good, lightweight English g2p libraries off-hand. Any suggestions? We could also use some combination of the LexiconG2P and find/replace rules as a fallback for situations like Skyler. I'll take your lead on this!

> Aside from this specific question, there's a bigger question of whether the g2p library should remain a single-paradigm library (only find/replace rules) or whether it could develop into a multi-paradigm library (FSTs, HMMs, NNs, etc.).

Yes, I think this was part of my hesitation around putting in the LexiconG2P.

> I lean towards the former; I prefer the idea that any g2p mapping could be opened and manipulated in "G2P Studio", and that they all keep a straightforward execution paradigm so that people can understand what they're looking at.

Agreed

> Either way, we should keep a loose coupling between the alignment and g2p libraries, so that people can integrate existing g2p solutions without refactoring the alignment library and can handle languages like English, Chinese, Japanese, etc. with more sophisticated solutions.

Also agreed. I think I should go back in and make this coupling a little looser for exactly that reason. Thanks!

@dhdaines (Collaborator) commented Sep 20, 2019 via email

@roedoejet (Collaborator, Author) commented Sep 20, 2019

I just tried this (pip install g2p_en) and it requires some other kind-of-heavy Python packages like numpy and nltk (plus the punkt data), but maybe that's less heavy than the Flite solution?

>>> from g2p_en import G2p
>>> g2p = G2p()
>>> g2p('Skyler')
['S', 'K', 'AY1', 'L', 'ER0']

@littell (Collaborator) commented Sep 20, 2019

I lean towards anything we can pip install. And we already use numpy, at least, for the lang_id functionality. (I'm also planning on updating the "unidecode fallback" g2p to something more sophisticated, and I'll need numpy for that as well.)

Even so, the Flite install raises another good question: eventually, people will probably want to use g2p solutions that are more heavyweight. We should probably define a REST-y API so that one can serve compatible g2p systems inside containers, rather than end up with a monster package that installs everything, or a forked codebase. Just: "If you want to bring your own G2P, spin up a web service that accepts the following requests."

(Even for this g2p library. I can imagine scenarios where someone doesn't want their language-specific rules inside the library, but is fine with other people accessing them as a web service.)
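
Something like the following minimal Flask sketch; the /convert endpoint name and payload shape are invented here for illustration, not a settled spec:

from flask import Flask, jsonify, request

app = Flask(__name__)

def my_g2p(text, in_lang, out_lang):
    # Stub: plug in whatever g2p engine backs this service.
    return list(text.upper())

@app.route("/convert", methods=["POST"])
def convert():
    # Expects JSON like {"in_lang": "atj", "out_lang": "eng-ipa", "text": "kwei"}
    payload = request.get_json()
    phones = my_g2p(payload["text"], payload["in_lang"], payload["out_lang"])
    return jsonify({"text": payload["text"], "phones": phones})

if __name__ == "__main__":
    app.run()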

@roedoejet (Collaborator, Author) commented Sep 20, 2019

> I lean towards anything we can pip install. And we already use numpy, at least, for the lang_id functionality. (I'm also planning on updating the "unidecode fallback" g2p to something more sophisticated, and I'll need numpy for that as well.)

Right, it's not in our requirements.txt, so this is a good reminder to add it.

> Even so, the Flite install raises another good question: eventually, people will probably want to use g2p solutions that are more heavyweight. We should probably define a REST-y API so that one can serve compatible g2p systems inside containers, rather than end up with a monster package that installs everything, or a forked codebase. Just: "If you want to bring your own G2P, spin up a web service that accepts the following requests."

This should be easy enough. We could expose a similar API through G2P Studio and document it with Swagger; then people could use the g2p Swagger spec to bootstrap their own API in whatever language makes sense for their project.
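
Client-side, bringing your own g2p would then just mean POSTing to whichever service implements the spec; e.g. (hypothetical endpoint, matching the sketch above):

import requests

resp = requests.post(
    "http://localhost:5000/convert",
    json={"in_lang": "atj", "out_lang": "eng-ipa", "text": "kwei"},
)
print(resp.json()["phones"])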

@roedoejet (Collaborator, Author) commented Sep 20, 2019

So, for action items:
