Phone-ing it in: Towards Flexible, Multi-Modal Language Model Training using Phonetic Representations of Data

Scripts and code used in the research paper Phone-ing it in: Towards Flexible Multi-Modal Language Model Training by Phonetic Representations of Data.

See also our fork of the MasakhaNER benchmark, converted to a phonetic representation, at this github repo.

We are currently working to clean and upload everything.

The Pipeline: steps from data to F1 score.

Overall, the process was as follows:

  • Created various pretraining datasets using Bash, Epitran for text data, and Allosaurus for audio data: with and without phonemization, with and without spaces, etc. Most of these scripts can be found in src/data.
  • Pretrained SHIBA models on these datasets, enqueueing the pretraining runs via our ClearML server on a local compute cluster. These scripts can be found in src/models.
  • Finetuned on our variations of MasakhaNER, mostly using Google Colab Pro notebooks. Copies of these notebooks can be found in src/models as well.

Data Preprocessing

Using our scripts, we created a variety of training and test sets, taking various base sets and creating phonemized variations, with or without spaces, etc.

For a detailed listing, including a number of processing details, see this list.

Scripts

Various scripts used for data processing

Processing audio data to phones.

  • convert_mp3_folder_to_wav_and_convert_bitrate.sh: used to convert mp3 files to .wav at the bitrate expected by allosaurus. We used this for Common Voice.
  • break_folder_into_subsets.sh: used to break the massive Common Voice dataset into a number of smaller folders, each with a more manageable number of files.
  • phonemize_audio_data.sh, used to run phone recognition on ALFFA Swahili dataset.
  • run_allosaurus_on_common_voice.sh: script for running allosaurus on Common Voice; requires conversion to .wav first. It is very similar to the script above, mostly adapted to a different folder structure and to converting one split at a time. It includes some notes on the process, including errors that resulted in converting only 205/258 subfolders: we had split the training set into 258 subfolders, each with a maximum of 2k files, and set up a loop to go through the various "train" subfolders (e.g. train1, train2). The loop did 204/258 of them before the process failed; due to time constraints, we decided to move on to the dev/test sets, which we converted without incident.
  • Phone inventories: used in the phone recognition process so that we can be sure that allosaurus/epitran output the same symbols.
  • remove_spaces_recursively_from_text_files_in_folder.sh: allosaurus outputs a space between each phone; this script removes them.
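
The mp3-to-wav step above can be sketched as follows. This is a hypothetical reconstruction, not the repository's script: the folder paths are illustrative, and we assume 16 kHz mono 16-bit PCM as the target format, since that is what Allosaurus is typically fed.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of convert_mp3_folder_to_wav_and_convert_bitrate.sh.
# Paths are illustrative; 16 kHz mono 16-bit PCM is assumed as the target.
set -euo pipefail

convert_folder() {
  local src=$1 dst=$2
  mkdir -p "$dst"
  for mp3 in "$src"/*.mp3; do
    [ -e "$mp3" ] || continue                      # folder may have no mp3s
    local base; base=$(basename "${mp3%.mp3}")
    ffmpeg -loglevel error -y -i "$mp3" \
      -ac 1 -ar 16000 -c:a pcm_s16le "$dst/$base.wav"
  done
}

# Demo: run on an empty temp folder (a no-op); point src at the real
# Common Voice clips folder to convert it.
src=$(mktemp -d); dst=$(mktemp -d)
convert_folder "$src" "$dst"
echo "wav files in $dst: $(ls "$dst" | wc -l)"
```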
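
The subset-splitting and the resumable per-split loop can be sketched together as below. The folder names and the .done-marker scheme are our reconstruction, not the repository's exact code; the recognizer is passed in as a command so the demo can use a stand-in (the real run would invoke Allosaurus, e.g. `python -m allosaurus.run -i clip.wav`).

```shell
#!/usr/bin/env bash
# Sketch of break_folder_into_subsets.sh plus a resumable recognition loop.
set -euo pipefail

# Move the files in $1 into sibling folders ${1}1, ${1}2, ..., max $2 files each.
split_folder() {
  local src=$1 max=$2 i=1 n=0
  mkdir -p "${src}${i}"
  for f in "$src"/*; do
    [ -f "$f" ] || continue
    if [ "$n" -ge "$max" ]; then
      i=$((i + 1)); n=0
      mkdir -p "${src}${i}"
    fi
    mv "$f" "${src}${i}/"
    n=$((n + 1))
  done
}

# Run a command once per trainN subfolder, skipping ones already marked done,
# so a crashed run (204/258 splits in our case) can resume where it stopped.
process_splits() {
  local root=$1; shift
  for split in "$root"/train[0-9]*/; do
    [ -e "${split}.done" ] && continue
    "$@" "$split"
    touch "${split}.done"
  done
}

# Demo with 5 dummy clips, 2 per subfolder, and a stand-in recognizer that
# just logs file names (the real run used max=2000 on Common Voice).
root=$(mktemp -d)
mkdir "$root/train"
for k in 1 2 3 4 5; do touch "$root/train/clip$k.wav"; done
split_folder "$root/train" 2              # -> train1 train2 train3
touch "$root/train1/.done"                # pretend train1 finished pre-crash
export LOG="$root/processed.log"
process_splits "$root" sh -c 'ls "$0" >> "$LOG"'
```

Marking each subfolder on completion is what makes the loop restartable: only train2 and train3 are processed in the demo, since train1 carries a marker.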
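
The space-removal step is simple enough to sketch in full; the recursive find-plus-sed approach below is our reconstruction, and the .txt extension is an assumption on our part.

```shell
#!/usr/bin/env bash
# Sketch of remove_spaces_recursively_from_text_files_in_folder.sh: Allosaurus
# writes phones separated by spaces; strip them from every .txt file, in place.
set -euo pipefail

strip_spaces() {
  # GNU sed shown; on BSD/macOS use: sed -i '' 's/ //g'
  find "$1" -type f -name '*.txt' -exec sed -i 's/ //g' {} +
}

# Demo
dir=$(mktemp -d)
printf 'n i n a p e n d a\n' > "$dir/utt1.txt"
strip_spaces "$dir"
cat "$dir/utt1.txt"   # ninapenda
```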

Text Data: Pre-cleaning ALFFA "gold" transcriptions.

  • clean_ALFFA_gold_transcriptions.sh: the ALFFA dataset's audio files came paired with already-created transcriptions, which we call "gold" transcriptions. We had to do some editing/preprocessing on these.

Text Data: Grapheme to Phoneme

We converted several datasets of text/graphemes to phones: the ALFFA gold transcriptions, the Huggingface "Swahili language modeling" set, etc.
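
As an illustration of this grapheme-to-phoneme step, the sketch below runs Epitran on one Swahili line. The `swa-Latn` code and the `Epitran(...).transliterate(...)` call follow Epitran's documented API; the wrapper and file name are our own invention, and the script prints a notice instead if Epitran isn't installed.

```shell
#!/usr/bin/env bash
# Hedged sketch of grapheme-to-phoneme conversion with Epitran.
set -euo pipefail

g2p_line() {
  python3 - "$1" <<'EOF'
import sys
try:
    import epitran
except ImportError:
    print("[epitran not installed]")
    sys.exit(0)
# swa-Latn = Swahili in Latin script (per Epitran's language codes)
print(epitran.Epitran("swa-Latn").transliterate(sys.argv[1]))
EOF
}

g2p_line "ninapenda chakula" > phones.txt
cat phones.txt
```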

Convert MasakhaNER to character-based annotations.
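
The token-to-character conversion can be sketched as below. We assume the conventional expansion for character models like SHIBA: a token's B- tag stays on its first character and becomes I- on the rest, with O on inter-token spaces; the repository's exact scheme may differ.

```shell
#!/usr/bin/env bash
# Hedged sketch of token-level -> character-level NER tag expansion.
set -euo pipefail

out=$(python3 - <<'EOF'
def to_char_tags(tokens, tags):
    """Expand token-level BIO tags to one tag per character."""
    chars, ctags = [], []
    for i, (tok, tag) in enumerate(zip(tokens, tags)):
        if i > 0:                       # space between tokens is outside spans
            chars.append(" ")
            ctags.append("O")
        for j, ch in enumerate(tok):
            chars.append(ch)
            if tag.startswith("B-") and j > 0:
                ctags.append("I-" + tag[2:])
            else:
                ctags.append(tag)
    return chars, ctags

chars, ctags = to_char_tags(["Yuko", "Dar"], ["O", "B-LOC"])
print("".join(chars))          # Yuko Dar
print(ctags[5], ctags[6])      # tags for "D" and "a": B-LOC I-LOC
EOF
)
echo "$out"
```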

Converting datasets to Shiba-compatible jsonlines format
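
A minimal sketch of the jsonlines conversion, assuming one JSON object per line with a "text" field; that key is our guess at the SHIBA-compatible schema, not confirmed from the repository.

```shell
#!/usr/bin/env bash
# Hedged sketch: wrap each non-empty line of a plain-text corpus as
# {"text": ...} jsonlines.
set -euo pipefail

to_jsonl() {
  python3 - "$1" "$2" <<'EOF'
import json, sys

src, dst = sys.argv[1], sys.argv[2]
with open(src, encoding="utf-8") as f, open(dst, "w", encoding="utf-8") as g:
    for line in f:
        line = line.strip()
        if line:                     # skip blank lines
            g.write(json.dumps({"text": line}, ensure_ascii=False) + "\n")
EOF
}

# Demo
dir=$(mktemp -d)
printf 'ninapenda\n\nchakula\n' > "$dir/corpus.txt"
to_jsonl "$dir/corpus.txt" "$dir/corpus.jsonl"
cat "$dir/corpus.jsonl"
```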

Processed Dataset Samples:

Language Model Pretraining

MasakhaNER Fine-tuning

Experiments used Google Colab Pro+ for fine-tuning, as well as ClearML for tracking. Adapting these for use outside of that environment is left as an exercise for the reader.

Experimental Results

About

For the "phone-it-in" ACL 2022 paper
