Spoken Word2Vec

This repo contains a Python implementation of the spoken word2vec models described in the following paper:

@inproceedings{spokenW2V,
  author={Mohammad Amaan Sayeed and Hanan Aldarmaki},
  title={{Spoken Word2Vec: Learning Skipgram Embeddings from Speech}},
  year=2024,
  booktitle={Proceedings of INTERSPEECH 2024}
}

These scripts are extensions of the character-based skipgram models available here.

Sample Data

We provide a subset of files from LibriSpeech dev-clean to illustrate the expected directory structure for the feature extraction scripts. To replicate the results reported in the paper, you need to generate features for the whole LibriSpeech ASR Corpus train-clean-100 set.

Alignment files are needed to identify word boundaries.
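
For reference, the expected input follows the standard LibriSpeech directory layout, sketched below (where the alignment files live relative to the audio is an assumption; check the top of the feature extraction scripts for the exact paths they expect):

```
dev-clean/
└── 84/                          # speaker id
    └── 121123/                  # chapter id
        ├── 84-121123-0000.flac
        ├── 84-121123-0001.flac
        └── ...
```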

We also provide the full train-clean-100 set in text format, librispeech_100.txt, and character-based word vectors, char_embeddings.vec; these are needed for evaluation.

Dependencies

The scripts run in Python 3.x. You will need the following packages:

os, tqdm, argparse, pickle, pandas
torch, numpy, s3prl, sklearn, librosa
nltk, Levenshtein, gensim

We tested the code with Python 3.11.7, torch 1.13.1, and scikit-learn 1.2.1.
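
As a starting point, the third-party packages can typically be installed with pip (the names below are the usual PyPI names; os, argparse, and pickle ship with Python; pin versions as needed to match the tested configuration above):

```
pip install tqdm pandas torch numpy s3prl scikit-learn librosa nltk Levenshtein gensim
```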

You will also need sufficient storage and system RAM for feature extraction. For example, if you extract HuBERT features, you will need at least 86GB of storage to run steps 1 and 2 for the train-clean-100 set, and more than 100GB of system RAM. In our experiments, we ran the code using one A100 GPU (40GB GPU RAM) and 230GB system RAM.

Steps

1. Feature Extraction

python step_1_extract_features.py --feature_type hubert
python step_2_process_feats.py --feature_type hubert

These steps process the input folder and generate the specified acoustic features from the corresponding s3prl upstream model. The supported feature types are mfcc, hubert, and wav2vec2. Check the top of each script for additional details. The output is a list of utterances, where each utterance is a list of words and each word is a sequence of acoustic feature vectors.
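
For intuition, here is a minimal sketch of how frame-level features are typically obtained from an s3prl upstream model. This illustrates the general s3prl hub API, not the exact code in step_1_extract_features.py; the layer choice and the dummy waveform are assumptions:

```python
import torch
import s3prl.hub as hub

device = "cuda" if torch.cuda.is_available() else "cpu"
model = getattr(hub, "hubert")().to(device)   # or "mfcc", "wav2vec2"
model.eval()

# wavs: a list of 1-D float tensors sampled at 16 kHz
# (in practice, loaded with e.g. librosa.load(path, sr=16000))
wavs = [torch.randn(16000, dtype=torch.float).to(device)]

with torch.no_grad():
    reps = model(wavs)["hidden_states"]       # one (batch, frames, dim) tensor per layer
features = reps[-1]                           # taking the last layer is an assumption
```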

2. KMeans Clustering

python step_3_create_clusters.py --feature_type hubert

This step trains a KMeans clustering model on 10% of the input vectors, then applies the clustering to all the vectors. The output is a list of utterances, where each utterance is a list of words and each word is a sequence of cluster ids.
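
A minimal sketch of this subsample-then-assign pattern with scikit-learn (the input path, the use of MiniBatchKMeans, and the number of clusters are all assumptions; see step_3_create_clusters.py for the actual settings):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# all_feats: (N, dim) array of frame-level features (hypothetical file name)
all_feats = np.load("hubert_feats.npy")

# Fit on a random 10% subsample, then assign cluster ids to every vector.
rng = np.random.default_rng(0)
idx = rng.choice(len(all_feats), size=len(all_feats) // 10, replace=False)
kmeans = MiniBatchKMeans(n_clusters=100, random_state=0).fit(all_feats[idx])
cluster_ids = kmeans.predict(all_feats)
```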

Alternatively, you can skip the previous steps and use the attached features in features.zip.

3. Train Skipgram Model


python step_4_sgns_C_clustered_hubert.py 4

This script trains the end-to-end skipgram with negative sampling (SGNS) model using the discrete features generated in the previous step. The command-line argument specifies the scale, s, which can be an integer from 1 to 4 (or more, but we only tested up to 4). This replicates the best-performing model in the paper. Training for 100 epochs may take about 2 days. The learned embeddings are evaluated at the end of each epoch using correlations (see the paper for more details).
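
For reference, a minimal sketch of the standard SGNS objective in PyTorch. This illustrates the loss the model optimizes, not the repo's end-to-end architecture, which encodes each word from its cluster-id sequence; all names here are hypothetical:

```python
import torch
import torch.nn.functional as F

def sgns_loss(center_vec, context_vec, negative_vecs):
    """Skipgram negative-sampling loss for one (center, context) pair.

    center_vec:    (dim,)   embedding of the center word
    context_vec:   (dim,)   embedding of a true context word
    negative_vecs: (k, dim) embeddings of k sampled negative words
    """
    pos = F.logsigmoid(center_vec @ context_vec)             # pull the true pair together
    neg = F.logsigmoid(-(negative_vecs @ center_vec)).sum()  # push negatives apart
    return -(pos + neg)
```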

4. Visualize Training Progress

If you save the output of the previous step to a file, you can use the following script to plot the correlations over the course of training:

python plot_corr.py train_s4.out
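
plot_corr.py parses the saved training log; a minimal sketch of this kind of log-parsing-and-plotting step is shown below (the "correlation: <float>" line format is an assumption about what the training script prints):

```python
import re
import sys
import matplotlib.pyplot as plt

# Collect one correlation value per epoch from the training log.
corrs = []
with open(sys.argv[1]) as f:
    for line in f:
        m = re.search(r"correlation:\s*([-\d.]+)", line)
        if m:
            corrs.append(float(m.group(1)))

plt.plot(range(1, len(corrs) + 1), corrs)
plt.xlabel("epoch")
plt.ylabel("correlation")
plt.savefig("train_corr.png")
```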

