Code and materials for our Interspeech 2025 paper: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training.
- Download the SSL-NL dataset into `SSL-NL/`.
- Initialize the uv environment (first run `sudo pip install uv` to install the uv package manager if you haven't yet):
  - 2.1 Select python version & create environment
    ```
    uv python install 3.10.4
    uv python pin 3.10.4
    uv venv 'SSL-NL-env'
    ```
  - 2.2 Activate environment and install packages
    ```
    source SSL-NL-env/bin/activate
    uv pip install -r pyproject.toml
    ```
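Before moving on to extraction, it can help to confirm the setup steps above produced the expected layout. A minimal sanity-check sketch (not part of the repo; the two directory names follow the steps above):

```shell
# Check that the dataset directory and the virtual environment both exist.
check_paths() {
  for p in SSL-NL SSL-NL-env; do
    if [ -d "$p" ]; then
      echo "ok: $p/"
    else
      echo "missing: $p/"
    fi
  done
}

check_paths
```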
- Extract embeddings for each model and segment set (`phone`, `word-clustering`, `word-rsa`) by running the embedding extraction script. For example:
  ```
  python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="phone"
  python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="word-clustering"
  python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="word-rsa"
  ```
  This will write the embeddings as `.pkl` files into `embeddings/`.
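The extracted files can be loaded back with Python's `pickle` module for inspection. A minimal round-trip sketch (the toy dict below is purely illustrative; the actual object layout inside the repo's embedding files may differ):

```python
import os
import pickle
import tempfile

# Illustrative embeddings object (an assumption, not the repo's real layout):
# a dict mapping segment labels to embedding vectors.
toy_embeddings = {"phone:a": [0.1, 0.2, 0.3], "phone:e": [0.4, 0.5, 0.6]}

path = os.path.join(tempfile.gettempdir(), "toy_embs.pkl")

# Write the object the same way the extraction step writes .pkl files ...
with open(path, "wb") as f:
    pickle.dump(toy_embeddings, f)

# ... and load it back for inspection.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(type(loaded), len(loaded))  # <class 'dict'> 2
```

For the repo's actual output, point `path` at a file under `embeddings/` instead of the toy pickle.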
- (Re)compute measures by running the analysis scripts for the extracted embeddings of each SSL-NL subset (MLS and IFADV). For example:
  ```
  python phone_probing.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python phone_ABX.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python phone_PCA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python phone_LDA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python word_PCA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-clustering_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python word_LDA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-clustering_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python word_RSA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-rsa_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  ```
  This will save the analysis results to `results/`.
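Since the same seven commands must be run for both subsets, a small generator loop can save typing. A convenience sketch (not part of the repo; it prints the commands rather than running them, so you can review before piping the output to `sh`):

```shell
# Print every analysis command for both SSL-NL subsets.
EMB_PREFIX="embeddings/amsterdamNLP_Wav2Vec2-NL"

gen_commands() {
  for subset in MLS IFADV; do
    # Phone-level analyses use the phone embeddings file.
    for script in phone_probing phone_ABX phone_PCA phone_LDA; do
      echo "python ${script}.py --embeddings_file=${EMB_PREFIX}_phone_embs.pkl --model_name=w2v2-nl --subset=${subset}"
    done
    # Word-level analyses use the word-clustering and word-rsa files.
    echo "python word_PCA.py --embeddings_file=${EMB_PREFIX}_word-clustering_embs.pkl --model_name=w2v2-nl --subset=${subset}"
    echo "python word_LDA.py --embeddings_file=${EMB_PREFIX}_word-clustering_embs.pkl --model_name=w2v2-nl --subset=${subset}"
    echo "python word_RSA.py --embeddings_file=${EMB_PREFIX}_word-rsa_embs.pkl --model_name=w2v2-nl --subset=${subset}"
  done
}

gen_commands
```

Run the printed commands with `gen_commands | sh` once they look right.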
The paper can be cited as follows:
de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260, doi: 10.21437/Interspeech.2025-1526
```bibtex
@inproceedings{deheerkloots25_interspeech,
  title     = {{What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training}},
  author    = {Marianne {de Heer Kloots} and Hosein Mohebbi and Charlotte Pouw and Gaofei Shen and Willem Zuidema and Martijn Bentum},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {256--260},
  doi       = {10.21437/Interspeech.2025-1526},
  issn      = {2958-1796},
}
```