Code and materials for our Interspeech 2025 paper: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training.
- Download the SSL-NL dataset into `SSL-NL/`.
- Initialize the uv environment (first run `sudo pip install uv` to install the uv package manager if you haven't yet):
  - 2.1 Select python version & create environment
    ```
    uv python install 3.10.4
    uv python pin 3.10.4
    uv venv 'SSL-NL-env'
    ```
  - 2.2 Activate environment and install packages
    ```
    source SSL-NL-env/bin/activate
    uv pip install -r pyproject.toml
    ```
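Before moving on to extraction, it can help to confirm the setup steps above produced the expected layout. A minimal sanity-check sketch (not part of the repo; the two directory names follow the steps above):

```shell
# Check that the dataset directory and the virtual environment both exist.
check_paths() {
  for p in SSL-NL SSL-NL-env; do
    if [ -d "$p" ]; then
      echo "ok: $p/"
    else
      echo "missing: $p/"
    fi
  done
}

check_paths
```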
- Extract embeddings for each model and segment set (`phone`, `word-clustering`, `word-rsa`) by running the embedding extraction script. For example:
  ```
  python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="phone"
  python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="word-clustering"
  python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="word-rsa"
  ```
  This will write the embeddings as `.pkl` files into `embeddings/`.
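The extracted files can be loaded back with Python's `pickle` module for inspection. A minimal round-trip sketch (the toy dict below is purely illustrative; the actual object layout inside the repo's embedding files may differ):

```python
import os
import pickle
import tempfile

# Illustrative embeddings object (an assumption, not the repo's real layout):
# a dict mapping segment labels to embedding vectors.
toy_embeddings = {"phone:a": [0.1, 0.2, 0.3], "phone:e": [0.4, 0.5, 0.6]}

path = os.path.join(tempfile.gettempdir(), "toy_embs.pkl")

# Write the object the same way the extraction step writes .pkl files ...
with open(path, "wb") as f:
    pickle.dump(toy_embeddings, f)

# ... and load it back for inspection.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(type(loaded), len(loaded))  # <class 'dict'> 2
```

For the repo's actual output, point `path` at a file under `embeddings/` instead of the toy pickle.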
- (Re)compute measures by running the analysis scripts for the extracted embeddings of each SSL-NL subset (MLS and IFADV). For example:
  ```
  python phone_probing.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python phone_ABX.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python phone_PCA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python phone_LDA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python word_PCA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-clustering_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python word_LDA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-clustering_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  python word_RSA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-rsa_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
  ```
  This will save the analysis results to `results/`.
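Since the same seven commands must be run for both subsets, a small generator loop can save typing. A convenience sketch (not part of the repo; it prints the commands rather than running them, so you can review before piping the output to `sh`):

```shell
# Print every analysis command for both SSL-NL subsets.
EMB_PREFIX="embeddings/amsterdamNLP_Wav2Vec2-NL"

gen_commands() {
  for subset in MLS IFADV; do
    # Phone-level analyses use the phone embeddings file.
    for script in phone_probing phone_ABX phone_PCA phone_LDA; do
      echo "python ${script}.py --embeddings_file=${EMB_PREFIX}_phone_embs.pkl --model_name=w2v2-nl --subset=${subset}"
    done
    # Word-level analyses use the word-clustering and word-rsa files.
    echo "python word_PCA.py --embeddings_file=${EMB_PREFIX}_word-clustering_embs.pkl --model_name=w2v2-nl --subset=${subset}"
    echo "python word_LDA.py --embeddings_file=${EMB_PREFIX}_word-clustering_embs.pkl --model_name=w2v2-nl --subset=${subset}"
    echo "python word_RSA.py --embeddings_file=${EMB_PREFIX}_word-rsa_embs.pkl --model_name=w2v2-nl --subset=${subset}"
  done
}

gen_commands
```

Run the printed commands with `gen_commands | sh` once they look right.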
The paper can be cited as follows:
de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260, doi: 10.21437/Interspeech.2025-1526
```bibtex
@inproceedings{deheerkloots25_interspeech,
  title     = {{What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training}},
  author    = {Marianne {de Heer Kloots} and Hosein Mohebbi and Charlotte Pouw and Gaofei Shen and Willem Zuidema and Martijn Bentum},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {256--260},
  doi       = {10.21437/Interspeech.2025-1526},
  issn      = {2958-1796},
}
```