SSL-NL-eval

Code and materials for our Interspeech 2025 paper: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training.

To use this repository:

1. Download the SSL-NL dataset into `SSL-NL/`.
2. Initialize the uv environment (first install the uv package manager, e.g. with `pip install uv`, if you haven't yet):

   2.1 Select Python version & create environment

   ```shell
   uv python install 3.10.4
   uv python pin 3.10.4
   uv venv 'SSL-NL-env'
   ```

   2.2 Activate environment and install packages

   ```shell
   source SSL-NL-env/bin/activate
   uv pip install -r pyproject.toml
   ```
3. Extract embeddings for each model and segment set (`phone`, `word-clustering`, `word-rsa`) by running the embedding extraction script. For example:

   ```shell
   python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="phone"
   python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="word-clustering"
   python extract_embeddings.py --model_id="amsterdamNLP/Wav2Vec2-NL" --segments="word-rsa"
   ```

   This will write the embeddings as `.pkl` files into `embeddings/`.
4. (Re)compute measures by running the analysis scripts on the extracted embeddings for each SSL-NL subset (`MLS` and `IFADV`). For example:

   ```shell
   python phone_probing.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   python phone_ABX.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   python phone_PCA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   python phone_LDA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_phone_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   python word_PCA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-clustering_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   python word_LDA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-clustering_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   python word_RSA.py --embeddings_file="embeddings/amsterdamNLP_Wav2Vec2-NL_word-rsa_embs.pkl" --model_name="w2v2-nl" --subset="MLS"
   ```

   This will save the analysis results to `results/`.
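The extraction step above stores embeddings as pickle files. A minimal sketch of loading and inspecting such a file with Python's standard `pickle` module — note that the exact structure of the pickled object (here assumed to be a dict mapping segment labels to lists of embedding vectors) is an illustrative assumption and may differ from what `extract_embeddings.py` actually writes:

```python
import pickle
from pathlib import Path

# Hypothetical file name and structure, for illustration only:
# a dict mapping segment labels to lists of embedding vectors.
embs_path = Path("embeddings/example_embs.pkl")

# Write a small toy file so this sketch is self-contained.
toy_embeddings = {"a:": [[0.1, 0.2], [0.3, 0.4]], "e:": [[0.5, 0.6]]}
embs_path.parent.mkdir(exist_ok=True)
with open(embs_path, "wb") as f:
    pickle.dump(toy_embeddings, f)

# Load the file back and summarize its contents.
with open(embs_path, "rb") as f:
    embeddings = pickle.load(f)

for label, vectors in embeddings.items():
    print(f"{label}: {len(vectors)} vectors of dim {len(vectors[0])}")
```

As usual with pickle, only load files from sources you trust, since unpickling can execute arbitrary code.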
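Running every analysis script for each model, segment set, and subset quickly becomes repetitive. One way to script the sweep is to generate the command lines shown above programmatically — a sketch, assuming the `embeddings/<model>_<segments>_embs.pkl` naming pattern from the examples; the single model listed is the one from the examples, and further `(model_id, model_name)` pairs would need to be filled in by hand:

```python
# Compose one command line per (script, subset) pair, mirroring the
# examples above. Extend `models` with further (model_id, model_name)
# pairs as needed.
models = [("amsterdamNLP/Wav2Vec2-NL", "w2v2-nl")]
subsets = ["MLS", "IFADV"]
# Which analysis scripts consume which segment set's embeddings.
analyses = {
    "phone": ["phone_probing.py", "phone_ABX.py", "phone_PCA.py", "phone_LDA.py"],
    "word-clustering": ["word_PCA.py", "word_LDA.py"],
    "word-rsa": ["word_RSA.py"],
}

commands = []
for model_id, model_name in models:
    prefix = model_id.replace("/", "_")  # e.g. amsterdamNLP_Wav2Vec2-NL
    for segments, scripts in analyses.items():
        embs_file = f"embeddings/{prefix}_{segments}_embs.pkl"
        for script in scripts:
            for subset in subsets:
                commands.append(
                    f'python {script} --embeddings_file="{embs_file}" '
                    f'--model_name="{model_name}" --subset="{subset}"'
                )

for cmd in commands:
    print(cmd)
```

The generated strings could then be written to a shell script or dispatched one by one with `subprocess.run(cmd, shell=True)`.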

Citation

The paper can be cited as follows:

de Heer Kloots, M., Mohebbi, H., Pouw, C., Shen, G., Zuidema, W., Bentum, M. (2025) What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training. Proc. Interspeech 2025, 256-260, doi: 10.21437/Interspeech.2025-1526

```bibtex
@inproceedings{deheerkloots25_interspeech,
  title     = {{What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training}},
  author    = {Marianne {de Heer Kloots} and Hosein Mohebbi and Charlotte Pouw and Gaofei Shen and Willem Zuidema and Martijn Bentum},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {256--260},
  doi       = {10.21437/Interspeech.2025-1526},
  issn      = {2958-1796},
}
```
