oscar345/thesis-master

thesis-master

This is the repository for my master's thesis, in which a BERT model was trained on a limited amount of data. The challenge was to create a well-performing model despite the limited amount of data.

To do so, we used a variant of curriculum learning, in which the model was first trained on a simpler dataset and then on the more complex dataset.
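The staged training described above can be pictured with a minimal sketch. It is not the code in main.py: the names `run_curriculum` and `train_stage` are hypothetical, and the only assumption is that the same model is carried from the simpler dataset to the more complex one.

```python
from typing import Any, Callable, Iterable, Tuple


def run_curriculum(
    stages: Iterable[Tuple[str, Any]],
    train_stage: Callable[[Any, str, Any], Any],
    model: Any = None,
) -> Any:
    """Train on each stage in the given order, carrying the model forward.

    `stages` is an ordered list of (name, dataset) pairs, simplest first.
    `train_stage` is a hypothetical callback that trains `model` on one
    dataset and returns the updated model.
    """
    for stage_name, dataset in stages:
        # The model returned by one stage is the starting point of the next.
        model = train_stage(model, stage_name, dataset)
    return model
```

The order of `stages` is the curriculum: putting the simplified dataset first and the original dataset last reproduces the two-step setup described above.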

The datasets were simplified by reducing the number of types in the dataset. We did this by replacing infrequently used types with a more frequently used umbrella term.
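As a rough illustration of this simplification, the sketch below keeps the most frequent types and replaces every other type with an umbrella term. It is a sketch under assumptions, not the preprocessing in main.py: the function `simplify_corpus` and the `umbrella_of` mapping are hypothetical, and how the umbrella terms are actually chosen is not shown here.

```python
from collections import Counter
from typing import Dict, List


def simplify_corpus(
    sentences: List[List[str]],
    most_common: int,
    umbrella_of: Dict[str, str],
) -> List[List[str]]:
    """Replace infrequent types with an umbrella term.

    `umbrella_of` is a hypothetical mapping from a type to its umbrella
    term; types without an entry are left unchanged.
    """
    # Count how often each type occurs in the whole corpus.
    counts = Counter(token for sentence in sentences for token in sentence)
    # Types among the `most_common` most frequent ones are kept as they are.
    keep = {token for token, _ in counts.most_common(most_common)}
    return [
        [token if token in keep else umbrella_of.get(token, token) for token in sentence]
        for sentence in sentences
    ]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "poodle", "sat"]]
simplified = simplify_corpus(corpus, most_common=2, umbrella_of={"poodle": "dog"})
```

In this toy corpus "poodle" is replaced by "dog", so the number of distinct types drops from five to four while the sentences keep their length.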

Data

The data for training this model is available on the website of the BabyLM shared task. We used the data from the 2023 edition of the shared task: BabyLM 2023

Files

The repository contains the following files:

  • main.py: The file with which you can preprocess the data, train a tokenizer and BERT model, and plot statistics about the data and results of the model.
  • requirements.txt: The file with all the required packages to run the code.
  • .tool-versions: The file with the versions of programs used in this project. We only used Python 3.11.0. If you have asdf installed, you can run asdf install to install the correct version of Python.

Usage

To run the code, you need to have Python 3.11.0 installed. Once it is installed, you can create a virtual environment and install the required packages by running the following commands:

python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate it (on Windows: .venv\Scripts\activate)
pip install -r requirements.txt # Install the required packages

After you have installed the required packages, you can run the code by running the following command:

python main.py --help

This will show which commands are available to run certain tasks. For example, to preprocess the data and create dataset 1, you can run the following command:

python main.py preprocess --input-dir data/babylm_data/babylm_10M --output-dir data/preprocessed/step-1 --most-common 1000 --clusters 500
