Applying information criteria in Skip-gram dimensionality selection

Implementation of Skip-gram Dimensionality Selection via information criteria (SNML, AIC, BIC).

Requirement

Make sure the following programs are installed on your computer:

Python 3.7
pip

Set up

Install dependencies

pip install -r requirements.txt

Google cloud storage (GCS) set up

We train models on multiple servers but save the results to a GCS bucket. Create an env.ini file to store the access settings for the bucket. The env.ini file should be placed in the root directory and should contain the following:

[GCS]
sync = no
project_id = xxx
bucket = xxx
app_credential = xxx

Configs:

sync: set to no if you do not want to use GCS
project_id, bucket: the project ID and name of the bucket
app_credential: path to the JSON credential file used to access GCS
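
As a quick way to check that an env.ini file is well formed, the sketch below parses the [GCS] section with Python's standard configparser. The function name load_gcs_settings and the placeholder values are assumptions made for illustration; how the repository itself reads env.ini is not shown here.

```python
import configparser

# Placeholder content in the same shape as env.ini (values are made up).
EXAMPLE_ENV_INI = """
[GCS]
sync = no
project_id = my-project
bucket = my-bucket
app_credential = /path/to/credential.json
"""


def load_gcs_settings(ini_text):
    """Parse the [GCS] section and return its settings as a dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["GCS"]
    return {
        # getboolean understands yes/no, true/false, on/off, 1/0
        "sync": section.getboolean("sync"),
        "project_id": section.get("project_id"),
        "bucket": section.get("bucket"),
        "app_credential": section.get("app_credential"),
    }


settings = load_gcs_settings(EXAMPLE_ENV_INI)
if not settings["sync"]:
    print("GCS sync is disabled; results stay on the local machine.")
```

To check a real file, read it with open("env.ini").read() and pass the text to the same function.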

1. Preprocess data

1.1. Artificial data

Artificial data is generated using Jupyter notebooks. Refer to the notebooks below for the details of the data generation process.

Data for original Skip-Gram model:

notebooks/Generate context distributions.ipynb

Data for Skip-Gram Negative Sampling model:

notebooks/Generate context distributions - SGNS.ipynb

1.2. Text data

Run preprocess.py to preprocess the data. The script takes a .txt file as input. Before running it, remove special characters such as . , : ? from the text file.
Parameters:

input: input text file path
output: output directory
batch_size: batch size of the process
window_size: the window size in the Skip-Gram model

Example:

python preprocess.py --input text8 --output data/text8 --batch_size 1000 --window_size 5

Other preprocessing parameters, such as the subsampling threshold, can be set in config.ini.
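
The two ideas in this step, cleaning the input text and using a window size, can be illustrated with a minimal sketch. It is not the implementation in preprocess.py: the function names clean_text and skipgram_pairs are hypothetical, and the exact cleaning rules and pair format used by the repository may differ.

```python
import re


def clean_text(text):
    """Replace every character that is not a word character or whitespace
    with a space, then split into tokens (hypothetical helper)."""
    return re.sub(r"[^\w\s]", " ", text).split()


def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every token and each neighbour
    at most window_size positions away (hypothetical helper)."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = clean_text("Hello, world: how are you?")
pairs = skipgram_pairs(tokens, window_size=1)
```

A larger window_size produces more pairs per token, which is why it affects both the size of the preprocessed data and the training time.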

2. Train Skip-gram

The data produced by the preprocessing step can be used to train Skip-gram. The training commands are described below:

2.1. Train original Skip-Gram model

The original Skip-Gram model should be trained on GPUs. We use TensorFlow to train this model. Run tf_based/train.py to train it.
Example:

python tf_based/train.py --input_path data/text8/ --batch_size 10 --output_path output/text8/ --epochs 1 --n_embedding 5

See config.ini and tf_based/train.py for more parameter settings.

2.2. Train Skip-Gram model with Negative Sampling

The Skip-Gram Negative Sampling model is trained with numpy. Training needs a context distribution from which negative samples are drawn. The context distribution can be obtained by running utils/context_distribution_from_raw.py.
Run np_based/train.py to train this model.
Example:

python np_based/train.py --input_path data/text8/ --batch_size 10 --output_path output/text8/ --epochs 1 --n_embedding 5

See config.ini and np_based/train.py for more parameter settings.
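
To make the role of the context distribution concrete, here is a minimal sketch of drawing negative samples from it. It uses only the standard library rather than numpy, the function name sample_negatives is hypothetical, and skipping the positive context is a choice made for this illustration; np_based/train.py may handle these details differently.

```python
import random


def sample_negatives(context_distribution, n_negative, positive, rng):
    """Draw n_negative contexts according to context_distribution
    (a dict mapping context -> probability), skipping the positive context."""
    contexts = list(context_distribution)
    weights = [context_distribution[c] for c in contexts]
    # Guard against an endless loop when nothing but the positive can be drawn.
    if not any(w > 0 for c, w in zip(contexts, weights) if c != positive):
        raise ValueError("no context other than the positive has positive probability")
    negatives = []
    while len(negatives) < n_negative:
        candidate = rng.choices(contexts, weights=weights, k=1)[0]
        if candidate != positive:
            negatives.append(candidate)
    return negatives


# Toy distribution with made-up probabilities.
toy_distribution = {"a": 0.5, "b": 0.3, "c": 0.2}
negatives = sample_negatives(toy_distribution, n_negative=5, positive="a", rng=random.Random(0))
```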

3. Estimate AIC & BIC

Estimate AIC and BIC for the original Skip-Gram and Skip-Gram Negative Sampling models with the following programs:

Original Skip-Gram:

python tf_based/run_aic_bic.py

Skip-Gram Negative Sampling:

python np_based/run_aic_bic.py

See each Python file for its parameter settings.
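
For reference, the standard definitions are AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of free parameters, n the number of data points, and ln L the maximized log-likelihood; the dimension with the smallest value is selected. The sketch below applies them to a few candidate dimensions. The parameter count n_embedding * (vocab_size + n_context), the function names, and all numbers are assumptions made for illustration, not values or code from run_aic_bic.py.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


def skipgram_param_count(vocab_size, n_context, n_embedding):
    """Assumed count: one embedding vector per word and one per context."""
    return n_embedding * (vocab_size + n_context)


def select_dimension(log_likelihoods, vocab_size, n_context, n_samples, criterion):
    """Return the candidate dimension with the smallest criterion value.
    log_likelihoods maps n_embedding -> maximized log-likelihood."""
    def score(dim):
        k = skipgram_param_count(vocab_size, n_context, dim)
        if criterion == "aic":
            return aic(log_likelihoods[dim], k)
        return bic(log_likelihoods[dim], k, n_samples)
    return min(log_likelihoods, key=score)


# Made-up log-likelihoods for three candidate dimensions.
toy_log_likelihoods = {5: -1500.0, 10: -1100.0, 50: -1090.0}
best_by_aic = select_dimension(toy_log_likelihoods, 20, 20, 1000, "aic")
best_by_bic = select_dimension(toy_log_likelihoods, 20, 20, 1000, "bic")
```

Because BIC multiplies the parameter count by ln n instead of 2, it penalizes large dimensions more heavily than AIC once n exceeds about 7 data points, so the two criteria can select different dimensions on the same data.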

4. Estimate SNML codelength

Estimate the SNML codelength for the original Skip-Gram and Skip-Gram Negative Sampling models with the following programs:

Original Skip-Gram:

python tf_based/snml/tf_based/train_snml.py

Skip-Gram Negative Sampling:

python np_based/train_snml.py

Parameters:

model: output directory of trained model
context_path: path to context distribution
snml_train_file: data file whose elements will have their codelength estimated
scope: number of data elements to estimate
epochs: number of gradient descent epochs used when training on data in SNML
n_context_sample: number of samples used in importance sampling
learning_rate: gradient descent learning rate used when training on data in SNML
continue_from: (do not set this in the first run) the number of elements in the last run, from which to continue
continue_scope: (do not set this in the first run) when continuing from a previous run, the scope used in that previous run
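
The n_context_sample parameter controls an importance sampling estimate. As a generic illustration of that technique only, the sketch below estimates a sum of exponentiated scores over all contexts from a limited number of sampled contexts. The function name importance_sampled_sum and the toy numbers are hypothetical, and which quantity train_snml.py actually estimates this way is an assumption not confirmed by this README.

```python
import math
import random


def importance_sampled_sum(scores, proposal, n_samples, rng):
    """Estimate sum(exp(s) for s in scores) by sampling indices from `proposal`
    (a list of probabilities) and reweighting each draw by 1 / proposal[i]."""
    indices = rng.choices(range(len(scores)), weights=proposal, k=n_samples)
    return sum(math.exp(scores[i]) / proposal[i] for i in indices) / n_samples


# Toy scores and a uniform proposal distribution (made-up values).
toy_scores = [0.0, 1.0, 2.0, -1.0]
uniform = [1.0 / len(toy_scores)] * len(toy_scores)
estimate = importance_sampled_sum(toy_scores, uniform, n_samples=1000, rng=random.Random(0))
exact = sum(math.exp(s) for s in toy_scores)
```

More samples reduce the variance of the estimate at the cost of more computation, and a proposal that is close to proportional to the summed terms reduces the variance further.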
