reorganize repo for thesis submission
JanProvaznik committed May 9, 2024
1 parent f91fd8a commit e9c3642
Showing 29 changed files with 43 additions and 24 deletions.
55 changes: 37 additions & 18 deletions README.md
@@ -1,9 +1,11 @@
# Enigma Transformed

## Abstract
This project explores the possibility of using a pretrained language model to decrypt ciphers. The aim is also to discover what linguistic features of a text the model learns to use by varying the test set and measuring accuracy.

We explore the possibility of using a pre-trained Transformer language model to decrypt ciphers. The aim is also to discover which linguistic features of a text the model learns to use, by measuring correlations between error rates and those features.

1. create evaluation dataset with linguistic properties
2. train model on decipherment
3. evaluate correlations and predictability from linguistic properties

## Docs
### How to run
@@ -14,7 +16,7 @@ pip install -e .
```
#### Slurm cluster
- basic setting: `sbatch -p gpu -c1 --gpus=1 --mem=16G <bash_script_path>`
- use `run_notebook.sh <notebook_path>` to run a Jupyter notebook on a slurm cluster
- use `./run_notebook.sh <notebook_path>` to run a Jupyter notebook on a slurm cluster

#### Colab
- clone this repo and use the desired `.ipynb` files
@@ -23,17 +25,16 @@ pip install -e .
!git clone https://github.com/JanProvaznik/enigma-transformed
!pip install transformers[torch] Levenshtein py-enigma
```
### Meta info
- uses lowercased letters in all experiments
- spaces are usually preserved, punctuation only in some experiments
- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset
- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers

### Source code
#### reproducible/
- contains, for each experiment, a notebook that reproduces it in a readable manner
- scripts for fine-tuning ByT5 on ciphers (a minimal loading sketch follows the experiment lists below)
- Used in thesis:
  - `21_vignere3_noisy_random_news_en.ipynb`, `22_vignere3_noisy_random_news_de.ipynb`, `23_vignere3_noisy_random_news_cs.ipynb` fine-tune ByT5 to decrypt a Vigenère cipher with a random 3-letter key on news sentences
  - `24_const_noisy_enigma_news_cs.ipynb`, `25_const_noisy_enigma_news_de.ipynb`, `26_const_noisy_enigma_news_en.ipynb` fine-tune ByT5 to decrypt a simplified Enigma cipher on news sentences


- old experiments in `unused/`
- `01_copy_random_text.ipynb` - trains model to copy on random strings
- `02_copy_news.ipynb` - trains model to copy on news sentences
- `03_caesar_random_text.ipynb` - trains model to decrypt a constant Caesar cipher (only one setting) on random strings
@@ -47,10 +48,22 @@ pip install -e .
- `10_vignere3_news` - trains model to decrypt a constant 3-letter Vigenère cipher on news sentences
- `11_vignere_long_news` - trains model to decrypt a constant Vigenère cipher with key 'helloworld' on news sentences
- `12_vignere_multiple_news` - trains model to decrypt a 2-letter Vigenère cipher with 3 settings on news sentences
- ... and more
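
All of these notebooks share the same basic setup: load a byte-level ByT5 checkpoint and fine-tune it as an ordinary seq2seq model on (ciphertext, plaintext) pairs. A minimal sketch of that setup; the checkpoint name `google/byt5-small` and the toy Caesar-shifted example are illustrative, not taken from the notebooks:
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# ByT5 operates on raw UTF-8 bytes, so ciphertext needs no special vocabulary
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# toy pair: Caesar-shifted input, plaintext target
inputs = tokenizer("wklv lv d whvw", return_tensors="pt")
labels = tokenizer("this is a test", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
```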

#### data/
- `weird_classify.ipynb` and `lang_classify.ipynb` - filter out malformed or wrong-language sentences
- `measure_dataset(cs,de).ipynb` - annotates a dataset with linguistic properties
- `evaluation_batchedgpuevaluate_other_models.ipynb` - runs inference to produce decipherments from different model checkpoints

#### analysis/
- `loss_curves.ipynb` - visualizes training loss curves together with error density at checkpoints
- `corr_matrices.ipynb` - creates correlation matrices of error rates and linguistic properties
- `evo_correlation.ipynb` - graphs the evolution of correlations between error rates and linguistic properties
- `pred_shap.ipynb` - predicts error rates with simple ML models and analyzes them with SHAP (see the sketch below)
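
A rough sketch of the `pred_shap.ipynb` idea, assuming a tree-based regressor; the feature names below are hypothetical placeholders for the annotated linguistic properties:
```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical stand-ins: one row per sentence, columns are linguistic
# properties, y is the measured per-sentence error rate
X = pd.DataFrame(np.random.rand(200, 3),
                 columns=["word_freq", "char_entropy", "sent_length"])
y = np.random.rand(200)

model = GradientBoostingRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)  # exact, fast SHAP values for tree models
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)      # which properties drive predicted error rates
```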

#### run_notebook.sh
- script for running a notebook on a slurm cluster

#### run_notebook.sh and run_notebook4gpu.sh
- scripts for running training or inference notebooks on a Slurm cluster with GPUs


#### src/
@@ -77,8 +90,7 @@ pip install -e .
##### `lens_train.py`
- script to replicate reproducible/03 with the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library and minimal resources (only a 1-layer transformer; see the sketch below)
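
A rough sketch of how a minimal 1-layer model might be instantiated with TransformerLens; all config values below are illustrative assumptions, not the script's actual hyperparameters:
```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1,    # a single transformer block
    d_model=128,
    n_ctx=128,     # maximum sequence length
    n_heads=4,
    d_head=32,
    d_vocab=32,    # e.g. a-z, space, and a few special tokens
    act_fn="relu",
)
model = HookedTransformer(cfg)  # small enough to train on minimal resources
```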


### Usual experiment pipeline
### What happens when training
0. get data from the internet or generate it
1. filter the data for the given experiment (e.g. only sentences 100-200 characters long)
2. preprocess the data: only a-z + spaces, trim/pad to desired length
@@ -87,19 +99,26 @@ pip install -e .
5. train the model on the training pairs (a pair-generation sketch follows this list)
6. save the model
7. evaluate the performance of the model (during training and after training)
- e.g. edit distances
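
A minimal sketch of the preprocessing and pair-creation steps for the Vigenère experiments; the function names are illustrative, the actual implementation lives in `src/`:
```python
import random
import string

def vigenere_encrypt(plaintext: str, key: str) -> str:
    """Shift each letter by the matching key letter; non-letters pass through."""
    out, i = [], 0
    for ch in plaintext:
        if ch in string.ascii_lowercase:
            shift = ord(key[i % len(key)]) - ord("a")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            i += 1
        else:
            out.append(ch)  # spaces are preserved, as in the vigenere notebooks
    return "".join(out)

def make_training_pair(sentence: str) -> tuple[str, str]:
    key = "".join(random.choices(string.ascii_lowercase, k=3))  # random 3-letter key
    return vigenere_encrypt(sentence, key), sentence  # (model input, target)

print(make_training_pair("a preprocessed news sentence"))
```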

### Meta info
- uses lowercased letters in all experiments
- Vigenère preserves spaces, Enigma replaces them with X
- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset (see the sketch after this list)
- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers
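
A minimal sketch of that error-rate measurement, using the `Levenshtein` package from the install instructions above:
```python
import Levenshtein  # pip install Levenshtein

def error_rate(prediction: str, reference: str) -> float:
    """Edit distance normalized by reference length (a character error rate)."""
    return Levenshtein.distance(prediction, reference) / max(len(reference), 1)

print(error_rate("this is a tesx", "this is a test"))  # 1 edit / 14 chars ~ 0.07
```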

### Training hyperparameters
#### number of training examples
- the more the better (if the model sees all cipher configurations, it doesn't have to generalize the cipher procedure, only detect which configuration is used and apply it)

#### trainable parameters in model
- the more the better, but we're limited by the GPU memory (and time), bigger models will use have harder time to use big batch sizes
- the more the better, but we're limited by GPU memory (and time); bigger models have a harder time using big batch sizes
#### epochs
- the more the better, but we're limited by the time we have

#### batch size
- if too low, model won't be able to learn any patterns
- if too low, models won't be able to learn any patterns
- generally the higher the better, but we're limited by the GPU memory
- trick: use gradient accumulation (see the sketch below)
- e.g. batch size 16 with 16 gradient accumulation steps -> effective batch size 256
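
A minimal sketch of the gradient-accumulation trick with the Huggingface trainer arguments; the values mirror the example above and are otherwise illustrative:
```python
from transformers import Seq2SeqTrainingArguments

# effective batch size = 16 * 16 = 256
args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",        # hypothetical path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,  # accumulate over 16 mini-batches per step
    learning_rate=1e-4,
    num_train_epochs=3,
)
```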
File renamed without changes.
2 changes: 1 addition & 1 deletion reproducible/21_vignere3_noisy_random_news_en.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Vignere cipher (all possible settings, length 3) on EN news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/22_vignere3_noisy_random_news_de.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Vignere cipher (all possible settings, length 3) on DE news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/23_vignere3_noisy_random_news_cs.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Vignere cipher (all possible settings, length 3) on CS news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/24_const_noisy_enigma_news_cs.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Enigma cipher on CS news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/25_const_noisy_enigma_news_de.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Enigma cipher on DE news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/26_const_noisy_enigma_news_en.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Const noisy enigma on english"
"# Const noisy enigma on english news dataset"
]
},
{
17 more files renamed without changes.
