reorganize repo for thesis submission
JanProvaznik committed May 9, 2024
1 parent f91fd8a commit e9c3642
Showing 29 changed files with 43 additions and 24 deletions.
55 changes: 37 additions & 18 deletions README.md
@@ -1,9 +1,11 @@
# Enigma Transformed

## Abstract
This project explores the possibility of using a pretrained language model to decrypt ciphers. The aim is also to discover what linguistic features of a text the model learns to use by varying the test set and measuring accuracy.

We explore the possibility of using a pre-trained Transformer language model to decrypt ciphers. The aim is also to discover which linguistic features of a text the model learns to use, by measuring correlations between error rates and those features.

1. create evaluation dataset with linguistic properties
2. train model on decipherment
3. evaluate correlations and predictability from linguistic properties

## Docs
### How to run
@@ -14,7 +16,7 @@ pip install -e .
```
#### Slurm cluster
- basic setting: `sbatch -p gpu -c1 --gpus=1 --mem=16G <bash_script_path>`
- use `run_notebook.sh <notebook_path>` to run a Jupyter notebook on a slurm cluster
- use `./run_notebook.sh <notebook_path>` to run a Jupyter notebook on a slurm cluster

#### Colab
- clone this repo and use the desired `.ipynb` files
@@ -23,17 +25,16 @@ pip install -e .
!git clone https://github.com/JanProvaznik/enigma-transformed
!pip install transformers[torch] Levenshtein py-enigma
```
### Meta info
- uses lowercased letters in all experiments
- spaces are usually preserved, punctuation only in some experiments
- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset
- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers

### Source code
#### reproducible/
- contains, for each experiment, a notebook that reproduces it in a readable manner
- scripts for fine-tuning ByT5 on ciphers (a minimal loading sketch follows the experiment lists below)
- Used in thesis:
  - `21_vignere3_noisy_random_news_en.ipynb`, `22_vignere3_noisy_random_news_de.ipynb`, `23_vignere3_noisy_random_news_cs.ipynb` fine-tune ByT5 to decrypt a Vigenère cipher with a random 3-letter key on news sentences
  - `24_const_noisy_enigma_news_cs.ipynb`, `25_const_noisy_enigma_news_de.ipynb`, `26_const_noisy_enigma_news_en.ipynb` fine-tune ByT5 to decrypt a simplified Enigma cipher on news sentences


- old experiments in `unused/`
- `01_copy_random_text.ipynb` - trains model to copy on random strings
- `02_copy_news.ipynb` - trains model to copy on news sentences
- `03_caesar_random_text.ipynb` - trains model to decrypt a constant Caesar cipher (only one setting) on random strings
@@ -47,10 +48,22 @@ pip install -e .
- `10_vignere3_news` - trains model to decrypt a constant 3-letter Vigenère cipher on news sentences
- `11_vignere_long_news` - trains model to decrypt a constant Vigenère cipher with key 'helloworld' on news sentences
- `12_vignere_multiple_news` - trains model to decrypt a 2-letter Vigenère cipher with 3 settings on news sentences
- ... and more
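
All of these notebooks share the same basic setup: load a byte-level ByT5 checkpoint and fine-tune it as an ordinary seq2seq model on (ciphertext, plaintext) pairs. A minimal sketch of that setup; the checkpoint name `google/byt5-small` and the toy Caesar-shifted example are illustrative, not taken from the notebooks:
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# ByT5 operates on raw UTF-8 bytes, so ciphertext needs no special vocabulary
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# toy pair: Caesar-shifted input, plaintext target
inputs = tokenizer("wklv lv d whvw", return_tensors="pt")
labels = tokenizer("this is a test", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
```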

#### data/
- `weird_classify.ipynb` and `lang_classify.ipynb` - filter out malformed or wrong-language sentences
- `measure_dataset(cs,de).ipynb` - annotates a dataset with linguistic properties
- `evaluation_batchedgpuevaluate_other_models.ipynb` - runs inference to produce decipherments from different model checkpoints

#### analysis/
- `loss_curves.ipynb` - visualizes training loss curves together with error density at checkpoints
- `corr_matrices.ipynb` - creates correlation matrices of error rates and linguistic properties
- `evo_correlation.ipynb` - graphs the evolution of correlations between error rates and linguistic properties
- `pred_shap.ipynb` - predicts error rates with simple ML models and analyzes them with SHAP (see the sketch below)
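
A rough sketch of the `pred_shap.ipynb` idea, assuming a tree-based regressor; the feature names below are hypothetical placeholders for the annotated linguistic properties:
```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical stand-ins: one row per sentence, columns are linguistic
# properties, y is the measured per-sentence error rate
X = pd.DataFrame(np.random.rand(200, 3),
                 columns=["word_freq", "char_entropy", "sent_length"])
y = np.random.rand(200)

model = GradientBoostingRegressor().fit(X, y)
explainer = shap.TreeExplainer(model)  # exact, fast SHAP values for tree models
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)      # which properties drive predicted error rates
```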

#### run_notebook.sh
- script for running a notebook on a slurm cluster

#### run_notebook.sh and run_notebook4gpu.sh
- scripts for running training or inference notebooks on a Slurm cluster with GPUs


#### src/
@@ -77,8 +90,7 @@ pip install -e .
##### `lens_train.py`
- script to replicate reproducible/03 with the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library and minimal resources (only a 1-layer transformer; see the sketch below)
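
A rough sketch of how a minimal 1-layer model might be instantiated with TransformerLens; all config values below are illustrative assumptions, not the script's actual hyperparameters:
```python
from transformer_lens import HookedTransformer, HookedTransformerConfig

cfg = HookedTransformerConfig(
    n_layers=1,    # a single transformer block
    d_model=128,
    n_ctx=128,     # maximum sequence length
    n_heads=4,
    d_head=32,
    d_vocab=32,    # e.g. a-z, space, and a few special tokens
    act_fn="relu",
)
model = HookedTransformer(cfg)  # small enough to train on minimal resources
```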


### Usual experiment pipeline
### What happens when training
0. get data from the internet or generate it
1. filter the data for the given experiment (e.g. only sentences 100-200 characters long)
2. preprocess the data: only a-z + spaces, trim/pad to desired length
@@ -87,19 +99,26 @@ pip install -e .
5. train the model on the training pairs (a pair-generation sketch follows this list)
6. save the model
7. evaluate the performance of the model (during training and after training)
- e.g. edit distances
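
A minimal sketch of the preprocessing and pair-creation steps for the Vigenère experiments; the function names are illustrative, the actual implementation lives in `src/`:
```python
import random
import string

def vigenere_encrypt(plaintext: str, key: str) -> str:
    """Shift each letter by the matching key letter; non-letters pass through."""
    out, i = [], 0
    for ch in plaintext:
        if ch in string.ascii_lowercase:
            shift = ord(key[i % len(key)]) - ord("a")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
            i += 1
        else:
            out.append(ch)  # spaces are preserved, as in the vigenere notebooks
    return "".join(out)

def make_training_pair(sentence: str) -> tuple[str, str]:
    key = "".join(random.choices(string.ascii_lowercase, k=3))  # random 3-letter key
    return vigenere_encrypt(sentence, key), sentence  # (model input, target)

print(make_training_pair("a preprocessed news sentence"))
```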

### Meta info
- uses lowercased letters in all experiments
- Vigenère preserves spaces, Enigma replaces them with X
- using the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) to measure the error rate of the model on an evaluation dataset (see the sketch after this list)
- using the [statmt newscrawl](https://statmt.org/) dataset to obtain real world text for training and evaluation
- using the [Huggingface Transformers library](https://huggingface.co/transformers/) running on [PyTorch](https://pytorch.org/)
- using pretrained [ByT5](https://arxiv.org/abs/2105.13626) character level models and fine-tuning them on ciphers
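
A minimal sketch of that error-rate measurement, using the `Levenshtein` package from the install instructions above:
```python
import Levenshtein  # pip install Levenshtein

def error_rate(prediction: str, reference: str) -> float:
    """Edit distance normalized by reference length (a character error rate)."""
    return Levenshtein.distance(prediction, reference) / max(len(reference), 1)

print(error_rate("this is a tesx", "this is a test"))  # 1 edit / 14 chars ~ 0.07
```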

### Training hyperparameters
#### number of training examples
- the more the better (if the model sees all cipher configurations, it doesn't have to generalize the cipher procedure, only detect which configuration is used and apply it)

#### trainable parameters in model
- the more the better, but we're limited by the GPU memory (and time), bigger models will use have harder time to use big batch sizes
- the more the better, but we're limited by GPU memory (and time); bigger models have a harder time using big batch sizes
#### epochs
- the more the better, but we're limited by the time we have

#### batch size
- if too low, model won't be able to learn any patterns
- if too low, models won't be able to learn any patterns
- generally the higher the better, but we're limited by the GPU memory
- trick: use gradient accumulation (see the sketch below)
- e.g. batch size 16 with 16 gradient accumulation steps -> effective batch size 256
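
A minimal sketch of the gradient-accumulation trick with the Huggingface trainer arguments; the values mirror the example above and are otherwise illustrative:
```python
from transformers import Seq2SeqTrainingArguments

# effective batch size = 16 * 16 = 256
args = Seq2SeqTrainingArguments(
    output_dir="checkpoints",        # hypothetical path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,  # accumulate over 16 mini-batches per step
    learning_rate=1e-4,
    num_train_epochs=3,
)
```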
File renamed without changes.
2 changes: 1 addition & 1 deletion reproducible/21_vignere3_noisy_random_news_en.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Vignere cipher (all possible settings, length 3) on EN news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/22_vignere3_noisy_random_news_de.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Vignere cipher (all possible settings, length 3) on DE news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/23_vignere3_noisy_random_news_cs.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Vignere cipher (all possible settings, length 3) on CS news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/24_const_noisy_enigma_news_cs.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Enigma cipher on CS news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/25_const_noisy_enigma_news_de.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Vignere cipher (all possible settings, length 3) on news dataset"
"# Enigma cipher on DE news dataset"
]
},
{
2 changes: 1 addition & 1 deletion reproducible/26_const_noisy_enigma_news_en.ipynb
@@ -6,7 +6,7 @@
"metadata": {},
"source": [
"\n",
"# Const noisy enigma on english"
"# Const noisy enigma on english news dataset"
]
},
{
17 more files renamed without changes.
