Add documentation on caching models. (#504)
* add docs on caching the models.

* push changes
vwxyzjn authored Jan 8, 2025
1 parent c0183be commit c00cf71
Showing 3 changed files with 33 additions and 3 deletions.
28 changes: 28 additions & 0 deletions docs/ai2_internal.md
@@ -1,3 +1,31 @@
# Job submissions

This document details some best practices when submitting jobs in our cluster.

## Caching on Weka (Ai2-specific)

Most of our clusters come with a shared file system (e.g., [WEKA](https://beaker-docs.apps.allenai.org/)). To avoid downloading the same models hundreds or thousands of times, we should cache the models and datasets in the shared file system. This can be done via

```bash
python mason.py \
--cluster ai2/jupiter-cirrascale-2 ai2/saturn-cirrascale ai2/neptune-cirrascale --image nathanl/open_instruct_auto --pure_docker_mode \
--workspace ai2/tulu-3-dev \
--priority normal \
--preemptible \
--budget ai2/allennlp \
--gpus 0 -- python scripts/cache_hf.py \
--model_name_or_path "allenai/Llama-3.1-Tulu-3-8B-DPO" \
--model_revision "1208_dpo_13b_tune8e-7__allenai_open_instruct_dev__8__1733807565" \
--dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0
```

`mason.py` is our job submission script. It takes the command after `--` and runs it on the specified clusters. During job submission, it automatically tries to set up a shared Hugging Face cache via environment variables. For example, it sets:
* `HF_HOME=/weka/oe-adapt-default/allennlp/.cache/huggingface`
* `HF_DATASETS_CACHE=/weka/oe-adapt-default/allennlp/.cache/huggingface`
* `HF_HUB_CACHE=/weka/oe-adapt-default/allennlp/.cache/hub`

As a result, the `allenai/Llama-3.1-Tulu-3-8B-DPO` model and the `allenai/RLVR-GSM-MATH-IF-Mixed-Constraints` dataset will be cached in the shared file system.
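
To sanity-check that a job actually picked up the shared cache, you can print the paths the Hugging Face libraries resolve at runtime. A minimal sketch (not part of `mason.py`; assumes `datasets` and `huggingface_hub` are installed in the job image):

```python
# Print the cache locations the HF libraries will actually use.
# Expected values come from the environment variables mason.py sets.
import os

from datasets import config as ds_config
from huggingface_hub import constants as hub_constants

print(os.environ.get("HF_HOME"))      # /weka/oe-adapt-default/allennlp/.cache/huggingface
print(hub_constants.HF_HUB_CACHE)     # /weka/oe-adapt-default/allennlp/.cache/hub
print(ds_config.HF_DATASETS_CACHE)    # /weka/oe-adapt-default/allennlp/.cache/huggingface
```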

### Ai2 Internal Evaluation

We provide a script integrated with Beaker for use internally at Ai2. For example, to run all the Tulu 3 evals with easy uploading:
2 changes: 2 additions & 0 deletions mason.py
```diff
@@ -48,6 +48,7 @@ def get_args():
         help="Beaker clusters on which the job could be run.",
         required=True,
     )
+    parser.add_argument("--max_retries", type=int, help="Number of retries", default=1)
     parser.add_argument("--budget", type=str, help="Budget to use.", required=True)
     parser.add_argument("--gpus", type=int, help="Number of gpus", default=0)
     parser.add_argument("--num_nodes", type=int, help="Number of nodes", default=1)
@@ -482,6 +483,7 @@ def main():
         description=args.description,
         tasks=[make_task_spec(args, command, i, beaker_secrets, whoami, args.resumable) for i, command in enumerate(commands)],
         budget=args.budget,
+        retry=beaker.RetrySpec(allowed_task_retries=args.max_retries)
     )
 
     exp = beaker_client.experiment.create(spec=experiment_spec)
```
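
With `--max_retries`, a submission can ask Beaker to resubmit a failed or preempted task automatically. A hypothetical invocation, reusing the flags from the caching example above (the flag defaults to 1):

```bash
# Retry the task up to 3 times on failure or preemption.
python mason.py \
    --cluster ai2/jupiter-cirrascale-2 \
    --image nathanl/open_instruct_auto --pure_docker_mode \
    --workspace ai2/tulu-3-dev \
    --priority normal \
    --preemptible \
    --max_retries 3 \
    --budget ai2/allennlp \
    --gpus 0 -- python scripts/cache_hf.py \
    --model_name_or_path "allenai/Llama-3.1-Tulu-3-8B-DPO" \
    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0
```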
6 changes: 3 additions & 3 deletions scripts/cache_hf.py
```diff
@@ -22,9 +22,9 @@
     --preemptible \
     --budget ai2/allennlp \
     --gpus 0 -- python scripts/cache_hf.py \
-    --model_name_or_path allenai/open_instruct_dev \
-    --model_revision olmo1124_13b_4k_finetune_epoch_2_7.5e-06__42__1732416565 \
-    --dataset_mixer_list ai2-adapt-dev/WildChat-prefs-280824_olmo2_7b 1.0 ai2-adapt-dev/sft_v3.9_if_taxonomy_olmo2_7b 1.0 ai2-adapt-dev/sft_v3.9_p0_olmo2_7b 1.0 ai2-adapt-dev/sft_v3.9_p1_olmo2_7b 1.0 ai2-adapt-dev/ultrafeedback_cleaned_olmo2_7b 1.0 allenai/tulu-3-pref-personas-instruction-following 1.0
+    --model_name_or_path "allenai/Llama-3.1-Tulu-3-8B-DPO" \
+    --model_revision "1208_dpo_13b_tune8e-7__allenai_open_instruct_dev__8__1733807565" \
+    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0
 """
```


