Add documentation on caching models. (#504)
* add docs on caching the models.

* push changes
vwxyzjn authored Jan 8, 2025
1 parent c0183be commit c00cf71
Showing 3 changed files with 33 additions and 3 deletions.
28 changes: 28 additions & 0 deletions docs/ai2_internal.md
@@ -1,3 +1,31 @@
# Job submissions

This document details some best practices when submitting jobs in our cluster.

## Caching on Weka (Ai2-specific)

Most of our clusters come with a shared file system (e.g., [WEKA](https://beaker-docs.apps.allenai.org/)). To avoid downloading the same models hundreds or thousands of times, we should cache the models and datasets in the shared file system. This can be done via

```bash
python mason.py \
--cluster ai2/jupiter-cirrascale-2 ai2/saturn-cirrascale ai2/neptune-cirrascale --image nathanl/open_instruct_auto --pure_docker_mode \
--workspace ai2/tulu-3-dev \
--priority normal \
--preemptible \
--budget ai2/allennlp \
--gpus 0 -- python scripts/cache_hf.py \
--model_name_or_path "allenai/Llama-3.1-Tulu-3-8B-DPO" \
--model_revision "1208_dpo_13b_tune8e-7__allenai_open_instruct_dev__8__1733807565" \
--dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0
```

`mason.py` is our job submission script. It takes the command after `--` and runs it on the specified clusters. During job submission, it automatically tries to set up a shared Hugging Face cache via environment variables. For example, it sets:
* `HF_HOME=/weka/oe-adapt-default/allennlp/.cache/huggingface`
* `HF_DATASETS_CACHE=/weka/oe-adapt-default/allennlp/.cache/huggingface`
* `HF_HUB_CACHE=/weka/oe-adapt-default/allennlp/.cache/hub`

As a result, the `allenai/Llama-3.1-Tulu-3-8B-DPO` model and the `allenai/RLVR-GSM-MATH-IF-Mixed-Constraints` dataset will be cached in the shared file system.
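
To sanity-check that a job actually picked up the shared cache, you can print the paths the Hugging Face libraries resolve at runtime. A minimal sketch (not part of `mason.py`; assumes `datasets` and `huggingface_hub` are installed in the job image):

```python
# Print the cache locations the HF libraries will actually use.
# Expected values come from the environment variables mason.py sets.
import os

from datasets import config as ds_config
from huggingface_hub import constants as hub_constants

print(os.environ.get("HF_HOME"))      # /weka/oe-adapt-default/allennlp/.cache/huggingface
print(hub_constants.HF_HUB_CACHE)     # /weka/oe-adapt-default/allennlp/.cache/hub
print(ds_config.HF_DATASETS_CACHE)    # /weka/oe-adapt-default/allennlp/.cache/huggingface
```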

### Ai2 Internal Evaluation

We provide a script integrated with Beaker for use internally at Ai2. For example, to run all the Tulu 3 evals with easy uploading:
2 changes: 2 additions & 0 deletions mason.py
```diff
@@ -48,6 +48,7 @@ def get_args():
         help="Beaker clusters on which the job could be run.",
         required=True,
     )
+    parser.add_argument("--max_retries", type=int, help="Number of retries", default=1)
     parser.add_argument("--budget", type=str, help="Budget to use.", required=True)
     parser.add_argument("--gpus", type=int, help="Number of gpus", default=0)
     parser.add_argument("--num_nodes", type=int, help="Number of nodes", default=1)
@@ -482,6 +483,7 @@ def main():
         description=args.description,
         tasks=[make_task_spec(args, command, i, beaker_secrets, whoami, args.resumable) for i, command in enumerate(commands)],
         budget=args.budget,
+        retry=beaker.RetrySpec(allowed_task_retries=args.max_retries)
     )
 
     exp = beaker_client.experiment.create(spec=experiment_spec)
```
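
With `--max_retries`, a submission can ask Beaker to resubmit a failed or preempted task automatically. A hypothetical invocation, reusing the flags from the caching example above (the flag defaults to 1):

```bash
# Retry the task up to 3 times on failure or preemption.
python mason.py \
    --cluster ai2/jupiter-cirrascale-2 \
    --image nathanl/open_instruct_auto --pure_docker_mode \
    --workspace ai2/tulu-3-dev \
    --priority normal \
    --preemptible \
    --max_retries 3 \
    --budget ai2/allennlp \
    --gpus 0 -- python scripts/cache_hf.py \
    --model_name_or_path "allenai/Llama-3.1-Tulu-3-8B-DPO" \
    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0
```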
6 changes: 3 additions & 3 deletions scripts/cache_hf.py
```diff
@@ -22,9 +22,9 @@
     --preemptible \
     --budget ai2/allennlp \
     --gpus 0 -- python scripts/cache_hf.py \
-    --model_name_or_path allenai/open_instruct_dev \
-    --model_revision olmo1124_13b_4k_finetune_epoch_2_7.5e-06__42__1732416565 \
-    --dataset_mixer_list ai2-adapt-dev/WildChat-prefs-280824_olmo2_7b 1.0 ai2-adapt-dev/sft_v3.9_if_taxonomy_olmo2_7b 1.0 ai2-adapt-dev/sft_v3.9_p0_olmo2_7b 1.0 ai2-adapt-dev/sft_v3.9_p1_olmo2_7b 1.0 ai2-adapt-dev/ultrafeedback_cleaned_olmo2_7b 1.0 allenai/tulu-3-pref-personas-instruction-following 1.0
+    --model_name_or_path "allenai/Llama-3.1-Tulu-3-8B-DPO" \
+    --model_revision "1208_dpo_13b_tune8e-7__allenai_open_instruct_dev__8__1733807565" \
+    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0
 """
```


