From f95766bd1f22f6fa14fc12f0cb1ee4cd29bb54ea Mon Sep 17 00:00:00 2001
From: epwalsh
Date: Mon, 5 Feb 2024 14:57:21 -0800
Subject: [PATCH] Add instructions for reproducing runs

---
 README.md | 129 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 98 insertions(+), 31 deletions(-)

diff --git a/README.md b/README.md
index 290174117..d3e7be941 100644
--- a/README.md
+++ b/README.md
@@ -41,37 +41,13 @@ pip install ai2-olmo

## Models overview

The core models in the OLMo family released so far are (all trained on the [Dolma dataset](https://huggingface.co/datasets/allenai/dolma)):

-| Model | Training Tokens | Context Length | Training Config | W&B Logs |
-|-------|-----------------|:--------------:|-----------------|----------|
-| [OLMo 1B](https://huggingface.co/allenai/OLMo-1B) | 3 Trillion | 2048 | [configs/official/OLMo-1B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-1B.yaml) | |
-| [OLMo 7B](https://huggingface.co/allenai/OLMo-7B) | 2.5 Trillion | 2048 | [configs/official/OLMo-7B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-7B.yaml) | [wandb.ai/ai2-llm/OLMo-7B](https://wandb.ai/ai2-llm/OLMo-7B/reports/OLMo-7B--Vmlldzo2NzQyMzk5) |
-| [OLMo 7B Twin 2T](https://huggingface.co/allenai/OLMo-7B-Twin-2T) | 2 Trillion | 2048 | [configs/official/OLMo-7B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-7B.yaml) | |
-
-
-## Fine-tuning
-
-To fine-tune an OLMo model using our trainer you'll first need to prepare your dataset by tokenizing it and saving the tokens IDs to a flat numpy memory-mapped array. See [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) for an example with the Tulu V2 dataset, which can be easily modified for other datasets.
-
-Next, prepare your training config. There are many examples in the [`configs/`](./configs) directory that you can use as a starting point. The most important thing is to make sure the model parameters (the `model` field in the config) match up with the checkpoint you're starting from. To be safe you can always start from the config that comes with the model checkpoint. At a minimum you'll need to make the following changes to the config or provide the corresponding overrides from the command line:
-
-- Update `load_path` to point to the checkpoint you want to start from.
-- Set `reset_trainer_state` to `true`.
-- Update `data.paths` to point to the `token_ids.npy` file you generated.
-- Optionally update `data.label_mask_paths` to point to the `label_mask.npy` file you generated, unless you don't need special masking for the loss.
-- Update `evaluators` to add/remove in-loop evaluations.
-
-Once you're satisfied with your training config, you can launch the training job via `torchrun`. For example:
-
-```
-torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
-  --data.paths=[{path_to_data}/input_ids.npy] \
-  --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
-  --load_path={path_to_checkpoint} \
-  --reset_trainer_state
-```
-
-Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update those fields in your config.
+| Model | Training Tokens | Context Length | Training Config | W&B Logs | Data Order File(s) ☨ |
+|-------|-----------------|:--------------:|-----------------|----------|--------------------|
+| [OLMo 1B](https://huggingface.co/allenai/OLMo-1B) | 3 Trillion | 2048 | [configs/official/OLMo-1B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-1B.yaml) | | [Epoch 1](https://olmo-checkpoints.org/ai2-llm/olmo-small/46zc5fly/train_data/global_indices.npy) |
+| [OLMo 7B](https://huggingface.co/allenai/OLMo-7B) | 2.5 Trillion | 2048 | [configs/official/OLMo-7B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-7B.yaml) | [wandb.ai/ai2-llm/OLMo-7B](https://wandb.ai/ai2-llm/OLMo-7B/reports/OLMo-7B--Vmlldzo2NzQyMzk5) | [Epoch 1](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy), [Epoch 2](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wd2gxrza/train_data/global_indices.npy) |
+| [OLMo 7B Twin 2T](https://huggingface.co/allenai/OLMo-7B-Twin-2T) | 2 Trillion | 2048 | [configs/official/OLMo-7B.yaml](https://github.com/allenai/OLMo/blob/main/configs/official/OLMo-7B.yaml) | | [Epoch 1](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy) |
+
+> ☨ *See [Inspecting training data](#inspecting-training-data) below for usage.*

## Inference

@@ -99,7 +75,6 @@ olmo_pipe = pipeline("text-generation", model="allenai/OLMo-7B")
print(olmo_pipe("Language modeling is"))
```

-
### Inference on finetuned checkpoints

If you finetune the model using the code above, you can use the conversion script to convert a native OLMo checkpoint to a HuggingFace-compatible checkpoint

@@ -116,6 +91,98 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-7B", torch_dtype=torch.
The quantized model is more sensitive to typing / cuda, so it is recommended to pass the inputs as inputs.input_ids.to('cuda') to avoid potential issues.

+## Reproducibility
+
+### Training
+
+The configs used to train the official OLMo models are provided in the [`configs/official/`](https://github.com/allenai/OLMo/blob/main/configs/official) directory.
+
+Note that while the training and validation data is public and free to download, the paths to the data within those configs point to a Cloudflare R2 bucket, which requires an API key for programmatic access.
+So in order to use any of these configs to reproduce a training run, you'll first have to download the corresponding data to a location of your choosing and then update the paths in the config accordingly.
+
+You can derive the public HTTP URL from an R2 URL by replacing `r2://olmo-data` with `https://olmo-data.org`.
+For example, if the R2 data URL is:
+
+`r2://olmo-data/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy`
+
+then the corresponding public URL is:
+
+`https://olmo-data.org/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy`
+
+Once you've updated the data paths in the config, you can launch a training run via `torchrun`. For example, to launch the 1B model training on a single 8x GPU node, you would run:
+
+```bash
+torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml
+```
+
+You can use the same method to launch multi-node jobs as well. See [the documentation](https://pytorch.org/docs/stable/elastic/run.html) for `torchrun` to understand the additional arguments you'll need to configure the rendezvous backend / endpoint.
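As a rough illustration of such a multi-node launch (the node count, rendezvous ID, and the `MASTER_ADDR`/`MASTER_PORT` placeholders below are assumptions you would replace with values for your own cluster, not values from this repository):

```bash
# Hypothetical two-node launch of the 1B config: run the same command on both nodes.
# MASTER_ADDR/MASTER_PORT must name a host and free port reachable from every node.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_id=olmo-1b-repro \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  scripts/train.py configs/official/OLMo-1B.yaml
```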
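To make the data-download step above concrete, here is a purely illustrative sketch of fetching a single shard; the local destination directory is a placeholder, and the URL is just the public example shown earlier:

```bash
# Illustrative only: download one public training-data shard to a local directory
# (the destination path is a placeholder, use whatever location you prefer).
mkdir -p /data/olmo-mix/v1_5/gpt-neox-20b-pii-special
wget -P /data/olmo-mix/v1_5/gpt-neox-20b-pii-special \
  https://olmo-data.org/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy

# The matching entry under `data.paths` in your copy of the config would then become:
#   /data/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy
```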
+
+### Inspecting training data
+
+You may be interested in inspecting the exact tokens that composed a particular batch during the training of one of the OLMo models.
+We provide tools to do this, but first you'll need to download the data as described above (unless you have an R2 API key) and update the corresponding config accordingly.
+
+Then take note of the URL of the data order file you want, which can be found in the [Models overview](#models-overview) table. For example, the data order file for the first epoch of the OLMo-7B model is [https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy).
+
+Once you have that, you can use this snippet to inspect the data within a particular batch:
+
+```python
+import numpy as np
+from cached_path import cached_path
+
+from olmo.config import TrainConfig
+from olmo.data import build_memmap_dataset
+
+# Update these paths to what you want:
+data_order_file_path = cached_path("https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy")
+train_config_path = "configs/official/OLMo-7B.yaml"
+
+cfg = TrainConfig.load(train_config_path)
+dataset = build_memmap_dataset(cfg, cfg.data)
+batch_size = cfg.global_train_batch_size
+global_indices = np.memmap(data_order_file_path, mode="r", dtype=np.uint32)
+
+
+def get_batch_instances(batch_idx: int) -> list[list[int]]:
+    batch_start = batch_idx * batch_size
+    batch_end = (batch_idx + 1) * batch_size
+    batch_indices = global_indices[batch_start:batch_end]
+    batch_instances = []
+    for index in batch_indices:
+        token_ids = dataset[index]["input_ids"].tolist()
+        batch_instances.append(token_ids)
+    return batch_instances
+
+
+# Get all token IDs in the first batch (batch_size instances of 2048 tokens each).
+get_batch_instances(0)
+```
+
+
+## Fine-tuning
+
+To fine-tune an OLMo model using our trainer you'll first need to prepare your dataset by tokenizing it and saving the token IDs to a flat numpy memory-mapped array. See [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) for an example with the Tulu V2 dataset, which can be easily modified for other datasets; a minimal sketch of the expected on-disk format follows the list below.
+
+Next, prepare your training config. There are many examples in the [`configs/`](https://github.com/allenai/OLMo/blob/main/configs) directory that you can use as a starting point. The most important thing is to make sure the model parameters (the `model` field in the config) match up with the checkpoint you're starting from. To be safe you can always start from the config that comes with the model checkpoint. At a minimum you'll need to make the following changes to the config or provide the corresponding overrides from the command line:
+
+- Update `load_path` to point to the checkpoint you want to start from.
+- Set `reset_trainer_state` to `true`.
+- Update `data.paths` to point to the `input_ids.npy` file you generated.
+- Optionally update `data.label_mask_paths` to point to the `label_mask.npy` file you generated, if you need special masking for the loss.
+- Update `evaluators` to add/remove in-loop evaluations.
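As a purely illustrative sketch of the data-preparation step (the file names, dummy instances, and the `uint16`/`bool` dtypes here are assumptions; treat [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) as the authoritative reference for the exact format the trainer expects), writing the flat memory-mapped arrays might look like this:

```python
import numpy as np

# Purely illustrative: substitute your own tokenized instances, each padded/truncated to
# the training sequence length from your config (2048 for the official OLMo models).
sequence_length = 2048
instances = [
    {"input_ids": [0] * sequence_length, "label_mask": [True] * sequence_length},
    # ... one entry per training example ...
]

total_tokens = sequence_length * len(instances)

# Token IDs go in one flat memory-mapped array, the optional loss mask in another.
# The dtypes here (uint16 token IDs, boolean mask) are assumptions -- confirm against
# scripts/prepare_tulu_data.py and your config before training on the result.
input_ids_file = np.memmap("input_ids.npy", dtype=np.uint16, mode="w+", shape=(total_tokens,))
label_mask_file = np.memmap("label_mask.npy", dtype=np.bool_, mode="w+", shape=(total_tokens,))

offset = 0
for instance in instances:
    end = offset + sequence_length
    input_ids_file[offset:end] = instance["input_ids"]
    label_mask_file[offset:end] = instance["label_mask"]
    offset = end

input_ids_file.flush()
label_mask_file.flush()
```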
+
+Once you're satisfied with your training config, you can launch the training job via `torchrun`. For example:
+
+```bash
+torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
+  --data.paths=[{path_to_data}/input_ids.npy] \
+  --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
+  --load_path={path_to_checkpoint} \
+  --reset_trainer_state
+```
+
+Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update those fields in your config.

## Evaluation