Merge branch 'main' into olmo7-ablations

dirkgr authored Mar 21, 2024
2 parents 1fadaf8 + f8aef84 commit 51303ea
Showing 75 changed files with 105,416 additions and 723 deletions.
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## Unreleased

### Added

- Added support for Grouped Query Attention.
- Added commonsense_qa and social_iqa downstream evaluation tasks

### Changed

- Renamed `Olmo` to `OLMo` everywhere in the codebase
- Disabled automatic garbage collection during training; instead we run it manually at regular intervals so that ranks don't get out of sync with their own GC.

### Removed

- Removed `AMDLayerNorm`, since the original layer norm bug has been fixed and we don't need this workaround anymore.
- Removed `OLMoParallelBlock`.

### Fixed

- Don't log garbage on nodes that aren't rank 0
- Don't crash in the HF code when we are referring to a tokenizer in a local file

## [v0.2.5](https://github.com/allenai/OLMo/releases/tag/v0.2.5) - 2024-03-06

### Fixed

- Fixed the default value of the `--tokenizer` argument to `scripts/prepare_tulu_data.py` to be an absolute path instead of a relative path, so the script can be run from other directories.
@@ -15,14 +37,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added code to throw an error if `output_attentions` is set to `True` in a forward call to `OLMoForCausalLM`, since this functionality hasn't been implemented yet.
- Corrected the scheme displayed in error messages that come from R2
- Fixed running with multiple data loading workers on LUMI
- Minor bug fix: uninitialized prompts variable

### Added
- Added `output_hidden_states` argument and associated functionality to `OLMo` and `OLMoForCausalLM` to return model intermediate hidden states.
- Ability to read from R2 like we read from S3
- Added MMLU downstream evaluation tasks, with prompt variations.
- Added support for PyTorch v2.2.
- Added ability to show logs from all ranks
- Added option for QKV clipping.
- Added basic_arithmetic downstream evaluation task

### Changed

- Changed legacy checkpoint unsharding to use processes and shared memory instead of threads


## [v0.2.4](https://github.com/allenai/OLMo/releases/tag/v0.2.4) - 2024-02-02
2 changes: 1 addition & 1 deletion Makefile
@@ -29,7 +29,7 @@ base-image :
docker build -f docker/Dockerfile.base -t $(IMAGE_NAME_BASE)-base .

.PHONY : gantry-image
gantry-image : base-image
gantry-image :
docker build -f docker/Dockerfile.gantry -t $(IMAGE_NAME_BASE)-gantry .
beaker image create $(IMAGE_NAME_BASE)-gantry --name $(IMAGE_NAME_BASE)-gantry-tmp --workspace $(BEAKER_WORKSPACE)
beaker image delete $(GANTRY_IMAGE) || true
18 changes: 17 additions & 1 deletion README.md
@@ -38,7 +38,9 @@ Otherwise you can install the model code by itself directly from PyPI with:
pip install ai2-olmo
```

## Models overview
## Models

### Overview

The core models in the OLMo family released so far are (all trained on the [Dolma dataset](https://huggingface.co/datasets/allenai/dolma)):
| Model | Training Tokens | Context Length | Training Config | W&B Logs | Data Order File(s) ☨ |
@@ -49,6 +51,13 @@ The core models in the OLMo family released so far are (all trained on the [Dolm

> *See [Inspecting training data](#inspecting-training-data) below for usage.*
### Checkpoints

URLs to checkpoints at intermediate steps of each model's training run can be found in the CSV files under [`checkpoints/official/`](https://github.com/allenai/OLMo/blob/main/checkpoints/official). These 'directory' URLs cannot currently be accessed directly, but the files within each directory are publicly accessible. The URLs can also be passed to the training script to resume training from that checkpoint (see [Training](#training)). Each checkpoint directory consists of:

- `config.yaml`: the config at that training step.
- `model.pt`, `optim.pt`, `train.pt`: model, optimizer and training state at that training step.
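
While a checkpoint's directory URL cannot be listed directly, the individual files above can be downloaded one by one. A minimal sketch, assuming the step-1000 unsharded checkpoint URL used in the [Training](#training) section below (any directory URL from the CSVs works the same way):

```bash
# Download the contents of one (assumed) unsharded checkpoint directory.
CKPT_URL=https://olmo-checkpoints.org/ai2-llm/olmo-small/w1r5xfzt/step1000-unsharded
mkdir -p step1000-unsharded
for f in config.yaml model.pt optim.pt train.pt; do
  curl -fL "$CKPT_URL/$f" -o "step1000-unsharded/$f"
done
```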

## Inference

You can use our Hugging Face integration to run inference on OLMo checkpoints:
@@ -117,6 +126,13 @@ torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml

You can use the same method to launch multi-node jobs as well. See [the documentation](https://pytorch.org/docs/stable/elastic/run.html) for `torchrun` to understand the additional arguments you'll need to configure the rendezvous backend / endpoint.
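
As a sketch, a two-node launch with `torchrun`'s standard rendezvous flags might look like the following (the host name, port, and job id are placeholders, not values from this repository):

```bash
# Run this on each of the two nodes; node0.example.com:29400 is a placeholder rendezvous endpoint.
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_id=olmo-1b-run --rdzv_backend=c10d --rdzv_endpoint=node0.example.com:29400 \
  scripts/train.py configs/official/OLMo-1B.yaml
```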

To resume training from a checkpoint, you can pass its path (local or URL)
to `scripts/train.py` with the `--load_path` argument. For example, to resume training from step 1000 of the OLMo 1B run:

```bash
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path https://olmo-checkpoints.org/ai2-llm/olmo-small/w1r5xfzt/step1000-unsharded
```
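
A local checkpoint directory works the same way; for example, with a hypothetical local path:

```bash
# /data/olmo-checkpoints/step1000-unsharded is a hypothetical local checkpoint directory.
torchrun --nproc_per_node=8 scripts/train.py configs/official/OLMo-1B.yaml --load_path /data/olmo-checkpoints/step1000-unsharded
```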

### Inspecting training data

You may be interested in inspecting the exact tokens that composed a particular batch during the training of one of the OLMo models.