diff --git a/README.md b/README.md
index 3e63e51dc..aa08b5b4c 100644
--- a/README.md
+++ b/README.md
@@ -4,6 +4,17 @@

OLMo: Open Language Model

+
+<!-- Badges: GitHub License, GitHub release -->
+
+OLMo is a repository for training and using state-of-the-art open language models.
+It is built by scientists, for scientists.

## Installation

@@ -23,6 +34,16 @@ Otherwise you can install the model code by itself directly from PyPI with:
pip install ai2-olmo
```

+## Models overview
+
+The core models in the OLMo family released so far are (all trained on the [Dolma dataset](https://huggingface.co/datasets/allenai/dolma)):
+
+| Model | Training Tokens | Context Length |
+|-------|-----------------|----------------|
+| [OLMo 1B](https://huggingface.co/allenai/OLMo-1B) | 3 Trillion | 2048 |
+| [OLMo 7B](https://huggingface.co/allenai/OLMo-7B) | 2.5 Trillion | 2048 |
+| [OLMo 7B Twin 2T](https://huggingface.co/allenai/OLMo-7B-Twin-2T) | 2 Trillion | 2048 |
+
+
## Fine-tuning

To fine-tune an OLMo model using our trainer, you'll first need to prepare your dataset by tokenizing it and saving the token IDs to a flat numpy memory-mapped array. See [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) for an example with the Tulu V2 dataset, which can be easily modified for other datasets.
@@ -46,3 +67,8 @@ torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
```

Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update those fields in your config.
+
+
+## Evaluation
+
+Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/ai2-olmo-eval) repo.
\ No newline at end of file
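
The Fine-tuning section added above describes preparing data by tokenizing it and saving the token IDs to a flat numpy memory-mapped array. A minimal sketch of what that format can look like follows; it is not the logic of `scripts/prepare_tulu_data.py`, and the GPT-2 tokenizer, `uint16` dtype, and `input_ids.npy` file name are illustrative assumptions.

```python
import numpy as np
from transformers import AutoTokenizer

# GPT-2's tokenizer is used purely as a stand-in here; swap in the tokenizer
# your OLMo training config actually expects.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
documents = ["Example document one.", "Example document two."]  # stand-in for a real dataset

# Tokenize every document and concatenate all token IDs into one flat list.
all_ids = []
for doc in documents:
    all_ids.extend(tokenizer(doc)["input_ids"])
    if tokenizer.eos_token_id is not None:
        all_ids.append(tokenizer.eos_token_id)  # mark document boundaries with EOS

# Write the IDs into a flat memory-mapped array that a trainer can index lazily.
mmap = np.memmap("input_ids.npy", dtype=np.uint16, mode="w+", shape=(len(all_ids),))
mmap[:] = np.asarray(all_ids, dtype=np.uint16)
mmap.flush()
```

Because the file is memory-mapped, the trainer can read slices of token IDs on demand rather than loading the whole dataset into RAM.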