Releases: allenai/OLMo-core
v1.8.0
What's new
Added 🎉
- Added support for tensor parallelism. See the `TransformerConfig` class for usage.
- Added more downstream tasks from the model ladder.
- Added `io.copy_dir()` function.
- Added new LR schedulers: `LinearWithWarmup`, `InvSqrtWithWarmup`, `ConstantWithWarmup`, `SequentialScheduler` (see the sketch after this list).
- Added option to pre-download checkpoint files from remote storage before trying to load a checkpoint.
- Added a callback for sending Slack notifications.
- Made the MPS device work on Apple Silicon.
- Added `SkipStepAdamW` optimizer.
- The trainer can now load model-only checkpoints.
- Added the option to throttle checkpoint uploads to one rank from each node at a time.
- Added support for logging rich `Table` objects as text in source mixture datasets.
- Added `unshard_strategy` parameter to the `unshard_checkpoint()` function in `olmo_core.distributed.checkpoint`.
- Added `load_keys()` function to `olmo_core.distributed.checkpoint`.
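The notes above only name the new schedulers, so here is a minimal, self-contained sketch of the learning-rate shapes their names suggest. It does not use OLMo-core's scheduler API; the function and its parameters are illustrative assumptions only.

```python
import math

def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               kind: str = "linear", max_steps: int = 100_000) -> float:
    """Illustration of the LR curves implied by the new scheduler names;
    this is NOT OLMo-core's implementation."""
    if step < warmup_steps:
        # every *WithWarmup schedule ramps the LR up from ~0 during warmup
        return base_lr * (step + 1) / warmup_steps
    if kind == "constant":   # ConstantWithWarmup: hold base_lr after warmup
        return base_lr
    if kind == "inv_sqrt":   # InvSqrtWithWarmup: decay roughly as 1/sqrt(step)
        return base_lr * math.sqrt(warmup_steps / max(step, 1))
    # LinearWithWarmup: decay linearly toward 0 by max_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))
```

A `SequentialScheduler` presumably chains schedules like these, handing off from one to the next at configured step boundaries.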
Changed ⚠️
- Changed storage of shared shard state in sharded checkpoints from smallest shard to lowest rank (normally 0).
- Changed how the trainer handles loading a checkpoint when `load_path` is provided. Now `load_path` is only used if no checkpoint is found in the `save_folder` (see the sketch below).
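To make the new precedence concrete, here is a plain-Python illustration; it is not the trainer's actual resolution code, and the `step*` directory layout is an assumption.

```python
from pathlib import Path

def resolve_checkpoint(save_folder: str, load_path: str | None) -> str | None:
    """Illustration of the precedence described above, not the trainer's own logic."""
    existing = sorted(Path(save_folder).glob("step*"))  # assumed checkpoint layout
    if existing:
        return str(existing[-1])  # a checkpoint in save_folder always wins
    return load_path              # otherwise fall back to load_path (may be None)
```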
Fixed ✅
- Added missing `weights_only=False` argument to fix loading train checkpoints with newer versions of PyTorch.
- Fixed bug where GCS upload does not retry on transient failures.
- Fixed bug where source mixture datasets were truncating source files instead of randomly sampling.
- Fixed bug in source mixture datasets where sampling from small npy files raised an mmap exception due to 0 instances in the sampled index.
Commits
7899e7c (chore) prepare for release v1.8.0
907b9c5 Send Slack notification on releases (#151)
1ef7851 fix `get_mock_batch()` when training on MPS again
29a468d Fix mixture dataset class (#147)
98ccb67 remove ganymede cluster
205fe90 remove deleted cluster
7ec9114 always make mock batch on CPU
7122b1d save max steps to trainer state (#143)
9a78829 Log elapsed time per eval (#149)
075a36a Make training on the MPS device work (#131)
b4a195b Add more options to the `unshard_checkpoint` function to help scale (#145)
16885ab fix merge list with prefix
7b755c9 minor logging improvement
212108f Add option to throttle checkpoint uploads to one rank from each node at a time (#142)
7633461 pull fixes from 32B branch (#139)
48abe8c checkpoint hot fix (#140)
0c096e2 Handle model-only checkpoints with the trainer
9818232 move release scripts to subfolder (#137)
05ab673 update cluster list (#136)
7ccf726 add pr comments on release
0ff19d7 update citation
7519e0a Change the way `load_path` is handled (#132)
03a597a limit the number of exception lines posted to Slack
c634066 include link to Beaker job with Slack noties
3505660 Make context manager set original state correctly (#126)
9e0992b Add a callback for sending Slack notifications (#125)
6d60464 fix
ee27348 Sync eval changes in OLMo/ladder-1xC to here (#122)
0789479 Add option to pre-download checkpoint to load (#123)
1380f0e add `copy_dir()` io function
5cc704f Add learning rate schedulers (#119)
de5be27 don't check for beaker-py upgrades
b0103f0 Fix loading train state for newer versions of torch
5de774f updates
8474ee8 update docker image tags
d3f6f01 Update PyTorch and other deps in Docker images, change naming scheme of images (#120)
10c4978 Publish Docker images to GHCR (#118)
d6981b3 Add support for tensor parallelism and add OLMo2-26B model config / train script (#117)
aa4d188 Update table formatting
v1.7.0
What's new
Added 🎉
- Added `key_mapping` argument to `olmo_core.distributed.checkpoint.load_model_and_optim_state()` for loading checkpoints with different key names.
- Added `load_key_mapping` field to the trainer, same idea as the new `key_mapping` argument above.
- Added an implementation of nGPT called `NormalizedTransformer`.
- Added an example showing how to convert a HuggingFace Llama 3.2 checkpoint into the right format for OLMo-core.
- Added an API for scaling RoPE embeddings.
- Added a `ModelLadder` API.
Changed ⚠️
- The `w_out` and `norm` top-level children of the `Transformer` model are now wrapped together in an `lm_head` module. Training scripts will have backwards compatibility with older checkpoints due to the `load_key_mapping` explained above (see the sketch below).
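For readers wiring this up by hand, a mapping along the following lines captures the idea; the exact key strings are assumptions for illustration, not the mapping OLMo-core ships.

```python
# Hypothetical key mapping from the old checkpoint layout (top-level `w_out` / `norm`)
# to the new `lm_head` module; the exact strings here are illustrative.
load_key_mapping = {
    "norm.weight": "lm_head.norm.weight",
    "w_out.weight": "lm_head.w_out.weight",
}

# Per the notes above, a mapping like this can be passed as the `key_mapping` argument
# to `load_model_and_optim_state()` or set as the trainer's `load_key_mapping` field.
```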
Fixed ✅
- (Optimization) Mark model input sizes as dynamic for `torch.compile()` to avoid recompile during evals or variable-sequence / batch size training. This doesn't seem to hurt throughput.
- Made HTTPS and GCS IO functions more robust.
- Fixed a bug where we were always getting dolma2 tokenized validation data when generating config with `DataMix.v3_small_ppl_validation`.
Commits
62d2c9e (chore) prepare for release v1.7.0
cb77039 mark model ladder as a beta feature
08c8073 Adapt conversion script to work with OLMo2 models (#116)
8e716b5 Add model ladder building blocks (#114)
1647f78 Add some more tests for nGPT (#113)
37e0e88 improve docs
d68d47a Make nn configs more flexible (#112)
0bcc840 RoPE scaling, document how to convert HuggingFace checkpoints (#111)
7655a3b Add template variable to ppl validation file manifest (#110)
ca44cf4 Implement nGPT (#108)
c47df7c make IO functions more robust (#109)
4f2c8ef Update README.md
57b38ad Mark model input as dynamically sized (#105)
776e235 remove duplicate script
v1.6.3
What's new
Added 🎉
- Added `olmo_core.distributed.checkpoint.get_checkpoint_metadata()` function (see the sketch after this list).
- (BETA) Added flag to compile the optimizer step. So far only tested with AdamW. May not work with other optimizers.
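A minimal usage sketch, assuming the function takes a checkpoint directory and returns an inspectable metadata object (neither detail is confirmed by these notes):

```python
# Sketch only: the function name comes from the notes above; the argument it takes
# and the shape of the returned metadata are assumptions.
from olmo_core.distributed.checkpoint import get_checkpoint_metadata

metadata = get_checkpoint_metadata("/data/runs/run-01/step10000/model_and_optim")
print(metadata)  # e.g. inspect which keys a sharded checkpoint holds before loading it
```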
Fixed ✅
- Old ephemeral checkpoints won't be removed until after the latest ephemeral checkpoint is saved successfully.
- Made GCS uploads more robust.
- Fixed single-node training on Google Augusta cluster.
- `numpy.random.dirichlet()` does not always sum to 1.0, so allow for a small tolerance in validating domain weights.
Commits
9c52bea (chore) prepare for release v1.6.3
ad5e9e5 Upgrade flash-attn to v2.7.0 (#104)
b9e9193 [beta] Enable compiling optimizer step (tested with AdamW) (#103)
fdbb76e Use allclose for comparing sum of small numbers (#102)
3284742 make GCS uploads more robust (#101)
63b3f43 Update isort requirement from <5.13,>=5.12 to >=5.12,<5.14 (#93)
dcbd988 update docs and theme version
6615ba9 Bump actions/download-artifact from 3 to 4 (#100)
2e2b35b Add function to get checkpoint metadata
c0e47cc clean up Dockerfile (#99)
6300bc7 replace printing table with logging table (#98)
e522886 Don't prematurely delete old ephemeral checkpoints (#97)
dea10fd Bump actions/upload-artifact from 3 to 4 (#90)
c2fe2db skip another test when creds missing
3ea9fa2 Bump softprops/action-gh-release from 1 to 2 (#87)
5a5c17f Bump actions/checkout from 3 to 4 (#91)
9c99b9c skip some tests when missing relevant credentials (#96)
53efa8c Bump actions/setup-python from 4 to 5 (#88)
d548d3b Bump actions/cache from 3 to 4 (#86)
ab80395 add depandabot config
v1.6.2
What's new
Added 🎉
- Added option to disable `GarbageCollectorCallback`, not that you'd want to do this usually, but I needed to run an experiment to show how important that callback is.
Fixed ✅
- Fixed a bug where some default callbacks could be added twice if given a different name by the user.
- Fixed a bug where some `Trainer` bookkeeping tasks may not complete before `.fit()` returns.
Commits
2384472 (chore) prepare for release v1.6.2
f721fa1 Ensure all bookkeeping tasks complete (#85)
26a2c63 Some callback improvements (#84)
v1.6.1
What's new
Added 🎉
- Added `retries` field to `BeakerLaunchConfig`.
- Allow running on Augusta cluster with existing train scripts.
- Added `olmo_core.utils.logging_configured()` function to check if logging has been configured (see the sketch after this list).
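A small sketch of the intended "configure logging once" pattern; only `logging_configured()` is named in the notes, and the setup call used here is a stand-in.

```python
import logging

from olmo_core.utils import logging_configured

if not logging_configured():
    # stand-in for whatever logging setup your entrypoint normally does
    logging.basicConfig(level=logging.INFO)

logging.getLogger(__name__).info("logging is configured either way")
```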
Fixed ✅
- Fixed a potential distributed deadlock bug when training without a separate CPU-only bookkeeping backend.
- Removed some unnecessary host-device syncs in `olmo_core.distributed.utils`.
- Added `Trainer(Config).async_bookkeeping` field to toggle async bookkeeping.
Commits
cae88f5 (chore) prepare for release v1.6.1
83db5f7 Some fixes/improvements around synchronous bookkeeping operations (#83)
c435c94 increase timeout for CI checks
4a56200 update cluster list (#82)
e27ba74 Update throughput numbers, add `logging_configured()` util function (#81)
bec0a3c Allow running on Augusta cluster (#80)
c7c3a5a Set env vars for Augusta cluster
b9351e2 Add `retries` field to `BeakerLaunchConfig` (#79)
v1.6.0
What's new
Added 🎉
- Added option to compile the trainer's loss function (`Trainer.compile_loss`).
- Added `SourceMixtureDataset` for composing a training mixture based on ratios of source datasets (see the sketch after this list).
- Added `NumpyFSLDatasetMixture` for constructing a `NumpyDatasetBase` from a `SourceMixtureDataset`. Note this is only supported for FSL datasets.
- Added tests for `SourceMixture*` and `NumpyFSLDatasetMixture`.
- Added `DownstreamEvaluatorCallbackConfig` class for running in-loop downstream eval via OLMo-in-loop-evals.
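The notes don't spell out the mixing semantics, so the snippet below only illustrates the arithmetic behind composing a mixture from per-source ratios and a fixed token budget; it does not use the `SourceMixtureDataset` API itself.

```python
import math

# Conceptual illustration only -- not the SourceMixtureDataset API.
# Given a target ratio per source and an overall token budget, work out how many
# tokens to draw from each source.
target_ratios = {"web": 0.7, "code": 0.2, "wiki": 0.1}
token_budget = 1_000_000_000

assert math.isclose(sum(target_ratios.values()), 1.0)
tokens_per_source = {name: round(r * token_budget) for name, r in target_ratios.items()}
# -> {'web': 700000000, 'code': 200000000, 'wiki': 100000000}
```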
Changed ⚠️
- Moved some types into `olmo_core.data.types` to avoid some circular dependencies.
Fixed ✅
- Made GCS client more robust by automatically retrying timeout errors for most operations.
Commits
29e1276 (chore) prepare for release v1.6.0
da39e97 Add note about optional dependencies
81b1249 Missed _bust_index_cache in one spot (#78)
00d34f6 Add option to compile loss function, move logits FP32 casting into loss function (#77)
4928f82 Adds mixing loader for FSL datasets (#70)
ecb0686 Allow stopping the experiment on keyboard int
41400c4 Add Llama 8B config (#76)
282c120 Update Docker build (#75)
55d261e Make GCS client more robust (#74)
3fe59b6 Add a callback for downstream evals, update Docker builds (#73)
ecd523e include release chore commit in release notes
v1.5.0
What's new
Added 🎉
- Added Google Cloud support for `list_directory()` and `clear_directory()`.
- Added `CometCallback` for logging training runs to Comet.ml.
- Added `DataMixBase` class, to allow extending to new data mix groups.
- Added support for MoE-based models.
- Added method `DataLoaderBase.get_mock_batch()`.
- Trainer now starts with a dry-run of a fake batch created by `DataLoaderBase.get_mock_batch()`.
- Added `Callback.pre_backward()`, `.pre_eval_batch()`, and `.post_eval_batch()` methods (see the sketch after this list).
- Added `Trainer.model_forward()`, `.get_losses()`, and `.eval_batch()` methods.
- Added a new `TransformerActivationCheckpointingMode`, "selected_ops" (requires torch 2.5 or newer).
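The new hooks suggest callbacks along the following lines; the hook names come from the notes above, but their signatures and the base-class import path are assumptions here.

```python
from olmo_core.train.callbacks import Callback  # import path assumed

class EvalBatchLogger(Callback):
    """Sketch only: hook names are from the release notes; signatures are assumptions."""

    def pre_backward(self, *args, **kwargs):
        # runs just before the backward pass on each training batch
        pass

    def pre_eval_batch(self, *args, **kwargs):
        # runs before each in-loop eval batch
        pass

    def post_eval_batch(self, *args, **kwargs):
        # runs after each in-loop eval batch
        pass
```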
Changed ⚠️
- `BeakerLaunchConfig.setup_steps` should now include steps to clone your repo (which it will by default). This change allows support for private repos (see the sketch below).
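For a private repo you would override the default with your own clone commands; the exact commands below (and the install step) are illustrative assumptions, not OLMo-core's defaults.

```python
# Illustrative only: per the note above, the default setup_steps already clone the repo.
# These commands and the install step are assumptions for the sake of the example.
setup_steps = [
    "git clone https://github.com/your-org/your-private-fork.git",
    "cd your-private-fork",
    "pip install -e .[all]",
]
# ... then pass `setup_steps=setup_steps` when building your `BeakerLaunchConfig`.
```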
Fixed ✅
- `prepare_cli_environment()` now calls `add_cached_path_clients()`.
- Removed an unnecessary host-device sync.
Commits
984eb26 Update README.md
0f0d282 Update README.md
310866e Add FP8 numbers for the 13B
425f7db Add "selected_ops" transformer AC mode (#71)
d90292e Move transformer config components to its own submodule
4d3b231 Add support for MoE models with megablocks (#60)
6e32043 Add Google Cloud support for more `io` functions (#69)
5af60ba Avoid an unnecessary host-device sync when created initial loss tensors (#68)
ad4c8bb Switch to comet callback in official train scripts
d90f5da Add comet API key to launch config
0c75ef6 Do a dry-run batch before starting training (#67)
71bc5c8 Add `save_state_dict` function
9c25aed Update the Comet.ml callback (#66)
6e4ee4e Add BaseDataMix class (#65)
54a74c3 Add a Comet.ml trainer callback (#64)
9ba0e63 Update base image with newer torch and flash-attn versions (#63)
97172fc avoid omegaconf interpolation in setup steps
48892ee include clone commands in setup steps (#62)
v1.4.0
What's new
Changed ⚠️
- Updated default layer norm epsilon for OLMo models from `1e-5` to `1e-6` to match latest model.
- Renamed `FSLDataLoader` to `NumpyFSLDataLoader`.
- Renamed `VSLDataLoader` to `NumpyVSLDataLoader`.
- The trainer now takes a `data_loader: DataLoaderBase` instead of a `dataset: NumpyDatasetBase` (see the migration sketch after this list).
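A rough before/after for the rename and the new trainer argument, written as comments because the surrounding constructor signatures are assumptions, not taken from these notes.

```python
# Migration sketch; everything beyond the names quoted in the list above is assumed.

# Before v1.4.0: the trainer consumed a dataset directly.
#   trainer = trainer_config.build(model, optim, dataset=dataset)

# From v1.4.0: build a data loader (note the Numpy* renames) and pass that instead.
#   data_loader = NumpyFSLDataLoader(dataset, ...)   # formerly FSLDataLoader
#   trainer = trainer_config.build(model, optim, data_loader=data_loader)
```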
Commits
55343dd fix loading training state dict
b921299 Allow unknown number of batches with data loaders
87f1e89 fix restarts for custom data loader
767c550 Add example of custom data loader
6237f7d Trainer now takes a data loader instead of a dataset (#59)
f6fc369 update default LN eps to match latest OLMo model (#58)
db522d1 allow loading via pickling
7d26589 make VSL curr config more flexible
v1.3.2
What's new
Added 🎉
- Added `Config.validate()`, `Config.replace()`, and `Config.apply()` methods (see the sketch after this list).
- Trainer now records sequence length as a metric.
Fixed ✅
- Ensure additional cached-path clients are added in the process pool workers from some dataset preparation methods.
- Fixed `label_mask` tensor created by `NumpyPaddedFSLDataset`.
- Removed redundant warning messages about CUDA alloc retries.
- Fixed non-deterministic deadlock bug with async checkpointing.
Commits
a0a680d keep old beaker images instead of deleting them
25a71f4 Minor training improvements (#57)
f32d0bf Improve formatting of throughput table in readme (#56)
f7f3709 Add some more `Config` methods
916ecf9 Fix label mask tensor created by `NumpyPaddedFSLDataset` (#55)
b29838a Ensure additional scheme clients are added in worker procs