32 b #121
Draft
dirkgr wants to merge 124 commits into main from 32B
Commits
b94e702 Save more often (dirkgr)
368abb8 Don't check for cancelation all the time (dirkgr)
c277d54 Make sure we use the same CE loss that we used for the 13B (dirkgr)
7c74d8b We're going to 5T! (dirkgr)
53d61fe We can live with a bigger eval batch size. (dirkgr)
514abb8 Add MMLU downstream eval (dirkgr)
011113e Module isn't callable (dirkgr)
2577397 Qwen-ish (dirkgr)
93637a1 Make model bigger (dirkgr)
784377d It's now a 32B. (dirkgr)
eec7e10 6T tokens (dirkgr)
bd5edee Official save folder (dirkgr)
f516f09 6.5T tokens (dirkgr)
49264f5 Merge remote-tracking branch 'origin/main' into 32B (dirkgr)
4bb5d5c Merged (dirkgr)
1ff1371 Change project name and location (dirkgr)
4375612 Revert "Merged" (dirkgr)
20b9b08 Revert "Module isn't callable" (dirkgr)
7736198 Revert "Make sure we use the same CE loss that we used for the 13B" (dirkgr)
8e0613f We still want it fused! (dirkgr)
5652953 One-in-two activation checkpointing (dirkgr)
323c786 Merge remote-tracking branch 'origin/main' into 32B (dirkgr)
4f676e2 Smaller microbatch (dirkgr)
d4e63fa Wrap 3 in 4 blocks (dirkgr)
7c22386 Don't compile the loss. (dirkgr)
f38bff4 Turn off broken eval (dirkgr)
3bf2440 Go back to mbsz of 4 (dirkgr)
ab5afcf Set drop_last for DownstreamEvaluator to False (2015aroras)
47f9545 Bring back Copa now that we have Shane's fix (dirkgr)
ee6aa90 Merge remote-tracking branch 'origin/32B' into 32B (dirkgr)
c656a41 Check if beaker loading issues are due to beaker changes by updating … (2015aroras)
7852e1e Try hsdp with 2 nodes per replica (2015aroras)
b19e76d Revert "Try hsdp with 2 nodes per replica" (2015aroras)
a02dd95 Try activation checkpointing 3 in 4 (2015aroras)
6eaa5a3 Try activation checkpointing 3 in 4 + all feedforwards checkpointed (2015aroras)
b2a07de Decrease microbatch size (2015aroras)
9985d31 Try activation checkpointing on just feed forwards (2015aroras)
4cc6a62 Fix name (dirkgr)
1060499 Try to run with hybrid sharding. (dirkgr)
fb2a274 More batch (dirkgr)
1073613 Revert "More batch" (dirkgr)
c553b98 There is something wrong with how the `common` object is set up. (dirkgr)
e49d4b7 We need a less sharded checkpoint and I guess this is the only way we… (dirkgr)
9608482 Revert "We need a less sharded checkpoint and I guess this is the onl… (dirkgr)
4804004 Async checkpointer may have problems with large checkpoints? (dirkgr)
fd4edb8 For loading checkpoints, it seems we need a longer timeout (dirkgr)
1f79446 Revert "Async checkpointer may have problems with large checkpoints?" (dirkgr)
072c616 Flight to safety (dirkgr)
6ba3e23 Increase microbatch size up to 2 * 4096 (2015aroras)
07cc66c Watching the 32B in a notebook (dirkgr)
18e9a32 Merge branch '32B' of https://github.com/allenai/OLMo-core into 32B (dirkgr)
2150b36 Merge branch 'main' into 32B (2015aroras)
c8cf403 Enable HSDP with pre-downloading (2015aroras)
d9cb6cf Turn off hsdp (2015aroras)
5f2cf19 Revert "Turn off hsdp" (2015aroras)
19c8758 Add option to set thread_count (2015aroras)
9a12202 Run formatter (2015aroras)
d5e6e2b Limit thread count (2015aroras)
ea0acce Decrease microbatch size (2015aroras)
d2a00a7 Increase microbatch size, increase activation checkpointing (2015aroras)
016e426 Decrease microbatch size (2015aroras)
a28ca37 Decrease thread_count (2015aroras)
1c33794 Thread count 1 (2015aroras)
484d01c Back to FSDP (2015aroras)
275364c Back to HSDP, but with less replicas (2015aroras)
54d5623 Merge branch 'main' into 32B (2015aroras)
4644e6e Microbatch size back to 1 (2015aroras)
d7ed30e Revert "Microbatch size back to 1" (2015aroras)
0c47992 Back to FSDP (2015aroras)
246eff6 Revert "Back to FSDP" (2015aroras)
b956e3f Enable NCCL debug (2015aroras)
f877907 More debug info (2015aroras)
58bef95 Merge branch 'main' into 32B (2015aroras)
c84708f Disable pre_download, set higher thread count (2015aroras)
56c4ab3 FSDP with AC of selected ops (2015aroras)
b5f3a86 Back to AC of just feedforward layers (2015aroras)
3fbdeb0 Add new inloop evals (2015aroras)
b335cdf Turn off NCCL debug (2015aroras)
30f8f59 Merge branch 'main' into 32B (2015aroras)
e17e4b8 Make checkpoint writing respect thread count config (2015aroras)
ba49cc4 Add skip step optimizer changes (2015aroras)
25ede33 Update 32B config with skip step adamw (2015aroras)
ac01e83 Try fix skip step optimizer (2015aroras)
ddd61ac Try manual _std_mean impl (2015aroras)
973a26c Add skip step fixes (2015aroras)
baf5700 Have separate save and load thread counts (2015aroras)
b6762d8 Decrease threads used for saving (2015aroras)
d98f06d Skipped steps and automatic spike analysis (dirkgr)
4a68e9e Use compile=True for optimizer (2015aroras)
d81cd12 Make gcs upload pass generation (2015aroras)
0a04034 Update CHANGELOG (2015aroras)
5acc7eb Run formatter (2015aroras)
213b03e Make generation 0 when object does not exist (2015aroras)
b4994b0 Merge branch 'shanea/fix-upload-retries' into 32B (2015aroras)
3b84351 Run formatting (2015aroras)
178d9ad Remove unneeded import (2015aroras)
0b737aa Add missing reload (2015aroras)
3e6f9f1 Updated notebook (dirkgr)
663d63a Updated dashboard (dirkgr)
496919b Update the notebook (dirkgr)
a1854bd Updated notebook (dirkgr)
f2de5f4 Retry on bad request (dirkgr)
33c0f58 Add some more retries (dirkgr)
86afc43 Updated the notebook (dirkgr)
2e45a79 Update the dashboard (dirkgr)
e4e8fbb Fix the way we use the step in the optimizer (dirkgr)
146caaf Dashboard update (dirkgr)
393a462 Update dashboard (dirkgr)
d39c59d New report (dirkgr)
16983c4 Dashboard update (dirkgr)
5e4d04f No more ephemeral checkpoints (dirkgr)
eba0418 Don't eval so much (dirkgr)
5605001 When you wait on someone, you bring them water. (dirkgr)
7ce7efa Updating the dashboard (dirkgr)
05aa94f Reorder ranks in GCP (dirkgr)
9c86bf9 Rank 0 needs to remain rank 0 (dirkgr)
e27b91d Slightly less checkpointing (dirkgr)
52b9b77 Revert "Slightly less checkpointing" (dirkgr)
f045eee Turn off failure propagation to make slack notifier work better (2015aroras)
ddb3084 New dashboard (dirkgr)
72e0ed1 Merge branch '32B' of https://github.com/allenai/OLMo-core into 32B (dirkgr)
d1d8dcb hopefully make GCS client calls more robust (epwalsh)
a0700e8 Catch user exceptions as well as system exceptions when training fails (2015aroras)
0595cf8 Revert "Catch user exceptions as well as system exceptions when train… (2015aroras)
This general approach sort of blows up our retry time from 10 mins to 30 mins. Sort of not a fan.
But at least it looks like it works.
We could always reduce the deadline/timeout.
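For what it's worth, here is one way to read the 10 mins vs. 30 mins numbers. This is a guess at the mechanism, not something the diff confirms: assume an inner call that keeps retrying until a 10-minute deadline, wrapped in an outer loop of 3 attempts. The function name and parameters below are hypothetical and only do the arithmetic; they are not part of OLMo-core or any GCS client.

```python
def worst_case_retry_seconds(
    inner_deadline_s: float,
    outer_attempts: int,
    outer_backoff_s: float = 0.0,
) -> float:
    """Upper bound on wall-clock time when every attempt runs to its deadline.

    Hypothetical model: each outer attempt can burn the full inner deadline,
    and we sleep `outer_backoff_s` between consecutive outer attempts.
    """
    if outer_attempts < 1:
        raise ValueError("outer_attempts must be at least 1")
    return outer_attempts * inner_deadline_s + (outer_attempts - 1) * outer_backoff_s


# Assumed numbers: a 10-minute inner deadline retried 3 times by the wrapper.
current = worst_case_retry_seconds(inner_deadline_s=600, outer_attempts=3)

# Cutting the inner deadline to a third brings the bound back to 10 minutes.
reduced = worst_case_retry_seconds(inner_deadline_s=200, outer_attempts=3)

print(current / 60, reduced / 60)
```

Under those assumptions the worst case scales linearly with the inner deadline, so reducing the deadline/timeout is the cheapest knob if the total needs to stay near 10 minutes.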