32 b #121

Draft. Wants to merge 124 commits into base: main

Commits
124 commits; changes shown are from 1 commit
b94e702
Save more often
dirkgr Dec 8, 2024
368abb8
Don't check for cancelation all the time
dirkgr Dec 8, 2024
c277d54
Make sure we use the same CE loss that we used for the 13B
dirkgr Dec 8, 2024
7c74d8b
We're going to 5T!
dirkgr Dec 8, 2024
53d61fe
We can live with a bigger eval batch size.
dirkgr Dec 8, 2024
514abb8
Add MMLU downstream eval
dirkgr Dec 9, 2024
011113e
Module isn't callable
dirkgr Dec 9, 2024
2577397
Qwen-ish
dirkgr Dec 9, 2024
93637a1
Make model bigger
dirkgr Dec 9, 2024
784377d
It's now a 32B.
dirkgr Dec 10, 2024
eec7e10
6T tokens
dirkgr Dec 10, 2024
bd5edee
Official save folder
dirkgr Dec 10, 2024
f516f09
6.5T tokens
dirkgr Dec 10, 2024
49264f5
Merge remote-tracking branch 'origin/main' into 32B
dirkgr Dec 10, 2024
4bb5d5c
Merged
dirkgr Dec 10, 2024
1ff1371
Change project name and location
dirkgr Dec 10, 2024
4375612
Revert "Merged"
dirkgr Dec 10, 2024
20b9b08
Revert "Module isn't callable"
dirkgr Dec 10, 2024
7736198
Revert "Make sure we use the same CE loss that we used for the 13B"
dirkgr Dec 10, 2024
8e0613f
We still want it fused!
dirkgr Dec 10, 2024
5652953
One-in-two activation checkpointing
dirkgr Dec 10, 2024
323c786
Merge remote-tracking branch 'origin/main' into 32B
dirkgr Dec 10, 2024
4f676e2
Smaller microbatch
dirkgr Dec 10, 2024
d4e63fa
Wrap 3 in 4 blocks
dirkgr Dec 10, 2024
7c22386
Don't compile the loss.
dirkgr Dec 10, 2024
f38bff4
Turn off broken eval
dirkgr Dec 11, 2024
3bf2440
Go back to mbsz of 4
dirkgr Dec 11, 2024
ab5afcf
Set drop_last for DownstreamEvaluator to False
2015aroras Dec 11, 2024
47f9545
Bring back Copa now that we have Shane's fix
dirkgr Dec 11, 2024
ee6aa90
Merge remote-tracking branch 'origin/32B' into 32B
dirkgr Dec 11, 2024
c656a41
Check if beaker loading issues are due to beaker changes by updating …
2015aroras Dec 11, 2024
7852e1e
Try hsdp with 2 nodes per replica
2015aroras Dec 11, 2024
b19e76d
Revert "Try hsdp with 2 nodes per replica"
2015aroras Dec 11, 2024
a02dd95
Try activation checkpointing 3 in 4
2015aroras Dec 12, 2024
6eaa5a3
Try activation checkpointing 3 in 4 + all feedforwards checkpointed
2015aroras Dec 12, 2024
b2a07de
Decrease microbatch size
2015aroras Dec 13, 2024
9985d31
Try activation checkpointing on just feed forwards
2015aroras Dec 13, 2024
4cc6a62
Fix name
dirkgr Dec 16, 2024
1060499
Try to run with hybrid sharding.
dirkgr Dec 16, 2024
fb2a274
More batch
dirkgr Dec 16, 2024
1073613
Revert "More batch"
dirkgr Dec 16, 2024
c553b98
There is something wrong with how the `common` object is set up.
dirkgr Dec 16, 2024
e49d4b7
We need a less sharded checkpoint and I guess this is the only way we…
dirkgr Dec 16, 2024
9608482
Revert "We need a less sharded checkpoint and I guess this is the onl…
dirkgr Dec 16, 2024
4804004
Async checkpointer may have problems with large checkpoints?
dirkgr Dec 16, 2024
fd4edb8
For loading checkpoints, it seems we need a longer timeout
dirkgr Dec 16, 2024
1f79446
Revert "Async checkpointer may have problems with large checkpoints?"
dirkgr Dec 16, 2024
072c616
Flight to safety
dirkgr Dec 16, 2024
6ba3e23
Increase microbatch size up to 2 * 4096
2015aroras Dec 17, 2024
07cc66c
Watching the 32B in a notebook
dirkgr Dec 18, 2024
18e9a32
Merge branch '32B' of https://github.com/allenai/OLMo-core into 32B
dirkgr Dec 18, 2024
2150b36
Merge branch 'main' into 32B
2015aroras Dec 19, 2024
c8cf403
Enable HSDP with pre-downloading
2015aroras Dec 19, 2024
d9cb6cf
Turn off hsdp
2015aroras Dec 19, 2024
5f2cf19
Revert "Turn off hsdp"
2015aroras Dec 19, 2024
19c8758
Add option to set thread_count
2015aroras Dec 19, 2024
9a12202
Run formatter
2015aroras Dec 19, 2024
d5e6e2b
Limit thread count
2015aroras Dec 19, 2024
ea0acce
Decrease microbatch size
2015aroras Dec 19, 2024
d2a00a7
Increase microbatch size, increase activation checkpointing
2015aroras Dec 19, 2024
016e426
Decrease microbatch size
2015aroras Dec 20, 2024
a28ca37
Decrease thread_count
2015aroras Dec 20, 2024
1c33794
Thread count 1
2015aroras Dec 20, 2024
484d01c
Back to FSDP
2015aroras Dec 20, 2024
275364c
Back to HSDP, but with less replicas
2015aroras Dec 20, 2024
54d5623
Merge branch 'main' into 32B
2015aroras Dec 20, 2024
4644e6e
Microbatch size back to 1
2015aroras Dec 20, 2024
d7ed30e
Revert "Microbatch size back to 1"
2015aroras Dec 20, 2024
0c47992
Back to FSDP
2015aroras Dec 20, 2024
246eff6
Revert "Back to FSDP"
2015aroras Dec 20, 2024
b956e3f
Enable NCCL debug
2015aroras Dec 20, 2024
f877907
More debug info
2015aroras Dec 20, 2024
58bef95
Merge branch 'main' into 32B
2015aroras Dec 20, 2024
c84708f
Disable pre_download, set higher thread count
2015aroras Dec 20, 2024
56c4ab3
FSDP with AC of selected ops
2015aroras Dec 20, 2024
b5f3a86
Back to AC of just feedforward layers
2015aroras Dec 21, 2024
3fbdeb0
Add new inloop evals
2015aroras Dec 21, 2024
b335cdf
Turn off NCCL debug
2015aroras Dec 21, 2024
30f8f59
Merge branch 'main' into 32B
2015aroras Dec 21, 2024
e17e4b8
Make checkpoint writing respect thread count config
2015aroras Dec 22, 2024
ba49cc4
Add skip step optimizer changes
2015aroras Dec 22, 2024
25ede33
Update 32B config with skip step adamw
2015aroras Dec 22, 2024
ac01e83
Try fix skip step optimizer
2015aroras Dec 22, 2024
ddd61ac
Try manual _std_mean impl
2015aroras Dec 22, 2024
973a26c
Add skip step fixes
2015aroras Dec 22, 2024
baf5700
Have separate save and load thread counts
2015aroras Dec 22, 2024
b6762d8
Decrease threads used for saving
2015aroras Dec 22, 2024
d98f06d
Skipped steps and automatic spike analysis
dirkgr Dec 22, 2024
4a68e9e
Use compile=True for optimizer
2015aroras Dec 22, 2024
d81cd12
Make gcs upload pass generation
2015aroras Dec 23, 2024
0a04034
Update CHANGELOG
2015aroras Dec 23, 2024
5acc7eb
Run formatter
2015aroras Dec 23, 2024
213b03e
Make generation 0 when object does not exist
2015aroras Dec 23, 2024
b4994b0
Merge branch 'shanea/fix-upload-retries' into 32B
2015aroras Dec 23, 2024
3b84351
Run formatting
2015aroras Dec 23, 2024
178d9ad
Remove unneeded import
2015aroras Dec 23, 2024
0b737aa
Add missing reload
2015aroras Dec 23, 2024
3e6f9f1
Updated notebook
dirkgr Dec 23, 2024
663d63a
Updated dashboard
dirkgr Dec 24, 2024
496919b
Update the notebook
dirkgr Dec 24, 2024
a1854bd
Updated notebook
dirkgr Dec 27, 2024
f2de5f4
Retry on bad request
dirkgr Dec 28, 2024
33c0f58
Add some more retries
dirkgr Dec 28, 2024
86afc43
Updated the notebook
dirkgr Dec 29, 2024
2e45a79
Update the dashboard
dirkgr Dec 30, 2024
e4e8fbb
Fix the way we use the step in the optimizer
dirkgr Dec 31, 2024
146caaf
Dashboard update
dirkgr Dec 31, 2024
393a462
Update dashboard
dirkgr Jan 3, 2025
d39c59d
New report
dirkgr Jan 6, 2025
16983c4
Dashboard update
dirkgr Jan 7, 2025
5e4d04f
No more ephemeral checkpoints
dirkgr Jan 8, 2025
eba0418
Don't eval so much
dirkgr Jan 8, 2025
5605001
When you wait on someone, you bring them water.
dirkgr Jan 8, 2025
7ce7efa
Updating the dashboard
dirkgr Jan 8, 2025
05aa94f
Reorder ranks in GCP
dirkgr Jan 9, 2025
9c86bf9
Rank 0 needs to remain rank 0
dirkgr Jan 9, 2025
e27b91d
Slightly less checkpointing
dirkgr Jan 9, 2025
52b9b77
Revert "Slightly less checkpointing"
dirkgr Jan 9, 2025
f045eee
Turn off failure propagation to make slack notifier work better
2015aroras Jan 13, 2025
ddb3084
New dashboard
dirkgr Jan 14, 2025
72e0ed1
Merge branch '32B' of https://github.com/allenai/OLMo-core into 32B
dirkgr Jan 14, 2025
d1d8dcb
hopefully make GCS client calls more robust
epwalsh Jan 14, 2025
a0700e8
Catch user exceptions as well as system exceptions when training fails
2015aroras Jan 15, 2025
0595cf8
Revert "Catch user exceptions as well as system exceptions when train…
2015aroras Jan 15, 2025
22 changes: 13 additions & 9 deletions src/olmo_core/io.py
@@ -532,21 +532,25 @@ def _get_gcs_client():


 def _gcs_is_retriable(exc: Exception) -> bool:
-    from google.api_core.retry import if_transient_error
     from google.api_core.exceptions import BadRequest
+    from google.api_core.retry import if_transient_error

     return (
-        if_transient_error(exc) or
-        isinstance(exc, requests.exceptions.Timeout) or
-        isinstance(exc, BadRequest)  # Weird choice, but Google throws this transiently
+        if_transient_error(exc)
+        or isinstance(exc, requests.exceptions.Timeout)
+        or isinstance(exc, BadRequest)  # Weird choice, but Google throws this transiently
     )


 def _get_gcs_retry():
     from google.api_core.retry import Retry

     return Retry(
-        predicate=_gcs_is_retriable, initial=1.0, maximum=10.0, multiplier=2.0, timeout=600.0
+        predicate=_gcs_is_retriable,  # NOTE: it appears google might ignore this
+        initial=1.0,
+        maximum=10.0,
+        multiplier=2.0,
+        timeout=600.0,
     )


@@ -559,7 +563,7 @@ def _get_gcs_conditional_retry():
     return ConditionalRetryPolicy(_get_gcs_retry(), is_generation_specified, ["query_params"])


-@retriable()
+@retriable(retry_condition=_gcs_is_retriable)
 def _gcs_file_size(bucket_name: str, key: str) -> int:
     from google.api_core.exceptions import NotFound

@@ -574,7 +578,7 @@ def _gcs_file_size(bucket_name: str, key: str) -> int:
     return blob.size


-@retriable()
+@retriable(retry_condition=_gcs_is_retriable)
 def _gcs_get_bytes_range(bucket_name: str, key: str, bytes_start: int, num_bytes: int) -> bytes:
     from google.api_core.exceptions import NotFound

@@ -590,7 +594,7 @@ def _gcs_get_bytes_range(bucket_name: str, key: str, bytes_start: int, num_bytes
     )


-@retriable()
+@retriable(retry_condition=_gcs_is_retriable)
Review comments on this line:

Contributor: This general approach sort of blows up our retry time from 10 mins to 30 mins. Sort of not a fan.

Contributor: But at least it looks like it works.

Member: We could always reduce the deadline/timeout.
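
The "10 mins to 30 mins" concern in the thread above can be illustrated with a small sketch. It assumes an outer retry decorator that makes up to 3 attempts, and an inner retry that uses its full 600-second budget before giving up. The attempt count is inferred from the comment, not confirmed by the diff, and `retriable_sketch` is a hypothetical stand-in, not the `retriable` decorator from `olmo_core`.

```python
import functools
from typing import Callable


def retriable_sketch(
    max_attempts: int = 3,  # ASSUMPTION: inferred from the "10 mins to 30 mins" comment
    retry_condition: Callable[[Exception], bool] = lambda exc: False,
):
    """Hypothetical outer retry decorator (not the olmo_core implementation)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or when the condition rejects the error.
                    if attempt == max_attempts or not retry_condition(exc):
                        raise

        return wrapper

    return decorator


INNER_TIMEOUT_S = 600.0  # the timeout= value passed to Retry in the diff
calls = []  # seconds "spent" by each outer attempt


@retriable_sketch(max_attempts=3, retry_condition=lambda exc: isinstance(exc, TimeoutError))
def flaky_upload():
    # Pretend the inner retry used its whole budget before raising.
    calls.append(INNER_TIMEOUT_S)
    raise TimeoutError("inner retry budget exhausted")


try:
    flaky_upload()
except TimeoutError:
    pass

worst_case_minutes = sum(calls) / 60
print(len(calls), worst_case_minutes)  # 3 30.0
```

Under the same assumptions, reducing the inner timeout to 200 seconds would bring the worst case back to roughly 10 minutes, which is the trade-off the last comment points at.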

 def _gcs_upload(source: Path, bucket_name: str, key: str, save_overwrite: bool = False):
     storage_client = _get_gcs_client()
     bucket = storage_client.bucket(bucket_name)
@@ -612,7 +616,7 @@ def _gcs_upload(source: Path, bucket_name: str, key: str, save_overwrite: bool =
     )


-@retriable()
+@retriable(retry_condition=_gcs_is_retriable)
 def _gcs_clear_directory(bucket_name: str, prefix: str):
     from google.api_core.exceptions import NotFound

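
For reference, the `Retry` parameters in the diff (`initial=1.0`, `maximum=10.0`, `multiplier=2.0`) describe an exponential backoff that is capped at 10 seconds. The sketch below prints only the nominal delay ceiling for the first few retries. It is an illustration, not the library's code: `nominal_backoff` is a hypothetical helper, and the real `google.api_core` retry adds randomized jitter and stops once the 600-second timeout is reached.

```python
from itertools import islice
from typing import Iterator


def nominal_backoff(initial: float, maximum: float, multiplier: float) -> Iterator[float]:
    """Yield capped exponential delays (no jitter, no overall timeout)."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * multiplier, maximum)


# Same values as the Retry(...) call in the diff.
first_delays = list(islice(nominal_backoff(1.0, 10.0, 2.0), 6))
print(first_delays)  # [1.0, 2.0, 4.0, 8.0, 10.0, 10.0]
```

After the fourth retry the delay stays at the 10-second cap, so most of a 10-minute budget would be spent on evenly spaced attempts rather than on growing waits.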