Features to match OpenLM #302

Merged 43 commits on Oct 10, 2023.

Commits (43):
c6b5ee9 - cast logits to fp32 before output (epwalsh, Sep 28, 2023)
27e7c84 - revert logit manual cast (epwalsh, Sep 28, 2023)
1850ebb - add option for no weight tying (epwalsh, Sep 28, 2023)
b852ec4 - init ff_out weights (epwalsh, Sep 28, 2023)
a2f1517 - fix init device for ff_out (epwalsh, Sep 28, 2023)
387f659 - refactor how we cache buffers (epwalsh, Sep 29, 2023)
160d143 - cache rope sin and cos (epwalsh, Sep 29, 2023)
7fc33c5 - Refactor how RoPE is applied (epwalsh, Sep 29, 2023)
ee95fd3 - remove unused import (epwalsh, Sep 29, 2023)
95c806c - Add back Olmo.device property (epwalsh, Sep 29, 2023)
299b5cc - Merge branch 'main' into petew/tweaks (epwalsh, Sep 29, 2023)
5a628a3 - give cache a type, make it required in constructors (epwalsh, Oct 2, 2023)
ba80eba - Merge branch 'main' into petew/tweaks (epwalsh, Oct 3, 2023)
0e6dfcd - Merge branch 'main' into petew/tweaks (epwalsh, Oct 6, 2023)
b331f8b - add mitch config (epwalsh, Oct 6, 2023)
75c5813 - add option to override hidden size (epwalsh, Oct 6, 2023)
b9805ff - MCLI configs (epwalsh, Oct 6, 2023)
36370d0 - rename config option to `mlp_hidden_size` (epwalsh, Oct 6, 2023)
7e8b88f - don't use adaptive clipping (epwalsh, Oct 6, 2023)
a4577b6 - clean up mcli config (epwalsh, Oct 6, 2023)
6b68368 - Add option to skip pre-train ckpt (for debuggin) (epwalsh, Oct 6, 2023)
20c16da - No QK norm, no affines (epwalsh, Oct 6, 2023)
f51b04e - enable flash (epwalsh, Oct 6, 2023)
de4ba36 - update configs (epwalsh, Oct 6, 2023)
20aca2a - apply rotary in FP32 (epwalsh, Oct 6, 2023)
c47ab78 - clean up (epwalsh, Oct 6, 2023)
e8be916 - Add v1.5 mix mitch-ish (epwalsh, Oct 6, 2023)
d99da62 - Add option to disable SSL with requests to S3 (epwalsh, Oct 6, 2023)
119d4ec - schedule (epwalsh, Oct 7, 2023)
99e3729 - Add LUMI config for mitch (epwalsh, Oct 7, 2023)
4829516 - fix (epwalsh, Oct 7, 2023)
f2813ae - no save overwrite (epwalsh, Oct 7, 2023)
933c6ad - fix affine qnorm config (epwalsh, Oct 8, 2023)
d22ef89 - Merge branch 'main' into petew/tweaks (epwalsh, Oct 8, 2023)
c110059 - Don't download if we don't have to (epwalsh, Oct 8, 2023)
7906bd2 - update configs (epwalsh, Oct 8, 2023)
40cddcd - remove duplicate field (epwalsh, Oct 8, 2023)
d97c172 - clean up (epwalsh, Oct 9, 2023)
f42d0ac - remove urllib3 pin (epwalsh, Oct 9, 2023)
62fcb47 - Ensure RoPE applied in full precision (epwalsh, Oct 9, 2023)
d9fb29c - make rope helpers instance methods (epwalsh, Oct 9, 2023)
580f2e9 - Add note about attention mask (epwalsh, Oct 10, 2023)
657021c - clean up (epwalsh, Oct 10, 2023)
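
Several of the commits above (refactor how we cache buffers, cache rope sin and cos, apply rotary in FP32, Ensure RoPE applied in full precision) concern how rotary position embeddings are computed and cached. The following is a rough sketch of the idea rather than the PR's actual code; all class and method names here are assumptions for illustration:

from typing import Optional

import torch


class RotaryEmbedding(torch.nn.Module):
    # Hypothetical module, not the PR's implementation: caches the RoPE sin/cos
    # tables and applies the rotation in float32 before casting back.

    def __init__(self, dim: int, base: int = 10000):
        super().__init__()
        self.dim = dim
        self.base = base
        self._cos_cached: Optional[torch.Tensor] = None
        self._sin_cached: Optional[torch.Tensor] = None

    def _get_cos_sin(self, seq_len: int, device: torch.device):
        # Rebuild the cache only when it is missing, too short, or on the wrong device.
        if (
            self._cos_cached is None
            or self._cos_cached.shape[-2] < seq_len
            or self._cos_cached.device != device
        ):
            inv_freq = 1.0 / (
                self.base ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim)
            )
            t = torch.arange(seq_len, device=device, dtype=torch.float32)
            freqs = torch.einsum("i,j->ij", t, inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1)
            self._cos_cached = emb.cos()[None, None, :, :]
            self._sin_cached = emb.sin()[None, None, :, :]
        return self._cos_cached[..., :seq_len, :], self._sin_cached[..., :seq_len, :]

    @staticmethod
    def _rotate_half(x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=-1)
        return torch.cat((-x2, x1), dim=-1)

    def forward(self, q: torch.Tensor, k: torch.Tensor):
        # q, k: (batch, n_heads, seq_len, head_dim). Rotate in full precision,
        # then cast back to the inputs' original dtype (e.g. bf16).
        q_, k_ = q.float(), k.float()
        cos, sin = self._get_cos_sin(q_.shape[-2], q_.device)
        q_ = q_ * cos + self._rotate_half(q_) * sin
        k_ = k_ * cos + self._rotate_half(k_) * sin
        return q_.type_as(q), k_.type_as(k)

Caching avoids rebuilding the sin/cos tables on every forward pass, and doing the rotation in float32 reflects the precision concern behind the "apply rotary in FP32" and "Ensure RoPE applied in full precision" commits.
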
31 changes: 31 additions & 0 deletions configs/mcli/v1-mix-medium-mitch-ish.yaml
@@ -0,0 +1,31 @@
run_name: v1-mix-medium-mitch-ish
image: mosaicml/pytorch:2.0.1_cu118-python3.10-ubuntu20.04
gpu_num: 216
cluster: r12z3
gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: allenai/LLM
    git_branch: main # make sure to update this!
    pip_install: -e .
    ssh_clone: true
command: |-
  pip freeze
  mkdir -p /root/.cache/torch/

  export OMP_NUM_THREADS=8
  export LOG_FILTER_TYPE=local_rank0_only
  export OLMO_NO_SSL=1 # we get SSLErrors all the time on this cluster
  #export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

  cd LLM

  torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nnodes $NUM_NODES \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    scripts/train.py configs/v1-mix-medium-mitch-ish-mcli.yaml \
      --run_name=v1-mix-mitch-ish \
      --global_train_batch_size=2160
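
The OLMO_NO_SSL=1 export in this config corresponds to the "Add option to disable SSL with requests to S3" commit. As a hedged sketch only (the helper name and wiring are assumptions, not the repo's code), the flag might be consumed roughly like this when building an S3 client:

import os

import boto3


def make_s3_client():
    # Hypothetical helper: when OLMO_NO_SSL is set (as in the MCLI configs here),
    # skip SSL certificate verification to work around the cluster's recurring SSLErrors.
    disable_ssl = os.environ.get("OLMO_NO_SSL", "0") != "0"
    return boto3.client("s3", verify=not disable_ssl)

For scale, the launch above uses 216 A100s (27 nodes at 8 GPUs each, per --nproc_per_node 8) with --global_train_batch_size=2160, which works out to 10 sequences per device, assuming no gradient accumulation.
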
9 changes: 8 additions & 1 deletion configs/mcli/v1-mix-medium.yaml
@@ -7,12 +7,19 @@ integrations:
   - integration_type: git_repo
     git_repo: allenai/LLM
     git_branch: main # make sure to update this!
-    pip_install: -e .[all]
+    pip_install: -e .
     ssh_clone: true
 command: |-
   pip freeze
   mkdir -p /root/.cache/torch/

+  export OMP_NUM_THREADS=8
+  export LOG_FILTER_TYPE=local_rank0_only
+  export OLMO_NO_SSL=1 # we get SSLErrors all the time on this cluster
+  #export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
+
+  cd LLM
+
   torchrun \
     --master_addr $MASTER_ADDR \
     --master_port $MASTER_PORT \
31 changes: 31 additions & 0 deletions configs/mcli/v1_5-mix-medium-mitch-ish.yaml
@@ -0,0 +1,31 @@
run_name: v1-5-mix-medium-mitch-ish # can't have "_" or "." here
image: mosaicml/pytorch:2.0.1_cu118-python3.10-ubuntu20.04
gpu_num: 216
cluster: r12z3
gpu_type: a100_40gb
integrations:
  - integration_type: git_repo
    git_repo: allenai/LLM
    git_branch: main # make sure to update this!
    pip_install: -e .
    ssh_clone: true
command: |-
  pip freeze
  mkdir -p /root/.cache/torch/

  export OMP_NUM_THREADS=8
  export LOG_FILTER_TYPE=local_rank0_only
  export OLMO_NO_SSL=1
  #export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

  cd LLM

  torchrun \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    --nnodes $NUM_NODES \
    --node_rank $NODE_RANK \
    --nproc_per_node 8 \
    scripts/train.py configs/v1_5-mix-medium-mitch-ish-mcli.yaml \
      --run_name=v1_5-mix-mitch-ish \
      --global_train_batch_size=2160
9 changes: 8 additions & 1 deletion configs/mcli/v1_5-mix-medium.yaml
@@ -7,12 +7,19 @@ integrations:
   - integration_type: git_repo
     git_repo: allenai/LLM
     git_branch: main # make sure to update this!
-    pip_install: -e .[all]
+    pip_install: -e .
     ssh_clone: true
 command: |-
   pip freeze
   mkdir -p /root/.cache/torch/

+  export OMP_NUM_THREADS=8
+  export LOG_FILTER_TYPE=local_rank0_only
+  export OLMO_NO_SSL=1 # we get SSLErrors all the time on this cluster
+  #export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
+
+  cd LLM
+
   torchrun \
     --master_addr $MASTER_ADDR \
     --master_port $MASTER_PORT \
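
Other commits in this PR (add option for no weight tying, init ff_out weights, add option to override hidden size, rename config option to `mlp_hidden_size`) touch the model code rather than these launch configs. Below is a minimal sketch of the untied output projection idea, with all names hypothetical rather than taken from the repo:

import torch
import torch.nn as nn


class ToyLMOutput(nn.Module):
    # Hypothetical illustration of optional weight tying between the token
    # embedding and the final vocabulary projection ("ff_out" in the commit messages).

    def __init__(self, d_model: int, vocab_size: int, weight_tying: bool = True):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        # With weight tying, the embedding matrix doubles as the output projection;
        # without it, a separate ff_out layer is created and initialized on its own.
        self.ff_out = None if weight_tying else nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        if self.ff_out is None:
            return nn.functional.linear(hidden, self.wte.weight)
        return self.ff_out(hidden)

Along the same lines, an `mlp_hidden_size` option would presumably let a config set the feed-forward width directly instead of deriving it from `d_model` and a ratio.
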