Changes for 65B and 70B runs #414

Merged: 130 commits from mitchish65-2 into main, Mar 19, 2024

Conversation

@dirkgr (Member) commented Jan 24, 2024

No description provided.

@Muennighoff (Collaborator) left a comment:

The GQA looks good to me
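
For context, the GQA under review is grouped-query attention: when n_kv_heads is smaller than n_heads, each key/value head is shared by a group of query heads. Below is a minimal PyTorch sketch of the head bookkeeping (illustrative shapes and names only, not the repository's implementation):

import torch

# Illustrative sizes, not taken from any OLMo config.
n_heads, n_kv_heads, head_dim, seq_len = 32, 8, 128, 16

q = torch.randn(1, seq_len, n_heads, head_dim)     # one query head per attention head
k = torch.randn(1, seq_len, n_kv_heads, head_dim)  # fewer key heads
v = torch.randn(1, seq_len, n_kv_heads, head_dim)  # fewer value heads

# Repeat each K/V head so that every group of query heads shares one K/V head.
group_size = n_heads // n_kv_heads
k = k.repeat_interleave(group_size, dim=2)
v = v.repeat_interleave(group_size, dim=2)
assert k.shape[2] == v.shape[2] == n_heads

With n_kv_heads == 1 this reduces to multi-query attention, and with n_kv_heads == n_heads it is ordinary multi-head attention, which is why the field can default to n_heads.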

olmo/config.py (outdated)

@@ -243,6 +242,14 @@ class ModelConfig(BaseConfig):
    The number of self-attention heads.
    """

    n_kv_heads: Optional[int] = None
    """
    The number of heads to use for keys and values.

@Muennighoff (Collaborator) suggested a change:

-    The number of heads to use for keys and values.
+    The number of heads to use for keys and values. Defaults to `n_heads`.

@dirkgr (Member, Author) replied:

Done, I just can't click "commit" here for some reason.

olmo/config.py: two further outdated threads, both resolved.
olmo/config.py (outdated), comment on lines 456 to 457:

if hasattr(new_config, "optimizer"):
    new_config.optimizer = OptimizerConfig.update_legacy_settings(new_config.optimizer)

@epwalsh (Member) commented:
I don't think we need this here

@dirkgr (Member, Author) replied:

I am learning that update_legacy_settings doesn't work anyway with settings you specify on the command line.
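
A plausible explanation for that (an assumption about the load order, not something stated in this thread): the legacy-settings conversion is applied to the config loaded from YAML, while dotted command-line overrides are merged in afterwards, so values given on the command line never pass through the converter. A self-contained OmegaConf sketch of the failure mode, with made-up key names:

from omegaconf import OmegaConf

def update_legacy_settings(cfg):
    # Stand-in for the real converter: rename a hypothetical old key.
    if "old_setting" in cfg:
        cfg.new_setting = cfg.pop("old_setting")
    return cfg

yaml_cfg = OmegaConf.create({"old_setting": 1})
yaml_cfg = update_legacy_settings(yaml_cfg)          # legacy keys from the YAML are rewritten here
cli_cfg = OmegaConf.from_dotlist(["old_setting=2"])  # a legacy key passed on the command line
cfg = OmegaConf.merge(yaml_cfg, cli_cfg)             # is merged afterwards and never converted
print(cfg)  # both keys end up present: the CLI value was never migrated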

@epwalsh (Member) left a comment:

Here's an alternative approach that doesn't involve implementing update_legacy_settings().

olmo/config.py (outdated)

@@ -309,8 +317,7 @@ class ModelConfig(BaseConfig):

    multi_query_attention: bool = False

@epwalsh (Member) commented:

Make this Optional[bool], defaulting to None.

olmo/config.py (outdated), comment on lines 439 to 440:

if self.n_kv_heads is None:
    self.n_kv_heads = self.n_heads

@epwalsh (Member) commented:

Then here we could do this:

if self.multi_query_attention:
    self.n_kv_heads = 1
elif self.n_kv_heads is None:
    self.n_kv_heads = self.n_heads

olmo/config.py (outdated):

        self.n_kv_heads = self.n_heads

    @classmethod
    def update_legacy_settings(cls, config: D) -> D:

@epwalsh (Member) commented:
Then this won't be needed.
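
Taken together, the suggested alternative might look roughly like this (a sketch of the idea only, with the resolution placed in __post_init__ for brevity; the real config class lays this out differently):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    n_heads: int = 16
    """
    The number of self-attention heads.
    """

    n_kv_heads: Optional[int] = None
    """
    The number of heads to use for keys and values. Defaults to `n_heads`.
    """

    multi_query_attention: Optional[bool] = None
    """
    Legacy toggle; when set, it is resolved into `n_kv_heads` below.
    """

    def __post_init__(self):
        # Resolving the legacy flag here means no separate
        # update_legacy_settings() pass is required.
        if self.multi_query_attention:
            self.n_kv_heads = 1
        elif self.n_kv_heads is None:
            self.n_kv_heads = self.n_heads

Defaulting the flag to None rather than False makes it possible to tell "not specified" apart from an explicit False, while the truthiness check still only fires when someone actually turned it on.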

@epwalsh changed the title from "Some more changes for the 65B run" to "Changes for 65B and 70B runs" on Mar 18, 2024.
@dirkgr (Member, Author) left a comment:

I think there is some leftover debug commenting? Other than that, looks good.

# Save metadata.
self._save_metadata(checkpoint_dir, upload_to=upload_to)

# Save config.
self._save_config(checkpoint_dir, upload_to=upload_to)
@dirkgr (Member, Author) commented:

Maybe add a comment explaining why this is now last?
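
One plausible reason to surface in such a comment (an assumption; the thread does not spell it out) is that writing config.yaml after everything else lets its presence act as a completion marker for the checkpoint directory. A tiny sketch of how that property could be used, with hypothetical names:

from pathlib import Path

def checkpoint_is_complete(checkpoint_dir: str) -> bool:
    # If the config is the last artifact written, its presence implies the
    # model/optimizer state and metadata were already saved successfully.
    return (Path(checkpoint_dir) / "config.yaml").exists()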

olmo/model.py (outdated)

@@ -245,7 +260,7 @@ def __init__(self, config: ModelConfig, cache: BufferCache):
        self.config = config
        self.__cache = cache
        # Warm up cache.
-       self.get_rotary_embedding(config.max_sequence_length, _non_meta_init_device(config))
+       # self.get_rotary_embedding(config.max_sequence_length, _non_meta_init_device(config))

@dirkgr (Member, Author) commented:

We don't need this anymore?

@epwalsh (Member) replied:

Reverted.
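
For reference, the line in question pre-populates the rotary position-embedding cache at construction time, so the buffers are built once on a real (non-meta) device instead of lazily on the first forward pass. A rough sketch of that caching pattern, with assumed names and sizes rather than the module's actual code:

import torch

class RotaryCache:
    def __init__(self, head_dim: int = 128, theta: float = 10000.0):
        self.head_dim = head_dim
        self.theta = theta
        self._cache = {}

    def get_rotary_embedding(self, seq_len: int, device: torch.device):
        key = (seq_len, device)
        if key not in self._cache:
            exponents = torch.arange(0, self.head_dim, 2, device=device).float() / self.head_dim
            inv_freq = 1.0 / (self.theta ** exponents)
            positions = torch.arange(seq_len, device=device, dtype=torch.float)
            freqs = torch.outer(positions, inv_freq)
            self._cache[key] = (freqs.sin(), freqs.cos())
        return self._cache[key]

# "Warming up" in __init__ means the first real forward pass is a cache hit.
cache = RotaryCache()
cache.get_rotary_embedding(2048, torch.device("cpu"))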

scripts/beaker/mitchish70.sh (outdated; thread resolved)
SEED=3423
INIT=fan_in
RUN_NAME="fan-in-init-${SEED}"
ARGS="--run_name=${RUN_NAME} --data.seed=6198 --seed=${SEED} --model.init_fn=${INIT} --model.init_std=0.006 --model.init_cutoff_factor=3 --device_train_microbatch_size=4 --model.flash_attention=true --fused_loss=true --evaluators=[] --stop_at=500 --wandb.group=mitchish70-ablate-init --save_interval_ephemeral=100"
@dirkgr (Member, Author) commented:

Looks like this isn't the final config anyway.

@epwalsh (Member) replied:

Cleaned up in 92d2a08.
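
For readers of the flags in that script: --model.init_std=0.006 together with --model.init_cutoff_factor=3 most likely describes a truncated-normal weight initialization with standard deviation 0.006, cut off at three standard deviations (an interpretation of the flag names, not something confirmed in this thread). A minimal sketch:

import torch
import torch.nn as nn

init_std, cutoff_factor = 0.006, 3.0

weight = torch.empty(4096, 4096)
nn.init.trunc_normal_(
    weight,
    mean=0.0,
    std=init_std,
    a=-cutoff_factor * init_std,  # truncate at -3 standard deviations
    b=cutoff_factor * init_std,   # truncate at +3 standard deviations
)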

@epwalsh merged commit 74de51d into main on Mar 19, 2024 (11 checks passed).
@epwalsh deleted the mitchish65-2 branch on March 19, 2024 at 15:50.