Features to match OpenLM #302
Conversation
I worry about caching, because it seems to screw up FSDP every time we try it. So let's make sure this runs on LUMI before we merge.
Agreed, we should definitely test on LUMI before merging. I'm not too worried about these changes though because we've been doing the same thing with the ALiBi bias.
Some initial comments, since this is still a draft
Avoiding buffers? Why does that make a difference?
On Tue, Oct 3, 2023, 19:15 Pete wrote:
> After avoiding buffers with RoPE there is a huge improvement!
> https://github.com/allenai/LLM/assets/8812459/2d62d55a-c19e-415d-ba93-9243a0f0e386
My first thought was that it's because buffers are stored in bf16 with our FSDP settings, so we lose some precision when RoPE is applied. And I think this is still true, but I found a bigger issue. It turns out using meta-device deferred initialization introduced a bug with our RoPE "inv_freq" buffer. This buffer is initialized when its module is initialized, but there's no data in the buffer since it's a meta-device tensor, and later on FSDP calls Module.to_empty(), which then causes those buffers to materialize to all zeros. In other words, when model.init_device is set to "meta", this line is essentially ignored and instead "inv_freq" ends up all zeros: https://github.com/allenai/LLM/blob/602968ae92294b5eeb70e7422d073cb0183166fd/olmo/model.py#L251
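For reference, a minimal, hedged sketch of that failure mode (not the OLMo code; the module name and buffer shape are illustrative): a buffer created during meta-device initialization carries no data, and `Module.to_empty()` later materializes it with uninitialized storage rather than re-running the initialization.

```python
import torch
import torch.nn as nn

class RotaryCache(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        # Buffer is created on whatever device the module is constructed on.
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

# Deferred init: build the module on the meta device, so inv_freq has no data.
with torch.device("meta"):
    m = RotaryCache()

# FSDP's deferred initialization eventually calls to_empty(), which allocates
# uninitialized storage (in practice often zeros) instead of re-running __init__.
m.to_empty(device="cpu")
print(m.inv_freq)  # not the intended inverse frequencies
```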
Does that mean all my Rope experiments didn't really work?
Isn't the Alibi stuff stored the same way?
No, the ALiBi bias is stored differently. We realized early on that buffers didn't work well for some reason and we made that fix with ALiBi, but I guess we never fixed the same issue with RoPE because we weren't using it at the time.
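A minimal sketch of the general pattern being described, caching the bias as a plain attribute built lazily on a real device rather than as a registered buffer, so meta-device init and `to_empty()` never touch it. The names and slope formula are illustrative, not the actual OLMo implementation.

```python
from typing import Optional
import torch
import torch.nn as nn

class AttentionWithCachedBias(nn.Module):
    def __init__(self, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        # Plain attribute, NOT register_buffer(), so FSDP/meta init won't zero it.
        self._alibi_bias: Optional[torch.Tensor] = None

    def get_alibi_bias(self, seq_len: int, device: torch.device) -> torch.Tensor:
        # Build lazily on first use (on a real device), then reuse.
        if self._alibi_bias is None or self._alibi_bias.size(-1) < seq_len:
            # Standard ALiBi slopes for a power-of-two head count (illustrative).
            slopes = 2.0 ** (-8.0 * torch.arange(1, self.n_heads + 1, device=device) / self.n_heads)
            pos = torch.arange(seq_len, device=device)
            rel = -(pos[None, :] - pos[:, None]).abs()             # (T, T) relative distances
            self._alibi_bias = slopes[:, None, None] * rel[None]   # (H, T, T)
        return self._alibi_bias[..., :seq_len, :seq_len]
```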
I think so. We should run those again.
Maybe changes because of the sequence length thing with Rope?
configs/mcli/v1-mix-medium.yaml (Outdated)
ssh_clone: true
command: |-
  pip install urllib3==1.26.17
What is this for?
This was MosaicML's recommendation for solving the SSLError. It didn't work, so I removed it.
return pos_sin, pos_cos

def forward(self, q: torch.Tensor, k: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    q_, k_ = q.float(), k.float()
Since we're messing with precision here, do we need to disable autocast?
I don't think any of the operations in this forward method would autocast to bf16, but just to make sure: 62fcb47
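For illustration, a hedged sketch of that kind of guard, assuming a `rotate_half`-style RoPE application; this is not the actual change in 62fcb47, just the general shape of disabling autocast around the full-precision rotation.

```python
from typing import Tuple
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Illustrative helper, not the OLMo one.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_full_precision(
    q: torch.Tensor, k: torch.Tensor, pos_sin: torch.Tensor, pos_cos: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Run the rotation with autocast disabled so it stays in fp32 even when the
    # surrounding forward pass runs under bf16/fp16 autocast.
    with torch.autocast(q.device.type, enabled=False):
        q_, k_ = q.float(), k.float()
        q_ = q_ * pos_cos + rotate_half(q_) * pos_sin
        k_ = k_ * pos_cos + rotate_half(k_) * pos_sin
    return q_.type_as(q), k_.type_as(k)
```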
Adds some features to allow us to more closely match the architecture from Mitchell's OpenLM.

- Adds a new `--model.weight_tying` option (`bool`, defaults to `True`) that allows us to disable weight tying of the input embedding with the output linear.
- Adds an option to round the MLP hidden size (`mlp_ratio * d_model`) to a multiple of 256 like Mitchell does here. (A rough sketch of both options follows at the end of this description.)

The configuration I'm running in my `mitch-ish` runs (see W&B) is relatively slow compared to our defaults, but there are some low-hanging improvements we can make, e.g. scripting the `apply_rotary_pos_emb` function with TorchScript like Mitchell does. Can we trust torchscript on AMD? We'll find out.

Other changes:

- Moved the rotary embedding logic into its own `RotaryEmbedding` module. This makes way more sense in my opinion and simplifies the `OlmoBlock.attention` implementation. (7fc33c5)
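Below is a hedged sketch of the two options above: the optional untied output head and rounding the MLP hidden size to a multiple of 256. The class and helper names, defaults, and the round-up rule are illustrative and not taken from the actual OLMo config schema.

```python
import torch
import torch.nn as nn

def round_to_multiple(x: int, multiple: int = 256) -> int:
    # Round the MLP hidden size (mlp_ratio * d_model) up to a multiple of 256.
    return ((x + multiple - 1) // multiple) * multiple

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, mlp_ratio: int = 4,
                 weight_tying: bool = True):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        hidden = round_to_multiple(mlp_ratio * d_model)  # e.g. 4 * 1000 -> 4096
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )
        if weight_tying:
            # Tied: the output projection reuses the embedding matrix.
            self.ff_out = None
        else:
            # Untied: a separate output linear layer.
            self.ff_out = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.mlp(self.wte(ids))
        if self.ff_out is None:
            return torch.nn.functional.linear(h, self.wte.weight)  # tied head
        return self.ff_out(h)
```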