Features to match OpenLM #302
Conversation
I worry about caching, because it seems to screw up FSDP every time we try it. So let's make sure this runs on LUMI before we merge.
Agreed, we should definitely test on LUMI before merging. I'm not too worried about these changes though because we've been doing the same thing with the ALiBi bias.
Some initial comments, since this is still a draft
Avoiding buffers? Why does that make a difference?
On Tue, Oct 3, 2023, 19:15 Pete wrote:
> After avoiding buffers with RoPE there is a huge improvement!
> https://github.com/allenai/LLM/assets/8812459/2d62d55a-c19e-415d-ba93-9243a0f0e386
My first thought was that it's because buffers are stored in bf16 with our FSDP settings, so we lose some precision when RoPE is applied. And I think this is still true, but I found a bigger issue. It turns out using meta-device deferred initialization introduced a bug with our RoPE "inv_freq" buffer. This buffer is initialized when its module is initialized, but there's no data in the buffer since it's a meta-device tensor, and later on FSDP calls Module.to_empty(), which then causes those buffers to materialize to all zeros. In other words, when model.init_device is set to "meta", this line is essentially ignored and instead "inv_freq" ends up all zeros: https://github.com/allenai/LLM/blob/602968ae92294b5eeb70e7422d073cb0183166fd/olmo/model.py#L251
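For reference, a minimal, hedged sketch of that failure mode (not the OLMo code; the module name and buffer shape are illustrative): a buffer created during meta-device initialization carries no data, and `Module.to_empty()` later materializes it with uninitialized storage rather than re-running the initialization.

```python
import torch
import torch.nn as nn

class RotaryCache(nn.Module):
    def __init__(self, dim: int = 8):
        super().__init__()
        # Buffer is created on whatever device the module is constructed on.
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer("inv_freq", inv_freq)

# Deferred init: build the module on the meta device, so inv_freq has no data.
with torch.device("meta"):
    m = RotaryCache()

# FSDP's deferred initialization eventually calls to_empty(), which allocates
# uninitialized storage (in practice often zeros) instead of re-running __init__.
m.to_empty(device="cpu")
print(m.inv_freq)  # not the intended inverse frequencies
```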
Does that mean all my Rope experiments didn't really work?
Isn't the Alibi stuff stored the same way?
No, the ALiBi bias is stored differently. We realized early on that buffers didn't work well for some reason and we made that fix with ALiBi, but I guess we never fixed the same issue with RoPE because we weren't using it at the time.
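A minimal sketch of the general pattern being described, caching the bias as a plain attribute built lazily on a real device rather than as a registered buffer, so meta-device init and `to_empty()` never touch it. The names and slope formula are illustrative, not the actual OLMo implementation.

```python
from typing import Optional
import torch
import torch.nn as nn

class AttentionWithCachedBias(nn.Module):
    def __init__(self, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        # Plain attribute, NOT register_buffer(), so FSDP/meta init won't zero it.
        self._alibi_bias: Optional[torch.Tensor] = None

    def get_alibi_bias(self, seq_len: int, device: torch.device) -> torch.Tensor:
        # Build lazily on first use (on a real device), then reuse.
        if self._alibi_bias is None or self._alibi_bias.size(-1) < seq_len:
            # Standard ALiBi slopes for a power-of-two head count (illustrative).
            slopes = 2.0 ** (-8.0 * torch.arange(1, self.n_heads + 1, device=device) / self.n_heads)
            pos = torch.arange(seq_len, device=device)
            rel = -(pos[None, :] - pos[:, None]).abs()             # (T, T) relative distances
            self._alibi_bias = slopes[:, None, None] * rel[None]   # (H, T, T)
        return self._alibi_bias[..., :seq_len, :seq_len]
```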
I think so. We should run those again.
Maybe changes because of the sequence length thing with Rope?
configs/mcli/v1-mix-medium.yaml (Outdated)
ssh_clone: true
command: |-
  pip install urllib3==1.26.17
What is this for?
This was MosaicML's recommendation for solving the SSLError. It didn't work, so I removed it.
return pos_sin, pos_cos

def forward(self, q: torch.Tensor, k: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    q_, k_ = q.float(), k.float()
Since we're messing with precision here, do we need to disable autocast?
I don't think any of the operations in this forward method would autocast to bf16, but just to make sure: 62fcb47
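For illustration, a hedged sketch of that kind of guard, assuming a `rotate_half`-style RoPE application; this is not the actual change in 62fcb47, just the general shape of disabling autocast around the full-precision rotation.

```python
from typing import Tuple
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Illustrative helper, not the OLMo one.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope_full_precision(
    q: torch.Tensor, k: torch.Tensor, pos_sin: torch.Tensor, pos_cos: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Run the rotation with autocast disabled so it stays in fp32 even when the
    # surrounding forward pass runs under bf16/fp16 autocast.
    with torch.autocast(q.device.type, enabled=False):
        q_, k_ = q.float(), k.float()
        q_ = q_ * pos_cos + rotate_half(q_) * pos_sin
        k_ = k_ * pos_cos + rotate_half(k_) * pos_sin
    return q_.type_as(q), k_.type_as(k)
```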
Adds some features to allow us to more closely match the architecture from Mitchell's OpenLM.

- Adds a new `--model.weight_tying` option (`bool`, defaults to `True`) that allows us to disable weight tying of the input embedding with the output linear.
- Adds an option to round the MLP hidden size (`mlp_ratio * d_model`) to a multiple of 256 like Mitchell does here. (A rough sketch of both options follows at the end of this description.)

The configuration I'm running in my `mitch-ish` runs (see W&B) is relatively slow compared to our defaults, but there are some low-hanging improvements we can make, e.g. scripting the `apply_rotary_pos_emb` function with TorchScript like Mitchell does. Can we trust torchscript on AMD? We'll find out.

Other changes:

- Moved the rotary embedding logic into its own `RotaryEmbedding` module. This makes way more sense in my opinion and simplifies the `OlmoBlock.attention` implementation. (7fc33c5)
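Below is a hedged sketch of the two options above: the optional untied output head and rounding the MLP hidden size to a multiple of 256. The class and helper names, defaults, and the round-up rule are illustrative and not taken from the actual OLMo config schema.

```python
import torch
import torch.nn as nn

def round_to_multiple(x: int, multiple: int = 256) -> int:
    # Round the MLP hidden size (mlp_ratio * d_model) up to a multiple of 256.
    return ((x + multiple - 1) // multiple) * multiple

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, mlp_ratio: int = 4,
                 weight_tying: bool = True):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)
        hidden = round_to_multiple(mlp_ratio * d_model)  # e.g. 4 * 1000 -> 4096
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )
        if weight_tying:
            # Tied: the output projection reuses the embedding matrix.
            self.ff_out = None
        else:
            # Untied: a separate output linear layer.
            self.ff_out = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h = self.mlp(self.wte(ids))
        if self.ff_out is None:
            return torch.nn.functional.linear(h, self.wte.weight)  # tied head
        return self.ff_out(h)
```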