Draft PR Adding mistral 0.1 #1131
Conversation
This is ready for review! Something might be up with the self-hosted runner for tests? It seems not to have the proper packages installed, including pytest.
Does this require flash-attention >= 2.3? Sliding window attention is only available from that version (see https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#23-local-ie-sliding-window-attention). With the current Docker image (…
Ah yes, you're correct. I will add a check for this in #1162.
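A minimal sketch of the kind of guard discussed above, assuming the check keys off the installed `flash_attn` version; the actual check lands in #1162 and the argument name used here (`sliding_window_width`) is purely illustrative:

```python
# Hedged sketch: refuse to enable sliding window attention unless
# flash-attn >= 2.3 is installed (window_size was introduced in 2.3).
from packaging import version

try:
    import flash_attn
    _FLASH_SWA_OK = version.parse(flash_attn.__version__) >= version.parse("2.3.0")
except ImportError:
    _FLASH_SWA_OK = False


def check_sliding_window_support(neox_args):
    # `sliding_window_width` is a hypothetical attribute name for this example.
    if getattr(neox_args, "sliding_window_width", None) is not None and not _FLASH_SWA_OK:
        raise ValueError(
            "Sliding window attention requires flash-attn >= 2.3; "
            "please upgrade the package (or the Docker image)."
        )
```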
This is the PR for the October addition of support for Mistral 7B v0.1 in GPT-NeoX, referred to in issue 1050.
Among other things, this PR also adds support for sliding window attention in GPT-NeoX, both through FlashAttention2 and through Megatron.
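To illustrate the FlashAttention2 path, here is a minimal sketch (assuming flash-attn >= 2.3 and a CUDA device) of how sliding window attention is exposed by `flash_attn_func`: `window_size=(left, right)` restricts each query to the previous `left` keys and the following `right` keys, with `(-1, -1)` meaning no window.

```python
import torch
from flash_attn import flash_attn_func

# Toy tensors in the (batch, seqlen, n_heads, head_dim) layout flash-attn expects.
batch, seqlen, n_heads, head_dim = 2, 8192, 32, 128
q = torch.randn(batch, seqlen, n_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Mistral 7B v0.1 uses a 4096-token sliding window; causal=True for decoding,
# so the right side of the window is 0.
out = flash_attn_func(q, k, v, causal=True, window_size=(4096, 0))
```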
An example script is included to show how to run the conversion of a HuggingFace (HF) Mistral 7B v0.1 model into corresponding GPT-NeoX checkpoints.
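This is not the PR's conversion script, but a hedged sketch of the HF side of such a conversion: loading the public `mistralai/Mistral-7B-v0.1` checkpoint with `transformers` and reading the config fields a GPT-NeoX conversion has to map (the actual remapping of parameter names and model-parallel sharding is what the included script handles).

```python
from transformers import AutoConfig, AutoModelForCausalLM

hf_config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(hf_config.hidden_size)          # 4096
print(hf_config.num_hidden_layers)    # 32
print(hf_config.num_attention_heads)  # 32
print(hf_config.num_key_value_heads)  # 8   (grouped-query attention)
print(hf_config.sliding_window)       # 4096 (sliding window attention)

# The conversion then walks the HF state dict and writes out the
# corresponding GPT-NeoX checkpoint shards.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
state_dict = model.state_dict()
```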
The items left to do since then are to:
- test the converted checkpoints against the HF reference model (conversion with pp>0 is not supported). Note: issue 1124 recently added support for conversion back to HF to enable such testing and is also concerned with supporting pp>0 in the conversion scripts.