Merge with upstream #48

Quentin-Anthony · 2024-01-15T18:32:40Z

No description provided.

Co-authored-by: jianbinc
Co-authored-by: 顾慎
Co-authored-by: 李鹏
Co-authored-by: 黄俊
Co-authored-by: lostkevin
Co-authored-by: root

Optimizer CPU offload support See merge request ADLR/megatron-lm!2526

Quentin-Anthony self-assigned this Jan 15, 2024

mathemakitten and others added 29 commits December 20, 2024 21:51

ADLR/megatron-lm!2469 - Correct strides for bshd layout and revert RoPE tests to use fusion

7bb5379

Co-authored-by: Mcore Bot
Co-authored-by: Xin Yao

Merge branch 'helenn-rope-fusion-mem-layout' into 'main'

25b1f33

Correct strides for bshd layout and revert RoPE tests to use fusion See merge request ADLR/megatron-lm!2469

ADLR/megatron-lm!2494 - Add model checkpoint links

1da9dad

Co-authored-by: Boxin Wang

Merge branch 'boxin/nvlm_ckpt_release' into 'main'

cf25d44

Add model checkpoint links See merge request ADLR/megatron-lm!2494

ADLR/megatron-lm!2285 - Support --freeze-LM and --freeze-ViT with ranks that don't have trainable params

1468ab0

Co-authored-by: Jon Barker

Merge branch 'jbarker/pp_unfreeze' into 'main'

d3c585e

Support --freeze-LM and --freeze-ViT with ranks that don't have trainable params See merge request ADLR/megatron-lm!2285

ADLR/megatron-lm!2491 - Move mmodal evaluation code to its own folder

e51a3ac

Merge branch 'mmodal_eval_in_folder' into 'main'

2da43ef

Move mmodal evaluation code to its own folder See merge request ADLR/megatron-lm!2491

ADLR/megatron-lm!2471 - Updating T5 codes to fix bugs

48103f4

Co-authored-by: Huy Vu2

Merge branch 'huvu/t5_fixes_updates' into 'main'

076972e

Updating T5 codes to fix bugs See merge request ADLR/megatron-lm!2471

ADLR/megatron-lm!2467 - ci: Add memory consumption to tests

9238a5e

Merge branch 'ko3n1g/tests/add-memory-consumption' into 'main'

24e0126

ci: Add memory consumption to tests See merge request ADLR/megatron-lm!2467

ADLR/megatron-lm!2483 - Reuse optimizer's main_params to compute param norm in a memory-efficient way

079dc66

Merge branch 'dnarayanan/fix_param_norm_memory_main' into 'main'

f682bd0

Reuse optimizer's main_params to compute param norm in a memory-efficient way See merge request ADLR/megatron-lm!2483

ADLR/megatron-lm!2460 - Add NeMo MoE test.

a6ba070

Co-authored-by: Oliver Koenig

Merge branch 'denliu/moe_nemo_test' into 'main'

30ffe88

Add NeMo MoE test. See merge request ADLR/megatron-lm!2460

ADLR/megatron-lm!2496 - ci: Move most of LTS tests to nightly

47b8470

Merge branch 'ko3n1g/ci/prune-tests' into 'main'

2d7c521

ci: Move most of LTS tests to nightly See merge request ADLR/megatron-lm!2496

ADLR/megatron-lm!2500 - Video training

82a6dfd

Merge branch 'video_training' into 'main'

86e5481

Video training See merge request ADLR/megatron-lm!2500

ADLR/megatron-lm!2511 - ci: Update golden values of nightlies

c383fe9

Merge branch 'ko3n1g/ci/update-nightlies' into 'main'

15517f6

ci: Update golden values of nightlies See merge request ADLR/megatron-lm!2511

ADLR/megatron-lm!2370 - Make generate function only return results for newly added requests

342e359

Merge branch 'generate_fix' into 'main'

df28200

Make generate function only return results for newly added requests See merge request ADLR/megatron-lm!2370

ADLR/megatron-lm!2507 - ci: Use torchrun

6e09dd4

Merge branch 'ko3n1g/ci/use-torchrun' into 'main'

ab171c5

ci: Use torchrun See merge request ADLR/megatron-lm!2507

ADLR/megatron-lm!2519 - chore: Fix local generator script

c8d12e6

Merge branch 'ko3n1g/chore/fix-local-generator-script' into 'main'

65720c8

chore: Fix local generator script See merge request ADLR/megatron-lm!2519

ADLR/megatron-lm!2430 - Fix log probs output for inference

5ff34d0

Co-authored-by: William Dykas
Co-authored-by: Mcore Bot
Co-authored-by: root

ananthsub and others added 30 commits February 11, 2025 11:42

ADLR/megatron-lm!2458 - [dist ckpt] Remove alias LocalNonpersitentObject

8f816d4

Merge branch 'remove-persistent-alias-ckpt' into 'main'

f2f8101

[dist ckpt] Remove alias LocalNonpersitentObject See merge request ADLR/megatron-lm!2458

ADLR/megatron-lm!2623 - [dist ckpt] Resolve todos in `_split_by_size_and_type` during saving

54e1db0

Merge branch 'ckpt-split-by-size-type' into 'main'

79e9894

[dist ckpt] Resolve todos in `_split_by_size_and_type` during saving See merge request ADLR/megatron-lm!2623

ADLR/megatron-lm!2651 - Embedder fix for radio

1878be3

Co-authored-by: Mcore Bot

Merge branch 'tpoon/radio_fix_mr' into 'main'

bfd1840

Embedder fix for radio See merge request ADLR/megatron-lm!2651

ADLR/megatron-lm!2670 - tests: test_builder

77e3593

Merge branch 'ko3n1g/tests/data-builder' into 'main'

aa719a0

tests: test_builder See merge request ADLR/megatron-lm!2670

ADLR/megatron-lm!2669 - Llava unit test fix

68b6119

Merge branch 'trintamaki/llava-unit-test-fix' into 'main'

55cdfc1

Llava unit test fix See merge request ADLR/megatron-lm!2669

ADLR/megatron-lm!2667 - Warn instead of error when model_opt is enabled with local checkpointing.

850ac6d

Merge branch 'skierat/local_vs_model_opt' into 'main'

eb7092e

Warn instead of error when model_opt is enabled with local checkpointing. See merge request ADLR/megatron-lm!2667

ADLR/megatron-lm!2633 - Reduce NCCL memory cost in UT

7e748bf

Merge branch 'denliu/reduce_ut_memory' into 'main'

8ca9e57

Reduce NCCL memory cost in UT See merge request ADLR/megatron-lm!2633

ADLR/megatron-lm!2116 - Enabling UCC backend for PP communication

50d8475

Merge branch 'ucc_work' into 'main'

5b47af6

Enabling UCC backend for PP communication See merge request ADLR/megatron-lm!2116

ADLR/megatron-lm!2632 - Fix for Frozen QK LayerNorm when training VLM with freeze VIT

f8ed25c

Merge branch 'pmannan/fix_qk_ln_freeze' into 'main'

ac3884a

Fix for Frozen QK LayerNorm when training VLM with freeze VIT See merge request ADLR/megatron-lm!2632

ADLR/megatron-lm!2680 - chore: Bump version

09e76b9

Merge branch 'ko3n1g/chore/bump' into 'main'

5575cfc

chore: Bump version See merge request ADLR/megatron-lm!2680

ADLR/megatron-lm!2561 - Basic context and sequence parallel support in VLM example

7c2239a

Merge branch 'trintamaki/vlm-example-cp-sp' into 'main'

78fc935

Basic context and sequence parallel support in VLM example See merge request ADLR/megatron-lm!2561

Merge branch 'optimizer_cpu_offload_poc' into 'main'

3364154

Optimizer CPU offload support See merge request ADLR/megatron-lm!2526

ADLR/megatron-lm!2675 - Fix distributed checkpoint tests

4052d61

Merge branch 'skierat/fix_test_local' into 'main'

d6985c4

Fix distributed checkpoint tests See merge request ADLR/megatron-lm!2675

ADLR/megatron-lm!2641 - Statically allocate KV cache for MCore inference

4d3d6b2

Co-authored-by: Slawomir Kierat
Co-authored-by: Helen Ngo
Co-authored-by: Mcore Bot

Merge branch 'static_inference_params' into 'main'

6673956

Statically allocate KV cache for MCore inference See merge request ADLR/megatron-lm!2641

ADLR/megatron-lm!2659 - Various improvements to RerunStateMachine

4dc6b71

Co-authored-by: Deepak Narayanan
Co-authored-by: Cyril Meurillon

Merge branch 'fix-backward-checkpoint' into 'main'

9a496c9

Various improvements to RerunStateMachine See merge request ADLR/megatron-lm!2659
