Merge with upstream #48

Open
wants to merge 2,889 commits into
base: main
342e359
ADLR/megatron-lm!2370 - Make generate function only return results fo…
santhnm2 Jan 7, 2025
df28200
Merge branch 'generate_fix' into 'main'
ko3n1g Jan 7, 2025
6e09dd4
ADLR/megatron-lm!2507 - ci: Use torchrun
ko3n1g Jan 7, 2025
ab171c5
Merge branch 'ko3n1g/ci/use-torchrun' into 'main'
ko3n1g Jan 7, 2025
c8d12e6
ADLR/megatron-lm!2519 - chore: Fix local generator script
ko3n1g Jan 7, 2025
65720c8
Merge branch 'ko3n1g/chore/fix-local-generator-script' into 'main'
ko3n1g Jan 7, 2025
5ff34d0
ADLR/megatron-lm!2430 - Fix log probs output for inference
wdykas Jan 7, 2025
4dc8977
Merge branch 'wdykas/fix-logprobs' into 'main'
jaredcasper Jan 7, 2025
c99a5fe
ADLR/megatron-lm!2489 - Add tests for MoE models with average_in_coll…
guyueh1 Jan 8, 2025
ad41174
Merge branch 'add_test_for_average_in_collective_ddp' into 'main'
ko3n1g Jan 8, 2025
6ce0da5
ADLR/megatron-lm!2509 - ci: Allow running nemo-ci
ko3n1g Jan 8, 2025
05780f3
Merge branch 'ko3n1g/ci/run-nemo-ci' into 'main'
ko3n1g Jan 8, 2025
9220838
ADLR/megatron-lm!2520 - ci: Fail-fast on unit tests
ko3n1g Jan 8, 2025
1ce944c
Merge branch 'ko3n1g/ci/fail-fast-unit-tests' into 'main'
ko3n1g Jan 8, 2025
72e86a6
ADLR/megatron-lm!2516 - remove tensorstore pin
pstjohn Jan 8, 2025
a26b93d
Merge branch 'pstjohn/remove-tensorstore-pin' into 'main'
ericharper Jan 8, 2025
67130c9
ADLR/megatron-lm!2522 - ci: nemo-ci inputs
ko3n1g Jan 8, 2025
bafab5a
Merge branch 'ko3n1g/ci/fix-inputs-to-nemo-ci' into 'main'
ko3n1g Jan 8, 2025
a852cb9
ADLR/megatron-lm!2428 - Adding (bias-based) relative position embeddi…
huvunvidia Jan 9, 2025
93cb1c1
Merge branch 'huvu/relative_posemd_attention_bias' into 'main'
jaredcasper Jan 9, 2025
fa93a05
ADLR/megatron-lm!2429 - Inference CUDA graphs (MCore version)
mathemakitten Jan 9, 2025
8fba594
Merge branch 'hn-inference-cudagraphs-mcore' into 'main'
jaredcasper Jan 9, 2025
458bfc9
ADLR/megatron-lm!2523 - Fix bug when loading pp>1 model with frozen l…
jon-barker Jan 9, 2025
3046e33
Merge branch 'jbarker/debug_pp_convert' into 'main'
jaredcasper Jan 9, 2025
f27a04f
ADLR/megatron-lm!2426 - Make MoE token dispatcher cuda graph-able if …
guyueh1 Jan 10, 2025
726da58
Merge branch 'graphable_token_dispatch' into 'main'
jaredcasper Jan 10, 2025
b41bcba
ADLR/megatron-lm!2514 - ci: Implement `frozen-ckpt` tests
ko3n1g Jan 12, 2025
c76410a
Merge branch 'ko3n1g/ci/frozen-ckpt' into 'main'
ko3n1g Jan 12, 2025
3d3d865
ADLR/megatron-lm!2479 - Add missing key mapping for mixtral TRT-LLM e…
Jan 13, 2025
6091272
Merge branch 'pikaminski/mcore_export_add_mixtral_mapping' into 'main'
jaredcasper Jan 13, 2025
0bd907f
ADLR/megatron-lm!2537 - ci: Exit FTs
ko3n1g Jan 13, 2025
6486203
Merge branch 'ko3n1g/ci/exit-fts' into 'main'
ko3n1g Jan 13, 2025
770c392
ADLR/megatron-lm!2454 - Fix typing issues with MCore inference API an…
santhnm2 Jan 14, 2025
c1cafa1
Merge branch 'ksanthanam/mcore_inference_typing_fixes' into 'main'
ko3n1g Jan 14, 2025
7f12787
ADLR/megatron-lm!2554 - Update 01.test.yml
ko3n1g Jan 15, 2025
699a0ec
Merge branch 'okoenig-main-patch-69275' into 'main'
ko3n1g Jan 15, 2025
4364bfb
ADLR/megatron-lm!2540 - ci: Retry on failed logs
ko3n1g Jan 15, 2025
004fbcb
Merge branch 'ko3n1g/ci/retry-on-missing-downloads' into 'main'
ko3n1g Jan 15, 2025
4aada1b
ADLR/megatron-lm!2541 - ci: Add frozen checkpoints
ko3n1g Jan 16, 2025
b835a10
Merge branch 'ko3n1g/ci/finalize-frozen-ckpt-tests' into 'main'
ko3n1g Jan 16, 2025
95eeffe
ADLR/megatron-lm!2543 - ci: Update docs
ko3n1g Jan 16, 2025
3e3f61d
Merge branch 'ko3n1g/ci/update-docs' into 'main'
ko3n1g Jan 16, 2025
cb678cc
ADLR/megatron-lm!2431 - Change Megatron text generation frontend to M…
mathemakitten Jan 17, 2025
e5793c0
Merge branch 'helenn-mcore-textgen-server' into 'main'
jaredcasper Jan 17, 2025
66a3a00
ADLR/megatron-lm!2548 - Fix packed sequence unit test
trintamaki Jan 17, 2025
0932494
Merge branch 'trintamaki/fix-packing-test' into 'main'
ko3n1g Jan 17, 2025
131acc2
ADLR/megatron-lm!2556 - Fix: total_flops value tracking and add one_l…
PytLab Jan 17, 2025
d1739b7
Merge branch 'main' into 'main'
deepakn94 Jan 17, 2025
61ed096
ADLR/megatron-lm!2563 - ci: Add `possibly-used-before-assignment`
ko3n1g Jan 17, 2025
60d245f
Merge branch 'ko3n1g/ci/add-e0606' into 'main'
ko3n1g Jan 17, 2025
f19858e
ADLR/megatron-lm!2546 - ci: Remove check out of src branch
ko3n1g Jan 17, 2025
c614252
Merge branch 'ko3n1g/ci/fix-autoformatter' into 'main'
ko3n1g Jan 17, 2025
f85b6b1
ADLR/megatron-lm!2562 - ci: Fetch exit-code
ko3n1g Jan 17, 2025
e02a860
Merge branch 'ko3n1g/ci/fetch-exitcode' into 'main'
ko3n1g Jan 17, 2025
4e87b4c
ADLR/megatron-lm!2534 - refactor: Make `get_mlp_module_spec` public
ko3n1g Jan 17, 2025
fa35226
Merge branch 'ko3n1g/refactor/make-get_mlp_module_spec-public' into '…
ko3n1g Jan 17, 2025
37a900f
ADLR/megatron-lm!2555 - set weight_only to False
dimapihtar Jan 17, 2025
c7bf403
Merge branch 'dpykhtar/fix_load_ckpt' into 'main'
jaredcasper Jan 17, 2025
57c392b
ADLR/megatron-lm!2571 - ci: Better output
ko3n1g Jan 18, 2025
7ba0d6d
Merge branch 'ko3n1g/ci/ci-output' into 'main'
ko3n1g Jan 18, 2025
9c11ab4
ADLR/megatron-lm!2510 - chore: Bump versions
ko3n1g Jan 20, 2025
4fb4c3d
Merge branch 'ko3n1g/chore/bump-versions' into 'main'
ko3n1g Jan 20, 2025
f29bf42
ADLR/megatron-lm!2572 - New GPT memory and speed tests
deepakn94 Jan 20, 2025
7dd2658
Merge branch 'dnarayanan/speed_and_functional_tests' into 'main'
ko3n1g Jan 20, 2025
e8336b1
ADLR/megatron-lm!2513 - add group_desc when invoking new_group()
sanshang-nv Jan 21, 2025
df70c00
Merge branch 'add_pg_desc' into 'main'
ko3n1g Jan 21, 2025
f950178
ADLR/megatron-lm!2576 - feat: Log `max-allocated-mem` to TB
ko3n1g Jan 21, 2025
f8887ce
Merge branch 'ko3n1g/ci/write-max-allocated-mem-to-tensorboard' into …
ko3n1g Jan 21, 2025
9bbb40d
ADLR/megatron-lm!2532 - Fix bug in !2426
guyueh1 Jan 22, 2025
f73f20c
Merge branch 'fix_moe_drop_and_pad' into 'main'
jaredcasper Jan 22, 2025
f8e4d27
ADLR/megatron-lm!2568 - Assert image token exists in multimodal example
Jan 22, 2025
4064396
Merge branch 'matthieul/fail_on_missing_token' into 'main'
jon-barker Jan 22, 2025
dc251d7
ADLR/megatron-lm!2578 - ci: Catch `UnicodeDecodeError`
ko3n1g Jan 22, 2025
33de8a5
Merge branch 'ko3n1g/ci/catch-log-error' into 'main'
ko3n1g Jan 22, 2025
cf1b0d4
ADLR/megatron-lm!2569 - Bug fix in get_pipeline_model_parallel_last_rank
Jan 22, 2025
564dbd7
Merge branch 'mike/pp_fix_1' into 'main'
ko3n1g Jan 22, 2025
2b61030
ADLR/megatron-lm!2550 - Add llama 3.1 support for mmodal example
Jan 22, 2025
ae1c43d
Merge branch 'add_llama_support' into 'main'
jon-barker Jan 22, 2025
ffdb6dc
ADLR/megatron-lm!2299 - chore: Bump PyT to 24.10
ko3n1g Jan 22, 2025
66e5306
Merge branch 'ko3n1g/chore/bump-pyt-24.10' into 'main'
ko3n1g Jan 22, 2025
e2a0e9e
ADLR/megatron-lm!2575 - Feature/wandb artifact log checkpoint
talorabr Jan 22, 2025
27756f4
Merge branch 'feature/wandb_artifact_log_checkpoint' into 'main'
ko3n1g Jan 22, 2025
4df1108
ADLR/megatron-lm!2579 - fix: Setup requirements
ko3n1g Jan 22, 2025
5c12382
Merge branch 'ko3n1g/fix/setup' into 'main'
ko3n1g Jan 22, 2025
244ec97
ADLR/megatron-lm!2468 - Handle 1d flatten shard-tensor edge-case
shjwudp Jan 22, 2025
0d59157
Merge branch 'fix_1d_flatten_tensor' into 'main'
ko3n1g Jan 22, 2025
46fdcd5
ADLR/megatron-lm!2577 - ci: Enable checks for `mem-max-allocated-bytes`
ko3n1g Jan 23, 2025
9e8690f
Merge branch 'ko3n1g/ci/add-max-memory-to-checks' into 'main'
ko3n1g Jan 23, 2025
a407351
ADLR/megatron-lm!2567 - Add prep_batch_for_inference_input function a…
santhnm2 Jan 23, 2025
2167226
Merge branch 'inference_api_simplifications' into 'main'
jaredcasper Jan 23, 2025
09c20f6
ADLR/megatron-lm!2485 - Refactor transformer layer offset to public
yanring Jan 25, 2025
127ef26
Merge branch 'zijiey/new_get_layer_offset' into 'main'
ericharper Jan 25, 2025
3750d21
ADLR/megatron-lm!2161 - Update converter docs
lmcafee-nvidia Jan 25, 2025
f960d4d
Merge branch 'lmcafee/converter-docs-sep24' into 'main'
ericharper Jan 25, 2025
0a43540
ADLR/megatron-lm!2358 - Add repr for parallel linear layers
akoumpa Jan 26, 2025
d57d110
Merge branch 'akoumparouli/add_repr_for_parallel_linear' into 'main'
ericharper Jan 26, 2025
9ad69c6
ADLR/megatron-lm!2595 - build: Fix LTS container
ko3n1g Jan 27, 2025
aa6081d
Merge branch 'ko3n1g/build/fix-lts-container' into 'main'
ko3n1g Jan 27, 2025
9fe4ea7
ADLR/megatron-lm!2498 - Add TP Support for Sequence Auxiliary Loss
xxuwenc Jan 27, 2025
0e85db5
Merge branch 'seq_aux_loss_tp' into 'main'
ko3n1g Jan 27, 2025
4b26b14
ADLR/megatron-lm!2597 - build: Pin triton
ko3n1g Jan 28, 2025
d26a384
Merge branch 'ko3n1g/build/pin-triton' into 'main'
ko3n1g Jan 28, 2025
2b25ad9
ADLR/megatron-lm!2596 - build: Use Python 3.11
ko3n1g Jan 28, 2025
883f5fd
Merge branch 'ko3n1g/ci/build-test-py11' into 'main'
ko3n1g Jan 28, 2025
0689058
ADLR/megatron-lm!2587 - Add safeguard to fail if video is empty
Jan 28, 2025
5cf351f
Merge branch 'add_video_safeguard' into 'main'
jaredcasper Jan 28, 2025
ba8231f
ADLR/megatron-lm!2580 - Fix dataloader save state
Jan 28, 2025
3d1554d
Merge branch 'fix_dataloader_save' into 'main'
jaredcasper Jan 28, 2025
684facb
ADLR/megatron-lm!2531 - Enable CUDA graphs for MCore inference
santhnm2 Jan 28, 2025
d5069b8
Merge branch 'mcore_cuda_graph' into 'main'
jaredcasper Jan 28, 2025
ebf519a
ADLR/megatron-lm!2603 - build: Fix nemo image
ko3n1g Jan 29, 2025
fb591c7
Merge branch 'ko3n1g/build/nemo-image' into 'main'
ko3n1g Jan 29, 2025
cef5154
ADLR/megatron-lm!2217 - Finalize local checkpointing support
skierat Jan 30, 2025
2b4b680
Merge branch 'skierat/finalize_local_checkpointing' into 'main'
ericharper Jan 30, 2025
700ef21
ADLR/megatron-lm!2589 - Fix video training and eval
boxin-wbx Jan 30, 2025
bc12efb
Merge branch 'fix_video_eval' into 'main'
trintamaki Jan 30, 2025
b4076c7
ADLR/megatron-lm!2547 - Standardize NCCL option passing in Megatron Core
afarjallah-nv Jan 31, 2025
a4e2028
Merge branch 'afarjallah/netname' into 'main'
ko3n1g Jan 31, 2025
42a76b9
ADLR/megatron-lm!2610 - Update config for llama 3.1 8b vision projection
Jan 31, 2025
b187295
Merge branch 'matthieul/vision_proj_config_llama_3p1_8b' into 'main'
trintamaki Jan 31, 2025
43738f9
ADLR/megatron-lm!2609 - Add wandb and onelogger to gitignore
Jan 31, 2025
96daf5f
Merge branch 'matthieul/wandb_to_gitignore' into 'main'
jaredcasper Jan 31, 2025
7274b83
ADLR/megatron-lm!2613 - Add loss scaling to mmodal
Jan 31, 2025
dbe8fa0
Merge branch 'matthieul/add_loss_scaling' into 'main'
trintamaki Jan 31, 2025
8c98d2d
ADLR/megatron-lm!2551 - llama3.2 support
trintamaki Feb 1, 2025
4d4621b
Merge branch 'trintamaki/llama3.2-support' into 'main'
trintamaki Feb 1, 2025
fec8601
ADLR/megatron-lm!2528 - barebones RADIO implementation
Feb 1, 2025
9cfad3d
Merge branch 'tpoon/mcore_only_radio_mr' into 'main'
trintamaki Feb 1, 2025
db9527f
ADLR/megatron-lm!2392 - Add streaming support for MCore inference API
santhnm2 Feb 1, 2025
5d1cc3e
Merge branch 'streaming' into 'main'
deepakn94 Feb 1, 2025
85676ca
ADLR/megatron-lm!2593 - Fix blended dataset oversampling, remove reno…
Feb 1, 2025
6356152
Merge branch 'no-blend-renormalization' into 'main'
ericharper Feb 1, 2025
d0a0c78
ADLR/megatron-lm!2530 - Deprecate unused checkpointing module
mikolajblaz Feb 1, 2025
731fbfd
Merge branch 'mblaz/deprecate-two-stage' into 'main'
jaredcasper Feb 1, 2025
6e84ec7
ADLR/megatron-lm!2504 - Simplify CP in LLaVA and move packed sequence…
parthmannan Feb 1, 2025
04f9344
Merge branch 'pmannan/llava_cp' into 'main'
ko3n1g Feb 1, 2025
4d4676e
ADLR/megatron-lm!2328 - Add support of TP for MLA
Shunkangz Feb 2, 2025
ea94163
Merge branch 'MLA_TP' into 'main'
ko3n1g Feb 2, 2025
6508404
ADLR/megatron-lm!2614 - Fix pipeline parallelism bugs in MCore inference
santhnm2 Feb 2, 2025
eedb2fe
Merge branch 'inference_pipeline_parallelism_fix' into 'main'
deepakn94 Feb 2, 2025
68589ec
ADLR/megatron-lm!2529 - Fix distributed checkpointing for fp8 padding…
yaox12 Feb 3, 2025
3366815
Merge branch 'xiny/fix_dist_ckpt_for_fp8_padding' into 'main'
ko3n1g Feb 3, 2025
6016692
ADLR/megatron-lm!2618 - test: Update nightly values
ko3n1g Feb 3, 2025
2a9793d
Merge branch 'ko3n1g/ci/update-nightly-values' into 'main'
ko3n1g Feb 3, 2025
5d609e4
ADLR/megatron-lm!2517 - Reuse global metadata for first saves
Feb 3, 2025
53634e9
Merge branch 'saharon/reuse_global_metadata_for_first_saves' into 'main'
ko3n1g Feb 3, 2025
4a156cb
ADLR/megatron-lm!2497 - Added CP support for partial DistOpt
sanandaraj5597 Feb 3, 2025
284ed81
Merge branch 'partial_distopt_with_cp' into 'main'
jaredcasper Feb 3, 2025
0ed0f70
ADLR/megatron-lm!2560 - Bring in-job restart up-to-date with latest N…
jbieniusiewi Feb 4, 2025
4727616
Merge branch 'fault_tolerance_v03' into 'main'
deepakn94 Feb 4, 2025
ef49083
ADLR/megatron-lm!1961 - Uneven Virtual Pipeline Parallelism
Shunkangz Feb 4, 2025
05949f1
Merge branch 'uneven_vpp' into 'main'
ericharper Feb 4, 2025
c6e3b0c
ADLR/megatron-lm!2026 - Add aux loss free routing.
Victarry Feb 4, 2025
c045c05
Merge branch 'denliu/aux-free-routing' into 'main'
ericharper Feb 4, 2025
8d6c9eb
ADLR/megatron-lm!2417 - Broadcast sharded objects during fully parall…
Feb 4, 2025
6e211f4
Merge branch 'saharon/broadcast_sharded_objects_fully_parallel_load' …
ericharper Feb 4, 2025
3e7ceda
ADLR/megatron-lm!2586 - Support CP + EP with DP last rank ordering
ryantwolf Feb 4, 2025
ca46c53
Merge branch 'rywolf/dp-last' into 'main'
ericharper Feb 4, 2025
10654a4
ADLR/megatron-lm!2635 - ci: update nightly values
ko3n1g Feb 5, 2025
550512a
Merge branch 'ko3n1g/ci/fix-nightlies' into 'main'
ko3n1g Feb 5, 2025
b01ae5f
ADLR/megatron-lm!2604 - Ensure CPU tensors are cloned
mikolajblaz Feb 5, 2025
1b4a0a8
Merge branch 'mblaz/ensure-cpu-clone' into 'main'
ericharper Feb 5, 2025
44d11cb
ADLR/megatron-lm!2636 - ci: Release results
ko3n1g Feb 5, 2025
0ae1d14
Merge branch 'ko3n1g/ci/release-0.10' into 'main'
ko3n1g Feb 5, 2025
d41666d
ADLR/megatron-lm!2503 - Cudagraphable RNG and cudagraph memory fixes
jiemingz Feb 6, 2025
3b9035c
Merge branch 'cudagraph_single_mempool' into 'main'
ko3n1g Feb 6, 2025
f575d3f
ADLR/megatron-lm!2527 - Support MCore MambaModel quantization through…
ChenhanYu Feb 6, 2025
0dd78dd
Merge branch 'chenhany/mamba_modelopt_support' into 'main'
ko3n1g Feb 6, 2025
6213cff
ADLR/megatron-lm!2508 - Disable the FP8 transpose cache when using to…
youngeunkwon0405 Feb 6, 2025
6219d96
Merge branch 'fsdp2_fp8_cache' into 'main'
ericharper Feb 6, 2025
c5d8bfd
ADLR/megatron-lm!2445 - Port multimodal inference to MCore API
santhnm2 Feb 7, 2025
a200b93
Merge branch 'multimodal_mcore_inference' into 'main'
ericharper Feb 7, 2025
8a85c58
ADLR/megatron-lm!2638 - Add a flag to InferenceParams for indicating …
santhnm2 Feb 7, 2025
bcee052
Merge branch 'decode_mode' into 'main'
jaredcasper Feb 7, 2025
e754cf7
ADLR/megatron-lm!2631 - Add offline packing and offline target ratio …
Feb 9, 2025
80f22b7
Merge branch 'matthieul/offline_packing' into 'main'
trintamaki Feb 9, 2025
5a305d2
ADLR/megatron-lm!2647 - Revert to API compatibility with old textgen …
mathemakitten Feb 9, 2025
5d7575d
Merge branch 'helenn-textgen-server-fixes' into 'main'
ericharper Feb 9, 2025
b1022a3
ADLR/megatron-lm!2521 - Support Node-Limited Routing for DeepSeek-V3
xxuwenc Feb 9, 2025
cd4a391
Merge branch 'node_limited_routing' into 'main'
ko3n1g Feb 9, 2025
4d00edb
ADLR/megatron-lm!2654 - ci: Do not print logs for release tests
ko3n1g Feb 9, 2025
2481987
Merge branch 'ko3n1g/ci/improve-release-tests' into 'main'
ko3n1g Feb 9, 2025
8a71e3b
ADLR/megatron-lm!2224 - MoE permute/unpermute fusion
hxbai Feb 10, 2025
044e2ad
Merge branch 'hongxiaob/permute_fusion' into 'main'
ko3n1g Feb 10, 2025
062681d
ADLR/megatron-lm!2606 - VLM FSDP workaround
trintamaki Feb 10, 2025
7debdd5
Merge branch 'trintamaki/vlm-fsdp-workaround' into 'main'
trintamaki Feb 10, 2025
6eef69a
ADLR/megatron-lm!2620 - Feature/wandb load checkpoint
talorabr Feb 10, 2025
5b8043e
Merge branch 'feature/wandb_load_checkpoint' into 'main'
jon-barker Feb 10, 2025
313706b
ADLR/megatron-lm!2662 - Bug fixes for MCore inference
santhnm2 Feb 11, 2025
26ad9b3
Merge branch 'inference_bug_fixes' into 'main'
ko3n1g Feb 11, 2025
fb661ce
ADLR/megatron-lm!2660 - ci: Exit code unit tests
ko3n1g Feb 11, 2025
6be9815
Merge branch 'ko3n1g/ci/exit-code-unit-tests' into 'main'
ko3n1g Feb 11, 2025
20571df
ADLR/megatron-lm!2665 - ci: Re-enable legacy test suite
ko3n1g Feb 11, 2025
bcd2934
Merge branch 'ko3n1g/ci/legacy-suite' into 'main'
ko3n1g Feb 11, 2025
0b4bd8e
ADLR/megatron-lm!2573 - Fix MLA breakage and MLA inference
Shunkangz Feb 11, 2025
b5836f8
Merge branch 'MLA_Fix' into 'main'
ko3n1g Feb 11, 2025
f2ece18
ADLR/megatron-lm!2611 - Added param remainder switch
sanandaraj5597 Feb 11, 2025
749dbb0
Merge branch 'adam_param_remainder' into 'main'
ko3n1g Feb 11, 2025
8f816d4
ADLR/megatron-lm!2458 - [dist ckpt] Remove alias LocalNonpersitentObject
ananthsub Feb 11, 2025
f2f8101
Merge branch 'remove-persistent-alias-ckpt' into 'main'
ericharper Feb 11, 2025
54e1db0
ADLR/megatron-lm!2623 - [dist ckpt] Resolve todos in `_split_by_size_…
ananthsub Feb 11, 2025
79e9894
Merge branch 'ckpt-split-by-size-type' into 'main'
ko3n1g Feb 11, 2025
1878be3
ADLR/megatron-lm!2651 - Embedder fix for radio
Feb 11, 2025
bfd1840
Merge branch 'tpoon/radio_fix_mr' into 'main'
trintamaki Feb 11, 2025
77e3593
ADLR/megatron-lm!2670 - tests: test_builder
ko3n1g Feb 11, 2025
aa719a0
Merge branch 'ko3n1g/tests/data-builder' into 'main'
ko3n1g Feb 11, 2025
68b6119
ADLR/megatron-lm!2669 - Llava unit test fix
trintamaki Feb 11, 2025
55cdfc1
Merge branch 'trintamaki/llava-unit-test-fix' into 'main'
ko3n1g Feb 11, 2025
850ac6d
ADLR/megatron-lm!2667 - Warn instead of error when model_opt is enabl…
skierat Feb 12, 2025
eb7092e
Merge branch 'skierat/local_vs_model_opt' into 'main'
ko3n1g Feb 12, 2025
7e748bf
ADLR/megatron-lm!2633 - Reduce NCCL memory cost in UT
Victarry Feb 12, 2025
8ca9e57
Merge branch 'denliu/reduce_ut_memory' into 'main'
ko3n1g Feb 12, 2025
50d8475
ADLR/megatron-lm!2116 - Enabling UCC backend for PP communication
youngeunkwon0405 Feb 13, 2025
5b47af6
Merge branch 'ucc_work' into 'main'
jaredcasper Feb 13, 2025
f8ed25c
ADLR/megatron-lm!2632 - Fix for Frozen QK LayerNorm when training VLM…
parthmannan Feb 13, 2025
ac3884a
Merge branch 'pmannan/fix_qk_ln_freeze' into 'main'
ko3n1g Feb 13, 2025
09e76b9
ADLR/megatron-lm!2680 - chore: Bump version
ko3n1g Feb 13, 2025
5575cfc
Merge branch 'ko3n1g/chore/bump' into 'main'
ericharper Feb 13, 2025
7c2239a
ADLR/megatron-lm!2561 - Basic context and sequence parallel support i…
trintamaki Feb 13, 2025
78fc935
Merge branch 'trintamaki/vlm-example-cp-sp' into 'main'
trintamaki Feb 13, 2025
60007c9
ADLR/megatron-lm!2526 - Optimizer CPU offload support
shjwudp Feb 14, 2025
3364154
Merge branch 'optimizer_cpu_offload_poc' into 'main'
jaredcasper Feb 14, 2025
4052d61
ADLR/megatron-lm!2675 - Fix distributed checkpoint tests
skierat Feb 14, 2025
d6985c4
Merge branch 'skierat/fix_test_local' into 'main'
deepakn94 Feb 14, 2025
4d3d6b2
ADLR/megatron-lm!2641 - Statically allocate KV cache for MCore inference
santhnm2 Feb 15, 2025
6673956
Merge branch 'static_inference_params' into 'main'
jaredcasper Feb 15, 2025
4dc6b71
ADLR/megatron-lm!2659 - Various improvements to RerunStateMachine
Feb 15, 2025
9a496c9
Merge branch 'fix-backward-checkpoint' into 'main'
deepakn94 Feb 15, 2025
c1e71cc
ADLR/megatron-lm!2676 - Re-enable MoE flaky unit tests.
yanring Feb 17, 2025
fe7f28a
Merge branch 'zijiey/enable_moe_flaky_ut' into 'main'
ko3n1g Feb 17, 2025
81f3cd1
ADLR/megatron-lm!2679 - build: Guard NVRX
ko3n1g Feb 17, 2025
b06494c
Merge branch 'ko3n1g/build/guard-nvrx' into 'main'
ko3n1g Feb 17, 2025
a0430bf
ADLR/megatron-lm!2673 - Fix DDP over-param-gather issue when param or…
Victarry Feb 17, 2025
a0365bc
Merge branch 'denliu/fix_ddp_param_gather' into 'main'
ko3n1g Feb 17, 2025
34fa7b4
ADLR/megatron-lm!2664 - ci: Remove triton
ko3n1g Feb 17, 2025
7dfd00b
Merge branch 'ko3n1g/build/triton' into 'main'
ko3n1g Feb 17, 2025
b997545
ADLR/megatron-lm!2693 - ci: Read `package_info.py`
ko3n1g Feb 17, 2025
020cb6e
Merge branch 'ko3n1g/ci/package-info' into 'main'
ko3n1g Feb 17, 2025
96b1c07
ADLR/megatron-lm!2694 - docs: Add changelog
ko3n1g Feb 17, 2025
ae82b26
Merge branch 'ko3n1g/docs/changelog' into 'main'
ko3n1g Feb 17, 2025
86b157e
ADLR/megatron-lm!2682 - Guard against 'common_step'=None
maanug-nv Feb 18, 2025
677382e
Merge branch 'maanug/common-step-guard' into 'main'
ericharper Feb 18, 2025
a551421
ADLR/megatron-lm!2699 - ci: Set legacy suite
ko3n1g Feb 18, 2025
addeb0d
Merge branch 'ko3n1g/ci/legacy-suite' into 'main'
ko3n1g Feb 18, 2025
cbc9be6
ADLR/megatron-lm!2696 - ci: Fix release
ko3n1g Feb 19, 2025
3312b08
Merge branch 'ko3n1g/ci/fix-release' into 'main'
ko3n1g Feb 19, 2025
830a086
ADLR/megatron-lm!2697 - Fix the PP backend error in cpu-only case
youngeunkwon0405 Feb 19, 2025
61b2c4f
Merge branch 'fix_ucc_mr_for_cpu_only' into 'main'
ko3n1g Feb 19, 2025
c8780d5
ADLR/megatron-lm!2701 - Fix Distributed Checkpointing for Backward Co…
skierat Feb 19, 2025
e1586c2
Merge branch 'skierat/direct_args' into 'main'
ko3n1g Feb 19, 2025
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
5 changes: 5 additions & 0 deletions .gitignore
@@ -7,3 +7,8 @@ build
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err