Actions: microsoft/DeepSpeed

nv-torch-latest-v100

5,085 workflow runs

Inference ops unit test failures/fixes
nv-torch-latest-v100 #12666: Pull request #6879 synchronize by loadams
December 17, 2024 18:00 28m 49s loadams/inference-ops-test-repro
[inf] Add config var to enable keeping module on host
nv-torch-latest-v100 #12664: Pull request #6846 synchronize by oelayan7
December 17, 2024 07:46 6h 0m 26s oelayan7:keep_module_on_host
[inf] Add config var to enable keeping module on host
nv-torch-latest-v100 #12663: Pull request #6846 synchronize by oelayan7
December 17, 2024 07:39 Action required oelayan7:keep_module_on_host
Fix error caused by all_reduce call in domino
nv-torch-latest-v100 #12662: Pull request #6880 synchronize by hwchen2017
December 17, 2024 01:46 2h 8m 10s hongwei/fix_domino_allreduce
Add arctic model support by adding w2 to all_reduce
nv-torch-latest-v100 #12661: Pull request #6856 synchronize by tjruwase
December 17, 2024 01:35 1h 52m 27s pi314ever:arctic-enabling-upstream
nv-torch-latest-v100
nv-torch-latest-v100 #12659: Scheduled
December 17, 2024 00:22 2h 8m 5s master
Fix checkpointable_layers Logic
nv-torch-latest-v100 #12658: Pull request #6881 opened by Quentin-Anthony
December 17, 2024 00:11 1h 19m 28s Quentin-Anthony:qanthony/fix-act-recomp
Fix error caused by all_reduce call in domino
nv-torch-latest-v100 #12657: Pull request #6880 synchronize by hwchen2017
December 16, 2024 23:50 1h 57m 0s hongwei/fix_domino_allreduce
Fix error caused by all_reduce call in domino
nv-torch-latest-v100 #12656: Pull request #6880 opened by hwchen2017
December 16, 2024 23:47 2m 45s hongwei/fix_domino_allreduce
Inference ops unit test failures/fixes
nv-torch-latest-v100 #12655: Pull request #6879 opened by loadams
December 16, 2024 23:08 43m 32s loadams/inference-ops-test-repro
Zero2: avoid graph breaks in torch.compile by using param_idx
nv-torch-latest-v100 #12654: Pull request #6803 synchronize by loadams
December 16, 2024 22:52 1h 17m 53s nelyahu:zero2_param_idx
Fix --enable_each_rank_log when used with PDSH multi-node runner
nv-torch-latest-v100 #12653: Pull request #6863 synchronize by loadams
December 16, 2024 22:49 1h 40m 38s akeshet:akeshet/pdsh_rank_log
Add the missing view operations from sequence parallel(async).
nv-torch-latest-v100 #12652: Pull request #6750 synchronize by loadams
December 16, 2024 22:49 6h 4m 29s inkcherry:ds_overlap_fix
Zero2: avoid graph breaks in torch.compile by using param_idx
nv-torch-latest-v100 #12651: Pull request #6803 synchronize by loadams
December 16, 2024 22:15 5m 59s nelyahu:zero2_param_idx
Fix --enable_each_rank_log when used with PDSH multi-node runner
nv-torch-latest-v100 #12650: Pull request #6863 synchronize by loadams
December 16, 2024 21:28 1h 20m 36s akeshet:akeshet/pdsh_rank_log
Fix: forbid repeated deepspeed.initialize on training objects
nv-torch-latest-v100 #12649: Pull request #6874 synchronize by traincheck-team
December 16, 2024 21:02 Action required traincheck-team:fix-6848-forbid-repeated-init
Fix: forbid repeated deepspeed.initialize on training objects
nv-torch-latest-v100 #12648: Pull request #6874 synchronize by traincheck-team
December 16, 2024 20:59 Action required traincheck-team:fix-6848-forbid-repeated-init
Support pure meta model lm_head tp
nv-torch-latest-v100 #12647: Pull request #6812 synchronize by loadams
December 16, 2024 19:34 1h 32m 29s Yejing-Lai:lyj/lm_head_replace
Add MLP/lm_head tp grain size setting.
nv-torch-latest-v100 #12646: Pull request #6828 synchronize by loadams
December 16, 2024 19:33 1h 30m 11s Yejing-Lai:lyj/tp_grain_size
Add the missing view operations from sequence parallel(async).
nv-torch-latest-v100 #12645: Pull request #6750 synchronize by loadams
December 16, 2024 19:33 2h 1m 6s inkcherry:ds_overlap_fix
Fix --enable_each_rank_log when used with PDSH multi-node runner
nv-torch-latest-v100 #12644: Pull request #6863 synchronize by loadams
December 16, 2024 19:06 2h 8m 6s akeshet:akeshet/pdsh_rank_log