
Update KFTO MNIST multi-node/multi-gpu test to utilise multiple GPUs c… #301

Open · wants to merge 2 commits into base: main
Conversation

@abhijeet-dhumal (Contributor) commented on Jan 10, 2025

Description

  • Updated the training script to properly utilise the multi-node/multi-GPU scenario

How Has This Been Tested?



The following tests were executed on a cluster with NVIDIA GPUs:

  • TestPyTorchJobMnistMultiNodeSingleCpu - 3m 33s (time taken to execute)
  • TestPyTorchJobMnistMultiNodeMultiCpu - 2m 36s
  • TestPyTorchJobMnistMultiNodeSingleGpuWithCuda - 2m 35s
  • TestPyTorchJobMnistMultiNodeMultiGpuWithCuda - 2m 16s


The following tests were executed on a cluster with AMD GPUs:

  • TestPyTorchJobMnistMultiNodeSingleCpu - 3m 45s
  • TestPyTorchJobMnistMultiNodeMultiCpu - 2m 42s
  • TestPyTorchJobMnistMultiNodeSingleGpuWithROCm - 4m 18s
  • TestPyTorchJobMnistMultiNodeMultiGpuWithROCm - 3m 8s


Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@abhijeet-dhumal (Contributor Author)

Multi-Node / Multi-GPU scenario verified:


@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from bd6930b to 24405d8 on January 10, 2025 at 13:58
@abhijeet-dhumal marked this pull request as ready for review on January 10, 2025 at 13:59
@abhijeet-dhumal requested review from ChughShilpa and removed request for KPostOffice and varshaprasad96 on January 10, 2025 at 13:59
Two review threads on tests/kfto/kfto_mnist_training_test.go (outdated, resolved)
}

- func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLabel string, image string, requirementsFile string) {
+ func runKFTOPyTorchMnistJob(t *testing.T, totalNumGpus int, workerReplicas int, numCPUsOrGPUsCountPerNode int, gpuLabel string, image string, requirementsFile string) {
Contributor

IMHO it would make more sense to rename numCPUsOrGPUsCountPerNode to numProcPerNode and keep the CPU number hardcoded.
numCPUsOrGPUsCountPerNode looks confusing to me, as it is not clear what it represents.

Contributor Author

@abhijeet-dhumal Jan 13, 2025

Actually, by this variable I meant the number of devices (GPUs/CPUs) to be utilised per cluster node, but I agree the wording was quite confusing 😅
With this approach I wanted to add test coverage for the multi-node use cases:

  1. single CPU/GPU per node
  2. multiple CPUs/GPUs per node

This is similar to the torchrun command's --nproc_per_node argument, which specifies the number of devices to be utilised per node, whether that count refers to CPUs or GPUs.
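For illustration, a minimal sketch of how that per-node process count surfaces inside the training script; it assumes only the standard environment variables that torchrun injects into each process (RANK, LOCAL_RANK, WORLD_SIZE), nothing specific to this PR:

```python
import os
import torch.distributed as dist

# torchrun launches --nproc_per_node processes on every node and sets these
# variables for each of them; WORLD_SIZE equals nnodes * nproc_per_node.
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # index of this process on its node
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

dist.init_process_group(backend="gloo")
print(f"rank {rank}/{world_size} (local_rank {local_rank}) initialised")
dist.destroy_process_group()
```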

@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from 24405d8 to b19447b on January 13, 2025 at 08:55
@astefanutti (Contributor)

/lgtm

Great work!

Contributor

@astefanutti left a comment

Some leftover comments :)

backend="gloo"
dist.init_process_group(backend=backend)
# Wait for all nodes to join
dist.barrier()
Contributor

Is it really necessary? I don't see it always used in other examples.

Contributor Author

@abhijeet-dhumal Jan 13, 2025

Yeah right. Initially I saw that the master pod completed its execution even though the worker pods were still running, which might have been caused by a bad setup of the distributed environment, so I read about this barrier function.

[rank1]: RuntimeError: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.

This is needed to keep all nodes in sync: if the master pod is ready but a worker pod is still installing requirements, the master pod will wait until all worker nodes have reached that point (in this case, until all nodes have initialised the distributed backend process). Processes block ("wait") at the barrier until every process has reached it.

Contributor

Thanks, as you say this may be needed because of the pip install command we run that delays when the node is able to join. Otherwise this should not be needed.

Excerpted from the init_process_group code, which gives some hints:

# barrier at the end to ensure that once we return from this method, all
# process groups including global variables (if any) are updated
# correctly on all ranks.
# Update 04/2023: for large-scale runs, this barrier (esp. store-based
# barrier) may be costly and/or unscalable. Also, in a lot of cases,
# these barriers may be unnecessary, as proven by a green CI after
# removal. An environment variable `TORCH_DIST_INIT_BARRIER` has been
# added which enables this barrier only when set to 1.
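Purely for illustration, a minimal sketch contrasting the two ways to get that post-init synchronisation; the environment-variable route is the one the excerpt above mentions, and whether it is preferable here is an open question rather than something this PR decides:

```python
import torch.distributed as dist

# Option A (what the script does today): initialise, then block every rank
# until all ranks have finished initialising -- slow joiners such as pods
# still running pip install are waited for here.
dist.init_process_group(backend="gloo")
dist.barrier()

# Option B (per the excerpt above): let init_process_group run the barrier
# itself by exporting TORCH_DIST_INIT_BARRIER=1 before the script starts,
# e.g. in the PyTorchJob pod spec, and drop the explicit dist.barrier().
```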

Two review threads on tests/kfto/resources/mnist.py (outdated, resolved)
@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from b19447b to e7edc7b on January 13, 2025 at 11:51
Contributor

@sutaakar left a comment

/lgtm
good job

@sutaakar (Contributor)

@abhijeet-dhumal I ran TestPyTorchJobMnistMultiNodeMultiGpuWithROCm and got this error:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/files/mnist.py", line 175, in <module>
[rank2]:     main(
[rank2]:   File "/mnt/files/mnist.py", line 157, in main
[rank2]:     dataset, model, optimizer = load_train_objs(lr)
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/mnt/files/mnist.py", line 135, in load_train_objs
[rank2]:     train_set = torchvision.datasets.MNIST("../data",
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 100, in __init__
[rank2]:     self.download()
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 188, in download
[rank2]:     download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 395, in download_and_extract_archive
[rank2]:     download_url(url, download_root, filename, md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 143, in download_url
[rank2]:     raise RuntimeError("File not found or corrupted.")
[rank2]: RuntimeError: File not found or corrupted.

I guess the issue is caused by concurrent downloading of the dataset?

@abhijeet-dhumal (Contributor Author)

(quoting the TestPyTorchJobMnistMultiNodeMultiGpuWithROCm traceback above, ending in "RuntimeError: File not found or corrupted.")

I couldn't reliably reproduce this error but have seen it before. I think you're right: each process is trying to download the dataset concurrently. This can be avoided by ensuring that only one process downloads the dataset, for example the rank 0 process, while all other processes wait until the download is complete. 🤔
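A minimal sketch of that idea; it assumes the data directory is node-local, so the download is gated on the node-local rank 0 while the other ranks wait at a barrier (the helper name and the LOCAL_RANK handling are illustrative, not taken from the PR):

```python
import os
import torch.distributed as dist
import torchvision

def load_mnist(data_dir: str = "../data"):
    # Assumes dist.init_process_group() has already been called.
    # Only one process per node downloads; the others wait at the barrier
    # and then read the files that process has already written to data_dir.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank == 0:
        torchvision.datasets.MNIST(data_dir, train=True, download=True)
    dist.barrier()  # block until the download on every node has finished
    return torchvision.datasets.MNIST(data_dir, train=True, download=False)
```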

@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from e7edc7b to c3f99cf on January 17, 2025 at 10:52
@openshift-ci bot removed the lgtm label on Jan 17, 2025

openshift-ci bot commented Jan 17, 2025

New changes are detected. LGTM label has been removed.


openshift-ci bot commented Jan 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from sutaakar. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -36,6 +36,7 @@ type Gpu struct {
var (
NVIDIA = Gpu{ResourceLabel: "nvidia.com/gpu", PrometheusGpuUtilizationLabel: "DCGM_FI_DEV_GPU_UTIL"}
AMD = Gpu{ResourceLabel: "amd.com/gpu"}
CPU = Gpu{ResourceLabel: "cpu"}
Contributor

AFAIK you don't have to provide ResourceLabel. For CPU it is not used, right?

Contributor

I'm thinking whether Gpu should be renamed to Accelerator to describe its purpose better, as CPU is now included.

Contributor

It could have an isGpu function used to detect whether a test is a GPU test.

Contributor

So you can then replace gpu.ResourceLabel != "cpu" with accelerator.isGpu()

@@ -59,7 +63,7 @@ func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLa
mnist := ReadFile(test, "resources/mnist.py")
requirementsFileName := ReadFile(test, requirementsFile)

- if numGpus > 0 {
+ if workerReplicas*numProcPerNode > 0 && gpu.ResourceLabel != "cpu" {
Contributor

Is it needed to check workerReplicas*numProcPerNode?

pip install --no-cache-dir -r /mnt/files/requirements.txt --target=/tmp/lib && \
python /mnt/files/mnist.py --epochs 3 --save-model --output-path /mnt/output --backend %s`, backend),
echo "Downloading MNIST dataset..." && \
Contributor

Nitpick - have you considered using an init container for downloading the dataset?
Downloading the dataset and running the tests can be seen as two phases.
