
Update KFTO MNIST multi-node/multi-gpu test to utilise multiple GPUs c… #301

Open · wants to merge 2 commits into base: main
Conversation

@abhijeet-dhumal (Contributor) commented on Jan 10, 2025

Description

  • Updated the training script to properly utilise the multi-node/multi-GPU scenario

How Has This Been Tested?



The following tests were executed on a cluster with NVIDIA GPUs:

  • TestPyTorchJobMnistMultiNodeSingleCpu - 3m 33s (time taken to execute)
  • TestPyTorchJobMnistMultiNodeMultiCpu - 2m 36s
  • TestPyTorchJobMnistMultiNodeSingleGpuWithCuda - 2m 35s
  • TestPyTorchJobMnistMultiNodeMultiGpuWithCuda - 2m 16s


The following tests were executed on a cluster with AMD GPUs:

  • TestPyTorchJobMnistMultiNodeSingleCpu - 3m 45s
  • TestPyTorchJobMnistMultiNodeMultiCpu - 2m 42s
  • TestPyTorchJobMnistMultiNodeSingleGpuWithROCm - 4m 18s
  • TestPyTorchJobMnistMultiNodeMultiGpuWithROCm - 3m 8s


Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@abhijeet-dhumal (Contributor Author)

Multi-Node / Multi-GPU scenario verified:


@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from bd6930b to 24405d8 on January 10, 2025 at 13:58
@abhijeet-dhumal marked this pull request as ready for review on January 10, 2025 at 13:59
@abhijeet-dhumal requested review from ChughShilpa and removed request for KPostOffice and varshaprasad96 on January 10, 2025 at 13:59
Two review threads on tests/kfto/kfto_mnist_training_test.go (outdated, resolved)
}

- func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLabel string, image string, requirementsFile string) {
+ func runKFTOPyTorchMnistJob(t *testing.T, totalNumGpus int, workerReplicas int, numCPUsOrGPUsCountPerNode int, gpuLabel string, image string, requirementsFile string) {
Contributor

IMHO it would make more sense to rename numCPUsOrGPUsCountPerNode to numProcPerNode and keep the CPU number hardcoded.
numCPUsOrGPUsCountPerNode looks confusing to me, as it is not clear what it represents.

Contributor Author

@abhijeet-dhumal Jan 13, 2025

Actually, by this variable I meant the number of devices (GPUs/CPUs) to be utilised per cluster node, but I agree the wording was quite confusing 😅
With this approach I wanted to add test coverage for the multi-node use cases:

  1. single CPU/GPU per node
  2. multiple CPUs/GPUs per node

This is similar to the torchrun command's --nproc_per_node argument, which specifies the number of devices to be utilised per node, whether that count refers to CPUs or GPUs.
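For illustration, a minimal sketch of how that per-node process count surfaces inside the training script; it assumes only the standard environment variables that torchrun injects into each process (RANK, LOCAL_RANK, WORLD_SIZE), nothing specific to this PR:

```python
import os
import torch.distributed as dist

# torchrun launches --nproc_per_node processes on every node and sets these
# variables for each of them; WORLD_SIZE equals nnodes * nproc_per_node.
rank = int(os.environ["RANK"])              # global rank across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # index of this process on its node
world_size = int(os.environ["WORLD_SIZE"])  # total number of processes

dist.init_process_group(backend="gloo")
print(f"rank {rank}/{world_size} (local_rank {local_rank}) initialised")
dist.destroy_process_group()
```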

@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from 24405d8 to b19447b on January 13, 2025 at 08:55
@astefanutti (Contributor)

/lgtm

Great work!

Contributor

@astefanutti left a comment

Some leftover comments :)

backend="gloo"
dist.init_process_group(backend=backend)
# Wait for all nodes to join
dist.barrier()
Contributor

Is it really necessary? I don't see it always used in other examples.

Contributor Author

@abhijeet-dhumal Jan 13, 2025

Yeah right. Initially I saw that the master pod completed its execution even though the worker pods were still running, which might have been caused by a bad setup of the distributed environment, so I read about this barrier function.

[rank1]: RuntimeError: Rank 1 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank.

This is needed to keep all nodes in sync: if the master pod is ready but a worker pod is still installing requirements, the master pod will wait until all worker nodes have reached that point (in this case, until all nodes have initialised the distributed backend process). Processes block ("wait") at the barrier until every process has reached it.

Contributor

Thanks, as you say this may be needed because of the pip install command we run that delays when the node is able to join. Otherwise this should not be needed.

Excerpted from the init_process_group code, which gives some hints:

# barrier at the end to ensure that once we return from this method, all
# process groups including global variables (if any) are updated
# correctly on all ranks.
# Update 04/2023: for large-scale runs, this barrier (esp. store-based
# barrier) may be costly and/or unscalable. Also, in a lot of cases,
# these barriers may be unnecessary, as proven by a green CI after
# removal. An environment variable `TORCH_DIST_INIT_BARRIER` has been
# added which enables this barrier only when set to 1.
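Purely for illustration, a minimal sketch contrasting the two ways to get that post-init synchronisation; the environment-variable route is the one the excerpt above mentions, and whether it is preferable here is an open question rather than something this PR decides:

```python
import torch.distributed as dist

# Option A (what the script does today): initialise, then block every rank
# until all ranks have finished initialising -- slow joiners such as pods
# still running pip install are waited for here.
dist.init_process_group(backend="gloo")
dist.barrier()

# Option B (per the excerpt above): let init_process_group run the barrier
# itself by exporting TORCH_DIST_INIT_BARRIER=1 before the script starts,
# e.g. in the PyTorchJob pod spec, and drop the explicit dist.barrier().
```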

Two review threads on tests/kfto/resources/mnist.py (outdated, resolved)
@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from b19447b to e7edc7b on January 13, 2025 at 11:51
Contributor

@sutaakar left a comment

/lgtm
good job

@sutaakar (Contributor)

@abhijeet-dhumal I ran TestPyTorchJobMnistMultiNodeMultiGpuWithROCm and got this error:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/mnt/files/mnist.py", line 175, in <module>
[rank2]:     main(
[rank2]:   File "/mnt/files/mnist.py", line 157, in main
[rank2]:     dataset, model, optimizer = load_train_objs(lr)
[rank2]:                                 ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/mnt/files/mnist.py", line 135, in load_train_objs
[rank2]:     train_set = torchvision.datasets.MNIST("../data",
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 100, in __init__
[rank2]:     self.download()
[rank2]:   File "/tmp/lib/torchvision/datasets/mnist.py", line 188, in download
[rank2]:     download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 395, in download_and_extract_archive
[rank2]:     download_url(url, download_root, filename, md5)
[rank2]:   File "/tmp/lib/torchvision/datasets/utils.py", line 143, in download_url
[rank2]:     raise RuntimeError("File not found or corrupted.")
[rank2]: RuntimeError: File not found or corrupted.

I guess the issue is caused by concurrent downloading of the dataset?

@abhijeet-dhumal (Contributor Author)

(quoting the TestPyTorchJobMnistMultiNodeMultiGpuWithROCm traceback above, ending in "RuntimeError: File not found or corrupted.")

I couldn't reliably reproduce this error but have seen it before. I think you're right: each process is trying to download the dataset concurrently. This can be avoided by ensuring that only one process downloads the dataset, for example the rank 0 process, while all other processes wait until the download is complete. 🤔
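A minimal sketch of that idea; it assumes the data directory is node-local, so the download is gated on the node-local rank 0 while the other ranks wait at a barrier (the helper name and the LOCAL_RANK handling are illustrative, not taken from the PR):

```python
import os
import torch.distributed as dist
import torchvision

def load_mnist(data_dir: str = "../data"):
    # Assumes dist.init_process_group() has already been called.
    # Only one process per node downloads; the others wait at the barrier
    # and then read the files that process has already written to data_dir.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank == 0:
        torchvision.datasets.MNIST(data_dir, train=True, download=True)
    dist.barrier()  # block until the download on every node has finished
    return torchvision.datasets.MNIST(data_dir, train=True, download=False)
```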

@abhijeet-dhumal force-pushed the update-kfto-multinode-multigpu-test branch from e7edc7b to c3f99cf on January 17, 2025 at 10:52
@openshift-ci bot removed the lgtm label on Jan 17, 2025

openshift-ci bot commented Jan 17, 2025

New changes are detected. LGTM label has been removed.


openshift-ci bot commented Jan 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from sutaakar. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -36,6 +36,7 @@ type Gpu struct {
var (
NVIDIA = Gpu{ResourceLabel: "nvidia.com/gpu", PrometheusGpuUtilizationLabel: "DCGM_FI_DEV_GPU_UTIL"}
AMD = Gpu{ResourceLabel: "amd.com/gpu"}
CPU = Gpu{ResourceLabel: "cpu"}
Contributor

AFAIK you don't have to provide ResourceLabel. For CPU it is not used, right?

Contributor

I'm thinking whether Gpu should be renamed to Accelerator to describe its purpose better, as CPU is now included.

Contributor

It could have an isGpu function used to detect whether a test is a GPU test.

Contributor

So you can then replace gpu.ResourceLabel != "cpu" with accelerator.isGpu()

@@ -59,7 +63,7 @@ func runKFTOPyTorchMnistJob(t *testing.T, numGpus int, workerReplicas int, gpuLa
mnist := ReadFile(test, "resources/mnist.py")
requirementsFileName := ReadFile(test, requirementsFile)

- if numGpus > 0 {
+ if workerReplicas*numProcPerNode > 0 && gpu.ResourceLabel != "cpu" {
Contributor

Is it needed to check workerReplicas*numProcPerNode?

pip install --no-cache-dir -r /mnt/files/requirements.txt --target=/tmp/lib && \
python /mnt/files/mnist.py --epochs 3 --save-model --output-path /mnt/output --backend %s`, backend),
echo "Downloading MNIST dataset..." && \
Contributor

Nitpick - have you considered using an init container for downloading the dataset?
Downloading the dataset and running the tests can be seen as two phases.
