Add MNIST test to run multi-node distributed training using KFTO #295

Conversation

abhijeet-dhumal (Contributor) commented Dec 17, 2024

Description

Added an MNIST test to run KFTO multi-node distributed training using NVIDIA (CUDA) and AMD (ROCm) GPUs
RHOAIENG-14540
RHOAIENG-14541

How Has This Been Tested?

export NOTEBOOK_IMAGE=quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73
export CODEFLARE_TEST_RAY_IMAGE=quay.io/rhoai/ray@sha256:b0e129cd2f4cdea7ad7a7859031357ffd9915410551f94fbcb942af2198cdf78   # CUDA image
export FMS_HF_TUNING_IMAGE=quay.io/modh/fms-hf-tuning@sha256:6f98907f9095db72932caa54094438eae742145f4b66c28d15887d5303ff1186

To run the test using AMD GPUs, use the ROCm image: quay.io/modh/ray@sha256:db667df1bc437a7b0965e8031e905d3ab04b86390d764d120e05ea5a5c18d1b4

go test -v ./tests/kfto -run TestPyTorchJobMnistCpu
go test -v ./tests/kfto -run TestPyTorchJobMnistWithROCm
go test -v ./tests/kfto -run TestPyTorchJobMnistWithCuda

Tested manually:

  • Using CPU: took approx. 5 minutes.
  • Using NVIDIA GPUs: took approx. 3 minutes.
  • Using AMD GPUs: took approx. 3 minutes.

Note: The first run usually takes longer because the training image has to be pulled.


Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

ChughShilpa (Contributor) commented Dec 18, 2024

Is there any specific reason not to use the existing hf_llm_training.py script, which is used in the single-node KFTO tests?

I think multi-node test coverage could be achieved by parameterizing the existing tests to run on multiple nodes.

abhijeet-dhumal (Contributor, Author) commented Dec 18, 2024

@ChughShilpa
Actually, there isn't any specific reason. When it came to multi-node training using KFTO, my first priority was to test the existing training-operator examples created for that purpose: https://github.com/opendatahub-io/training-operator/tree/dev/examples
The PyTorch-based MNIST example was already somewhat documented, and I wanted to explore the backend types used for torch.distributed.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft December 18, 2024 10:43
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch 2 times, most recently from 5f7ff0f to c095436 Compare December 19, 2024 09:32
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from c095436 to 9ae3ffd Compare December 19, 2024 09:34
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review December 19, 2024 09:34
abhijeet-dhumal (Contributor, Author) commented Dec 19, 2024

  • Updated the pip package installation logic to avoid an OSError (permission error) when writing the package cache to a read-only path
  • Updated the backend type to 'NCCL' when running MNIST training with CUDA/ROCm training images (see the sketch below)
  • Tested and verified using AMD and NVIDIA GPUs
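
For context, a minimal sketch of the backend selection described in the second bullet, assuming a standard torch.distributed setup; the actual code in tests/kfto/resources/mnist.py may be structured differently. (The cache error from the first bullet is typically avoided by passing --no-cache-dir to pip or pointing PIP_CACHE_DIR at a writable path such as /tmp; the exact flag used in this PR is not shown in this thread.)

# Illustrative sketch only -- not the exact code from tests/kfto/resources/mnist.py.
import torch
import torch.distributed as dist

def init_distributed():
    # NCCL is used when GPUs are available (CUDA images; ROCm builds of PyTorch
    # expose RCCL through the same "nccl" backend name). GLOO is the CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # The training operator injects RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    # into the PyTorchJob pods, so env:// initialization picks them up automatically.
    dist.init_process_group(backend=backend, init_method="env://")
    return backend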

kfto-mnist-vwcfj-master-0-pytorch.log
kfto-mnist-vwcfj-worker-0-pytorch (3).log


@abhijeet-dhumal abhijeet-dhumal requested review from astefanutti and removed request for jiripetrlik and Bobbins228 December 19, 2024 09:42
…s on different pods and to store the output model using a PersistentVolumeClaim with RWX access mode
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from a325059 to 4b92b1d Compare December 20, 2024 09:55
sutaakar (Contributor) left a comment

/lgtm

@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from 5674d65 to 1c36d60 Compare January 2, 2025 10:43
@openshift-ci openshift-ci bot removed the lgtm label Jan 2, 2025
ChughShilpa (Contributor) commented:

@abhijeet-dhumal I see you set the epochs to 1 in the test; IMO we should be using a higher number of epochs.

abhijeet-dhumal (Contributor, Author) commented Jan 2, 2025

@ChughShilpa
For the test, 1 epoch should be enough, right?
Increasing the number of epochs just increases the number of training iterations and the execution time; it won't affect the outcome of the test, since the base functionality is already being tested.
If it is necessary in any case, then I'm OK with increasing the epoch count 👍

ChughShilpa (Contributor) commented:

I think setting a higher number of epochs helps identify resource issues or memory leaks that may arise during prolonged training, and it trains the model in a more realistic scenario.

  1. We have been testing with at least 3 epochs in the other tests so far (a consistency point; that can be ignored, just my preference).
  2. The number of iterations per epoch depends on the batch size, not on the number of epochs (see the worked example below).
  3. Yes, more epochs increase the execution time, but since MNIST is lightweight it will not take much longer.
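
To make the iteration arithmetic in point 2 concrete, here is a small illustrative calculation; the batch size and the per-worker data split are assumed values, not numbers taken from this PR:

# Hypothetical numbers for illustration only; the actual batch size used by
# the test is not stated in this thread.
train_samples = 60_000        # size of the MNIST training set
batch_size = 128              # assumed batch size
worker_replicas = 2           # DistributedSampler shards the data across workers
epochs = 3

samples_per_worker = train_samples // worker_replicas            # 30_000
iterations_per_epoch = -(-samples_per_worker // batch_size)      # ceil division -> 235
total_iterations = iterations_per_epoch * epochs                 # 705 per worker
print(iterations_per_epoch, total_iterations)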

abhijeet-dhumal (Contributor, Author) commented Jan 3, 2025

@ChughShilpa @sutaakar I have tested all the cases below:

Training with:

  1. workerReplicas-1, epoch-1 --> 2m 28s
  2. workerReplicas-2, epoch-1 --> 2m 49s
  3. workerReplicas-2, epoch-2 --> 3m 5s

I will update the test accordingly. Thanks!


sutaakar (Contributor) commented Jan 3, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 3, 2025
abhijeet-dhumal (Contributor, Author) commented:

@sutaakar @ChughShilpa

CPU (backend type: GLOO), workerReplicas: 2, epochs: 3
TestPyTorchJobMnistCpu - 3m 2s

GPU (backend type: NCCL), workerReplicas: 1, epochs: 3
TestPyTorchJobMnistWithROCm - 3m 41s
TestPyTorchJobMnistWithCuda - 2m 27s

Note: When the training image has to be pulled first, the run takes approx. 10-12 minutes; in one such run it took 14m 58s to complete the TestPyTorchJobMnistWithCuda test.


…to resolve an fsspec/numpy package compatibility issue and add a license to the MNIST script
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from afdf49b to 7dbfdfc Compare January 3, 2025 10:54
@openshift-ci openshift-ci bot removed the lgtm label Jan 3, 2025
ChughShilpa (Contributor) left a comment

/lgtm


openshift-ci bot commented Jan 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChughShilpa, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jan 3, 2025
@abhijeet-dhumal abhijeet-dhumal removed the request for review from astefanutti January 3, 2025 12:04
@openshift-merge-bot openshift-merge-bot bot merged commit a5a6d39 into opendatahub-io:main Jan 3, 2025
2 checks passed