Add MNIST test to run multi-node distributed training using KFTO #295

Conversation

abhijeet-dhumal (Contributor) commented Dec 17, 2024

Description

Added an MNIST test to run KFTO multi-node distributed training using NVIDIA (CUDA) and AMD (ROCm) GPUs
RHOAIENG-14540
RHOAIENG-14541

How Has This Been Tested?

export NOTEBOOK_IMAGE=quay.io/modh/odh-generic-data-science-notebook@sha256:7c1a4ca213b71d342a2d1366171304e469da06d5f15710fab5dd3ce013aa1b73
export CODEFLARE_TEST_RAY_IMAGE=quay.io/rhoai/ray@sha256:b0e129cd2f4cdea7ad7a7859031357ffd9915410551f94fbcb942af2198cdf78   # CUDA image
export FMS_HF_TUNING_IMAGE=quay.io/modh/fms-hf-tuning@sha256:6f98907f9095db72932caa54094438eae742145f4b66c28d15887d5303ff1186

To run the test using AMD GPUs, use the ROCm image: quay.io/modh/ray@sha256:db667df1bc437a7b0965e8031e905d3ab04b86390d764d120e05ea5a5c18d1b4

go test -v ./tests/kfto -run TestPyTorchJobMnistCpu
go test -v ./tests/kfto -run TestPyTorchJobMnistWithROCm
go test -v ./tests/kfto -run TestPyTorchJobMnistWithCuda

Tested manually:

  • Using CPU: took approx. 5 minutes.
  • Using NVIDIA GPUs: took approx. 3 minutes.
  • Using AMD GPUs: took approx. 3 minutes.

Note: The first run usually takes longer because the training image has to be pulled.


Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

ChughShilpa (Contributor) commented Dec 18, 2024

Is there any specific reason not to use the existing hf_llm_training.py script, which is used in the single-node KFTO tests?

I think multi-node test coverage could be achieved by parameterizing the existing tests to run on multiple nodes.

abhijeet-dhumal (Contributor, Author) commented Dec 18, 2024

@ChughShilpa
Actually, there isn't any specific reason. When it came to multi-node training using KFTO, my first priority was to test the existing training-operator examples created for that purpose: https://github.com/opendatahub-io/training-operator/tree/dev/examples
The PyTorch-based MNIST example was already somewhat documented, and I wanted to explore the backend types used for torch.distributed.

@abhijeet-dhumal abhijeet-dhumal marked this pull request as draft December 18, 2024 10:43
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch 2 times, most recently from 5f7ff0f to c095436 Compare December 19, 2024 09:32
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from c095436 to 9ae3ffd Compare December 19, 2024 09:34
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review December 19, 2024 09:34
abhijeet-dhumal (Contributor, Author) commented Dec 19, 2024

  • Updated the pip package installation logic to avoid an OSError (permission error) when writing the package cache to a read-only path
  • Updated the backend type to 'NCCL' when running MNIST training with CUDA/ROCm training images (see the sketch below)
  • Tested and verified using AMD and NVIDIA GPUs
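
For context, a minimal sketch of the backend selection described in the second bullet, assuming a standard torch.distributed setup; the actual code in tests/kfto/resources/mnist.py may be structured differently. (The cache error from the first bullet is typically avoided by passing --no-cache-dir to pip or pointing PIP_CACHE_DIR at a writable path such as /tmp; the exact flag used in this PR is not shown in this thread.)

# Illustrative sketch only -- not the exact code from tests/kfto/resources/mnist.py.
import torch
import torch.distributed as dist

def init_distributed():
    # NCCL is used when GPUs are available (CUDA images; ROCm builds of PyTorch
    # expose RCCL through the same "nccl" backend name). GLOO is the CPU fallback.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    # The training operator injects RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
    # into the PyTorchJob pods, so env:// initialization picks them up automatically.
    dist.init_process_group(backend=backend, init_method="env://")
    return backend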

kfto-mnist-vwcfj-master-0-pytorch.log
kfto-mnist-vwcfj-worker-0-pytorch (3).log


@abhijeet-dhumal abhijeet-dhumal requested review from astefanutti and removed request for jiripetrlik and Bobbins228 December 19, 2024 09:42
…s on different pods and to store the output model using a PersistentVolumeClaim with RWX access mode
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from a325059 to 4b92b1d Compare December 20, 2024 09:55
sutaakar (Contributor) left a comment

/lgtm

@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from 5674d65 to 1c36d60 Compare January 2, 2025 10:43
@openshift-ci openshift-ci bot removed the lgtm label Jan 2, 2025
ChughShilpa (Contributor) commented:

@abhijeet-dhumal I see you set the epochs to 1 in the test; IMO we should be using a higher number of epochs.

abhijeet-dhumal (Contributor, Author) commented Jan 2, 2025

@ChughShilpa
For the test, 1 epoch should be enough, right?
Increasing the number of epochs just increases the number of training iterations and the execution time; it won't affect the outcome of the test, since the base functionality is already being tested.
If it is necessary in any case, then I'm OK with increasing the epoch count 👍

ChughShilpa (Contributor) commented:

I think setting a higher number of epochs helps identify resource issues or memory leaks that may arise during prolonged training, and it trains the model in a more realistic scenario.

  1. We have been testing with at least 3 epochs in the other tests so far (a consistency point; that can be ignored, just my preference).
  2. The number of iterations per epoch depends on the batch size, not on the number of epochs (see the worked example below).
  3. Yes, more epochs increase the execution time, but since MNIST is lightweight it will not take much longer.
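
To make the iteration arithmetic in point 2 concrete, here is a small illustrative calculation; the batch size and the per-worker data split are assumed values, not numbers taken from this PR:

# Hypothetical numbers for illustration only; the actual batch size used by
# the test is not stated in this thread.
train_samples = 60_000        # size of the MNIST training set
batch_size = 128              # assumed batch size
worker_replicas = 2           # DistributedSampler shards the data across workers
epochs = 3

samples_per_worker = train_samples // worker_replicas            # 30_000
iterations_per_epoch = -(-samples_per_worker // batch_size)      # ceil division -> 235
total_iterations = iterations_per_epoch * epochs                 # 705 per worker
print(iterations_per_epoch, total_iterations)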

abhijeet-dhumal (Contributor, Author) commented Jan 3, 2025

@ChughShilpa @sutaakar I have tested all the cases below:

Training with:

  1. workerReplicas-1, epoch-1 --> 2m 28s
  2. workerReplicas-2, epoch-1 --> 2m 49s
  3. workerReplicas-2, epoch-2 --> 3m 5s

I will update the test accordingly. Thanks!


sutaakar (Contributor) commented Jan 3, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jan 3, 2025
abhijeet-dhumal (Contributor, Author) commented:

@sutaakar @ChughShilpa

CPU (backend type: GLOO), workerReplicas: 2, epochs: 3
TestPyTorchJobMnistCpu - 3m 2s

GPU (backend type: NCCL), workerReplicas: 1, epochs: 3
TestPyTorchJobMnistWithROCm - 3m 41s
TestPyTorchJobMnistWithCuda - 2m 27s

Note: When the training image has to be pulled first, the run takes approx. 10-12 minutes; in one such run it took 14m 58s to complete the TestPyTorchJobMnistWithCuda test.


…to resolve an fsspec/numpy package compatibility issue and add a license to the MNIST script
@abhijeet-dhumal abhijeet-dhumal force-pushed the add-kfto-multinode-tests branch from afdf49b to 7dbfdfc Compare January 3, 2025 10:54
@openshift-ci openshift-ci bot removed the lgtm label Jan 3, 2025
ChughShilpa (Contributor) left a comment

/lgtm


openshift-ci bot commented Jan 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ChughShilpa, sutaakar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jan 3, 2025
@abhijeet-dhumal abhijeet-dhumal removed the request for review from astefanutti January 3, 2025 12:04
@openshift-merge-bot openshift-merge-bot bot merged commit a5a6d39 into opendatahub-io:main Jan 3, 2025
2 checks passed