Add KFTO Training tests which run with CPUs only #292
Command: []string{"/bin/sh", "-c"},
Args: []string{"mkdir /tmp/all_datasets; cp -r /dataset/* /tmp/all_datasets;ls /tmp/all_datasets"},
Args: []string{"mkdir /tmp/all_datasets; cp -r /dataset/$(DATASET_SIZE) /tmp/all_datasets/alpaca_data.json"},
IMHO you can use the `datasetSize` property here directly.
Yes, updated the code to use the `datasetSize` parameter directly.
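A minimal sketch of what the suggestion could look like, assuming `datasetSize` is a Go string that holds the dataset file name under `/dataset`. The helper name `buildCopyArgs` is hypothetical and is not taken from the PR; it only shows the value being interpolated in Go instead of being expanded from the `$(DATASET_SIZE)` environment variable.

```go
package main

import "fmt"

// buildCopyArgs is a hypothetical helper: it builds the container Args
// by inserting datasetSize directly into the shell command, so no
// DATASET_SIZE environment variable is needed for expansion.
func buildCopyArgs(datasetSize string) []string {
	return []string{fmt.Sprintf(
		"mkdir /tmp/all_datasets; cp -r /dataset/%s /tmp/all_datasets/alpaca_data.json",
		datasetSize,
	)}
}

func main() {
	// The file name below appears in the diff; using it here is illustrative.
	fmt.Println(buildCopyArgs("alpaca_data_hundredth.json")[0])
}
```

The returned slice could then be assigned to the container's `Args` field in place of the string that relies on `$(DATASET_SIZE)`.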
func TestPyTorchJobWithCuda(t *testing.T) {
runKFTOPyTorchJob(t, GetCudaTrainingImage(), "nvidia.com/gpu", 1)
func TestPyTorchJobWithCudaGpu(t *testing.T) {
runKFTOPyTorchJob(t, GetCudaTrainingImage(), "nvidia.com/gpu", "alpaca_data_hundredth.json", 1, 2, "8Gi")
It may be better to create a `ResourceList` for every test case separately and pass it as one parameter.
This way it is easier to see which resources are used for which test case.
Added ResourceList struct and used the list separately in each test case
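The idea can be sketched roughly as follows. This is not the code from the PR: `ResourceList` here is a simplified local stand-in for the Kubernetes `corev1.ResourceList` type, `describeJob` is a hypothetical placeholder for the real test helper, and the resource values are illustrative assumptions rather than the values the tests use.

```go
package main

import "fmt"

// ResourceList is a simplified, hypothetical stand-in for
// corev1.ResourceList: it maps a resource name to a quantity string.
type ResourceList map[string]string

// describeJob is a hypothetical placeholder for the test helper. It takes
// the whole resource list as one parameter instead of separate positional
// arguments for GPU count, CPU count and memory.
func describeJob(image, dataset string, resources ResourceList) string {
	return fmt.Sprintf("image=%s dataset=%s cpu=%s memory=%s gpu=%s",
		image, dataset,
		resources["cpu"], resources["memory"], resources["nvidia.com/gpu"])
}

func main() {
	// Each test case declares its own resources next to the call,
	// so the reader sees at a glance what the case requests.
	gpuResources := ResourceList{
		"cpu":            "2",   // assumed value
		"memory":         "8Gi", // assumed value
		"nvidia.com/gpu": "1",   // assumed value
	}
	fmt.Println(describeJob("cuda-training-image", "alpaca_data_hundredth.json", gpuResources))
}
```

Compared with a call such as `runKFTOPyTorchJob(t, image, "nvidia.com/gpu", dataset, 1, 2, "8Gi")`, a named list avoids bare positional numbers whose meaning is only visible in the helper's signature.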
074fe94 to 386547e
[APPROVALNOTIFIER] This PR is NOT APPROVED
3f184f6 to 7c2405a
7c2405a to f30d5fe
Thanks @ChughShilpa, I tested and verified the added test on an m5.4xlarge instance type; the whole test took 6 minutes to run ✔️
According to this line, it needs 12 CPUs to be present on the single cluster node running the master, so the minimum is an m5.4xlarge instance type, which has 16 vCPUs by default.
We decided to proceed with a lightweight dataset like MNIST to reduce CPU resource consumption. So closing this PR; another one will be created for the MNIST tests.
Closes RHOAIENG-16556
Description
This PR adds KFTO training tests that run on CPUs only, using a smaller dataset so that the tests finish in limited time and can be used in downstream testing.
How Has This Been Tested?
Tested the KFTO training tests locally.
Merge criteria: