
Add KFTO Training tests which run with CPUs only #292

Closed
wants to merge 1 commit

Conversation

ChughShilpa
Contributor

Closes RHOAIENG-16556

Description

This PR adds KFTO Training tests that run with CPUs only, using a smaller dataset so the tests execute in limited time and can be used in downstream testing.

How Has This Been Tested?

Tested the KFTO training tests locally.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

  Command: []string{"/bin/sh", "-c"},
- Args: []string{"mkdir /tmp/all_datasets; cp -r /dataset/* /tmp/all_datasets; ls /tmp/all_datasets"},
+ Args: []string{"mkdir /tmp/all_datasets; cp -r /dataset/$(DATASET_SIZE) /tmp/all_datasets/alpaca_data.json"},
Contributor


IMHO you can use the datasetSize property here directly.

Contributor Author


Yes, updated the code to use the datasetSize parameter directly.
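The change discussed here can be sketched as a small helper that substitutes the dataset file name into the init-container command directly, instead of going through a DATASET_SIZE environment variable. This is an illustrative sketch; buildDatasetCopyCommand is a hypothetical name, not a helper from the PR.

```go
package main

import "fmt"

// buildDatasetCopyCommand returns the shell command the init container
// runs to stage the training dataset. datasetSize is the dataset file
// to copy, e.g. "alpaca_data_hundredth.json" for the reduced CPU-only run.
func buildDatasetCopyCommand(datasetSize string) string {
	return fmt.Sprintf(
		"mkdir /tmp/all_datasets; cp -r /dataset/%s /tmp/all_datasets/alpaca_data.json",
		datasetSize,
	)
}

func main() {
	fmt.Println(buildDatasetCopyCommand("alpaca_data_hundredth.json"))
}
```

Passing the parameter straight into the command string avoids the extra indirection of wiring an env var into the container spec.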

- func TestPyTorchJobWithCuda(t *testing.T) {
-     runKFTOPyTorchJob(t, GetCudaTrainingImage(), "nvidia.com/gpu", 1)
+ func TestPyTorchJobWithCudaGpu(t *testing.T) {
+     runKFTOPyTorchJob(t, GetCudaTrainingImage(), "nvidia.com/gpu", "alpaca_data_hundredth.json", 1, 2, "8Gi")
Contributor


It may be better to create a ResourceList for every test case separately and pass it as one parameter.
This way it is easier to see which resources each test case uses.

Contributor Author


Added a ResourceList struct and defined the resource list separately in each test case.
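The suggestion above can be sketched roughly as follows. This is a minimal stand-in, assuming a plain struct; the actual PR may instead build on corev1.ResourceList, and the field names here are illustrative.

```go
package main

import "fmt"

// ResourceList bundles the resources a single test case requests, so the
// call site shows at a glance what each test consumes. Field names are
// assumptions, not the exact ones from the PR.
type ResourceList struct {
	NumGpus        int    // GPUs requested per worker (0 for CPU-only runs)
	WorkerReplicas int    // number of PyTorchJob worker pods
	MemoryPerPod   string // memory request per pod, e.g. "8Gi"
}

func main() {
	// Each test case declares its own ResourceList and passes it as one
	// parameter to the shared runner.
	cpuOnly := ResourceList{NumGpus: 0, WorkerReplicas: 2, MemoryPerPod: "8Gi"}
	cuda := ResourceList{NumGpus: 1, WorkerReplicas: 2, MemoryPerPod: "8Gi"}
	fmt.Printf("cpu-only: %+v\n", cpuOnly)
	fmt.Printf("cuda:     %+v\n", cuda)
}
```

Grouping the values into one struct keeps the runner's signature stable when a new resource knob is added later.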


openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szaher for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChughShilpa ChughShilpa force-pushed the KFTO_CPU branch 2 times, most recently from 3f184f6 to 7c2405a on December 12, 2024 09:16
Contributor

@abhijeet-dhumal abhijeet-dhumal left a comment


Thanks @ChughShilpa, I tested and verified the added test using the m5.4xlarge instance type; the whole test took 6 minutes to run ✔️

According to this line, it needs 12 CPUs to be present on a single cluster node,
so it requires at minimum an m5.4xlarge instance type, which has 16 vCPUs by default.

@ChughShilpa
Contributor Author

We decided to proceed with a lightweight dataset such as MNIST to reduce CPU resource consumption, so I am closing this PR and will create another one for the MNIST tests.

@ChughShilpa ChughShilpa closed this Jan 2, 2025