
Add KFTO Training tests which run with CPUs only #292

Closed
wants to merge 1 commit

Conversation

ChughShilpa
Contributor

Closes RHOAIENG-16556

Description

This PR adds KFTO Training tests that run with CPUs only, using a smaller dataset so the tests execute in limited time and can be used in downstream testing.

How Has This Been Tested?

Tested the KFTO training tests locally.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

  Command: []string{"/bin/sh", "-c"},
- Args: []string{"mkdir /tmp/all_datasets; cp -r /dataset/* /tmp/all_datasets; ls /tmp/all_datasets"},
+ Args: []string{"mkdir /tmp/all_datasets; cp -r /dataset/$(DATASET_SIZE) /tmp/all_datasets/alpaca_data.json"},
Contributor


IMHO you can use the datasetSize property here directly.

Contributor Author


Yes, updated the code to use the datasetSize parameter directly.
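The change discussed here can be sketched as a small helper that substitutes the dataset file name into the init-container command directly, instead of going through a DATASET_SIZE environment variable. This is an illustrative sketch; buildDatasetCopyCommand is a hypothetical name, not a helper from the PR.

```go
package main

import "fmt"

// buildDatasetCopyCommand returns the shell command the init container
// runs to stage the training dataset. datasetSize is the dataset file
// to copy, e.g. "alpaca_data_hundredth.json" for the reduced CPU-only run.
func buildDatasetCopyCommand(datasetSize string) string {
	return fmt.Sprintf(
		"mkdir /tmp/all_datasets; cp -r /dataset/%s /tmp/all_datasets/alpaca_data.json",
		datasetSize,
	)
}

func main() {
	fmt.Println(buildDatasetCopyCommand("alpaca_data_hundredth.json"))
}
```

Passing the parameter straight into the command string avoids the extra indirection of wiring an env var into the container spec.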

- func TestPyTorchJobWithCuda(t *testing.T) {
-     runKFTOPyTorchJob(t, GetCudaTrainingImage(), "nvidia.com/gpu", 1)
+ func TestPyTorchJobWithCudaGpu(t *testing.T) {
+     runKFTOPyTorchJob(t, GetCudaTrainingImage(), "nvidia.com/gpu", "alpaca_data_hundredth.json", 1, 2, "8Gi")
Contributor


It may be better to create a ResourceList for every test case separately and pass it as one parameter.
This way it is easier to see which resources each test case uses.

Contributor Author


Added a ResourceList struct and defined the resource list separately in each test case.
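The suggestion above can be sketched roughly as follows. This is a minimal stand-in, assuming a plain struct; the actual PR may instead build on corev1.ResourceList, and the field names here are illustrative.

```go
package main

import "fmt"

// ResourceList bundles the resources a single test case requests, so the
// call site shows at a glance what each test consumes. Field names are
// assumptions, not the exact ones from the PR.
type ResourceList struct {
	NumGpus        int    // GPUs requested per worker (0 for CPU-only runs)
	WorkerReplicas int    // number of PyTorchJob worker pods
	MemoryPerPod   string // memory request per pod, e.g. "8Gi"
}

func main() {
	// Each test case declares its own ResourceList and passes it as one
	// parameter to the shared runner.
	cpuOnly := ResourceList{NumGpus: 0, WorkerReplicas: 2, MemoryPerPod: "8Gi"}
	cuda := ResourceList{NumGpus: 1, WorkerReplicas: 2, MemoryPerPod: "8Gi"}
	fmt.Printf("cpu-only: %+v\n", cpuOnly)
	fmt.Printf("cuda:     %+v\n", cuda)
}
```

Grouping the values into one struct keeps the runner's signature stable when a new resource knob is added later.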


openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szaher for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ChughShilpa ChughShilpa force-pushed the KFTO_CPU branch 2 times, most recently from 3f184f6 to 7c2405a on December 12, 2024 09:16
Contributor

@abhijeet-dhumal abhijeet-dhumal left a comment


Thanks @ChughShilpa, I tested and verified the added test using the m5.4xlarge instance type; the whole test took 6 minutes to run ✔️

According to this line, it needs 12 CPUs to be present on a single cluster node,
so it requires at minimum an m5.4xlarge instance type, which has 16 vCPUs by default.

@ChughShilpa
Contributor Author

We decided to proceed with a lightweight dataset such as MNIST to reduce CPU resource consumption, so I am closing this PR and will create another one for the MNIST tests.

@ChughShilpa ChughShilpa closed this Jan 2, 2025