KEP-2170: Kubeflow Training V2 API #2171

andreyvelich · 2024-07-17T01:40:11Z

This is the Kubeflow Enhancement Proposal for Kubeflow Training V2 APIs: https://bit.ly/3WzjTlw
Related: #2170

We will collect the final community feedback by mid of next week and start the implementation.

Open Questions:

Since we are planning to introduce OutputArtifact to the ModelConfig, how should we support one-off for different model providers (e.g. HF, PVC, S3, etc.) ? Another option could be to add OutputArtifact as part of TrainerConfig.

Do we want to keep each parameter for dataset and model providers in our APIs, or we should just introduce storageUri field and environment variable that user can modify:

datasetConfig:
  storageUri: hf://yelp_review_full
  env:
    - name: SPLIT
      value: train[:4000]
--- or ---
datasetConfig:
  huggingface:
    repoId: yelp_review_full
    split: train[:4000]

What LLM Fine-Tuning Runtimes do we want to add initially. E.g. Llama-7b, Gemma-7b, Llama-70b?
TrainJobStatus needs to be updated with more info.

/cc @kubeflow/wg-training-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-data-leads @danielvegamyhre @kannon92 @mimowo @ahg-g @kuizhiqing @sandipanpanda @Electronic-Waste @helenxie-bit @franciscojavierarceo @diegolovison @alculquicondor @zw0610 @lowang-bh @deepak-muley @bigsur0 @Syulin7 @itayvallach @richardsliu @shravan-achar @akshaychitneni @StefanoFioravanzo @zanetworker @nairbv @vsoch @yuzisun @brannondorsey @kellyaa

/hold for review

google-oss-prow · 2024-07-17T01:40:27Z

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: kubeflow/wg-data-leads, nairbv, danielvegamyhre, sandipanpanda, bigsur0, kannon92, itayvallach, brannondorsey, kellyaa, ahg-g, shravan-achar, vsoch, akshaychitneni, zanetworker, kubeflow/kubeflow-steering-committee, mimowo.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

This is the Kubeflow Enhancement Proposal for Kubeflow Training V2 APIs: https://bit.ly/3WzjTlw
Related: #2170

We will collect the final community feedback by mid of next week and start the implementation.

Open Questions:

Since we are planning to introduce OutputArtifact to the ModelConfig, how should we support one-off for different model providers (e.g. HF, PVC, S3, etc.) ? Another option could be to add OutputArtifact as part of TrainerConfig.

Do we want to keep each parameter for dataset and model providers in our APIs, or we should just introduce storageUri field and environment variable that user can modify:
datasetConfig:
  storageUri: hf://yelp_review_full
  env:
    - name: SPLIT
      value: train[:4000]
--- or ---
datasetConfig:
  huggingface:
    repoId: yelp_review_full
    split: train[:4000]
What LLM Fine-Tuning Runtimes do we want to add initially. E.g. Llama-7b, Gemma-7b, Llama-70b?

TrainJobStatus needs to be updated with more info.

/cc @kubeflow/wg-training-leads @kubeflow/kubeflow-steering-committee @kubeflow/wg-data-leads @danielvegamyhre @kannon92 @mimowo @ahg-g @kuizhiqing @sandipanpanda @Electronic-Waste @helenxie-bit @franciscojavierarceo @diegolovison @alculquicondor @zw0610 @lowang-bh @deepak-muley @bigsur0 @Syulin7 @itayvallach @richardsliu @shravan-achar @akshaychitneni @StefanoFioravanzo @zanetworker @nairbv @vsoch @yuzisun @brannondorsey @kellyaa

/hold for review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coveralls · 2024-07-17T01:45:01Z

Pull Request Test Coverage Report for Build 10268126274

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.008%) to 33.477%

Totals
Change from base Build 10252995442:	0.008%
Covered Lines:	3949
Relevant Lines:	11796

💛 - Coveralls

vsoch

This is really exciting! I added some comment on question 1 in my notes below, and for 2:

Do we want to keep each parameter for dataset and model providers in our APIs, or we should just introduce storageUri field and environment variable that user can modify:

My preference is for storage URI - it's more flexible, and a design that workflow tools in the HPC community (and even container technologies) have been using and it's simple, straight forward, and seems to work!

I'll try to ping others on my team for feedback, and I hope we get to talk about this at the next batch wg meeting.

docs/proposals/2170-kubeflow-training-v2/README.md

vsoch · 2024-07-17T05:09:00Z

docs/proposals/2170-kubeflow-training-v2/README.md

+spec:
+  trainingRuntimeRef:
+    name: pytorch-distributed
+  trainerConfig:


This would be passed to the scheduler (eventually) - I'm a bit tired so the path there isn't totally clear - I'm guessing it will eventually be turned into units of pods with resource requests.

docs/proposals/2170-kubeflow-training-v2/README.md

vsoch · 2024-07-17T05:11:29Z

docs/proposals/2170-kubeflow-training-v2/README.md

+      numProcPerNode: 5
+  numNodes: 5
+  replicatedJobs:
+    - name: Launcher


I'm a little worried about this design because MPI doesn't always need a node that is explicitly for a launcher - that's a design that the MPI operator has taken (and makes sense, it's part of how MPI design is described), but for example, the lead broker of the flux operator does the launch with that node handling the same work as a worker.

Follow up question - could the Flux Operator be handled here? High level, it's an MPI launcher (but actually is an entire HPC cluster that simply can be used to run one MPI job - it's an indexed job + headless service under the hood, with flux orchestration and nodes connected via a tree based overlay network with zeromq).

I'm a little worried about this design because MPI doesn't always need a node that is explicitly for a launchel

Yes, I know we have an API to run launcher as worker in MPI-Operator: https://github.com/kubeflow/mpi-operator/blob/master/pkg/apis/kubeflow/v2beta1/types.go#L161.

I want to hear @tenzen-y and @alculquicondor opinion on this. We didn't get a chance to deeply discuss how MPI Runtime should look like.

Follow up question - could the Flux Operator be handled here? High level, it's an MPI launcher (but actually is an entire HPC cluster that simply can be used to run one MPI job - it's an indexed job + headless service under the hood, with flux orchestration and nodes connected via a tree based overlay network with zeromq)

Yes, I think that is possible. Do you know if we can construct the Flux cluster for batch using JobSet spec or we need to use Flux operator mini-cluster spec ?

JobSetSpec *batchv1.JobSetSpec `json:",inline"`

The Flux Operator (under the hood) is an indexed job, so yes I could very easily turn that into a JobSet. I actually did a design with JobSet early on, but didn't wind up converting entirely for simple reasons - JobSet wasn't part of Kubernetes core (and would require the user to install an additional thing), and JobSet makes the DNS names much longer, and (in my testing) it often led to errors depending on the name of the job that I chose. Since I just needed the IndexedJob and then the headless service, it was more straight forward to just implement that.

Please put me down to implement Flux Framework into this setup - the nested hierarchical scheduler can handle a lot of really cool stuff within a single job, and the zeromq bootstrap is an improvement over ssh.

Would be interesting to see the benchmarks to compare zmq over ssh for distributed model training. cc @tenzen-y

I can definitely help with that, if there is interest to run experiments. We compared an HPC proxy app (LAMMPS) to the MPI operator, now that was a long time ago (early 2023) and there was about a 5% difference in means, and you can imagine that would compound with runtime (our runs were very short). That said, both have changed immensely since then!

I have a hard time getting the MPI Operator working, but if we wanted to decide on a workflow, I could offer to set it up and run in the Flux Operator. We would need to choose a cloud that has a suitable network and resources for the work. If it's ML then probably Google would work - I have credits currently.

I still think that both MPIJob modes (runLauncherAsWorker and specific launcher) would be helpful for the GPU and CPU mixed clusters. Additionally, the launcher allows us to easily migrate from MPIJob v2 to TrainJob.

Temporarily, I will add the runLauncherAsNode API. Let's discuss it again after we have MPI runtimes.

docs/proposals/2170-kubeflow-training-v2/README.md

franciscojavierarceo · 2024-07-25T18:16:38Z

This is the Kubeflow Enhancement Proposal for Kubeflow Training V2 APIs: https://bit.ly/3WzjTlw Related: #2170

We will collect the final community feedback by mid of next week and start the implementation.

Open Questions:

Since we are planning to introduce OutputArtifact to the ModelConfig, how should we support one-off for different model providers (e.g. HF, PVC, S3, etc.) ? Another option could be to add OutputArtifact as part of TrainerConfig.

FWIW I am in favor of adding OutputArtifact to TrainerConfig.

Do we want to keep each parameter for dataset and model providers in our APIs, or we should just introduce storageUri field and environment variable that user can modify:
datasetConfig:
  storageUri: hf://yelp_review_full
  env:
    - name: SPLIT
      value: train[:4000]
--- or ---
datasetConfig:
  huggingface:
    repoId: yelp_review_full
    split: train[:4000]

I personally like the latter approach rather than an environment variable but just my opinion. 🤷

What LLM Fine-Tuning Runtimes do we want to add initially. E.g. Llama-7b, Gemma-7b, Llama-70b?

Those seems good. I would also recommend one of the good small ones (e.g., Mistral 7B - DPO).

TrainJobStatus needs to be updated with more info.

👍

docs/proposals/2170-kubeflow-training-v2/README.md

franciscojavierarceo

Left some small nits, just for readability.

docs/proposals/2170-kubeflow-training-v2/README.md

andreyvelich · 2024-07-26T15:59:03Z

I personally like the latter approach rather than an environment variable but just my opinion. 🤷

@franciscojavierarceo Please can you check the latest API for model and dataset configs ? I recently made some updates based on this 🧵 #2171 (comment)

What LLM Fine-Tuning Runtimes do we want to add initially. E.g. Llama-7b, Gemma-7b, Llama-70b?

I think, after initial implementation we can add blueprints for more LLMs: Mistral, BERT, Llama, etc.

Signed-off-by: Andrey Velichkevich <[email protected]>

Add runLauncherAsNode parameter Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Andrey Velichkevich <[email protected]>

vsoch · 2024-08-06T12:41:24Z

docs/proposals/2170-kubeflow-training-v2/README.md

+                    - torchrun train.py
+```
+
+Training Operator will create the `PodGroup` using the following spec:


This makes sense. Naive question - does this imply that any custom gang scheduler used is required to make a PodGroup to function? Or if you have a plugin that doesn't create a PodGroup (and does its own thing) the group would be made anyway? And from the example above, how is minMember of 5 derived? You would probably need custom logic per scheduler backend for this API. For example, the PodGroup created by coscheduling vs volcano.sh are different underlying objects. (https://github.com/kubernetes-sigs/scheduler-plugins/blob/d67d2c15199595ccc6218574bd8e07b68b7146f4/apis/scheduling/v1alpha1/types.go#L128 and https://github.com/volcano-sh/apis/blob/78d912ce096c9fd9120488b5c7b999d4996f2cb5/pkg/apis/scheduling/types.go#L147)

If this Training API is going t take on the challenge of orchestrating the logic for creation of needed PodGroup (varying by the plugin) then this probably works OK. Otherwise, I'd still place the responsibility on the user for now for creating the pod group, but just allow the schedulerName and other labels to use it in whatever turns into the actual job or podspec.

does this imply that any custom gang scheduler used is required to make a PodGroup to function?

The PodGroup creation will be implemented only for supported gang-schedulers (initially Volcano and Coscheduling). Since different schedulers require the various PodGroups to be created.

And from the example above, how is minMember of 5 derived?

minMember is always equal to numNodes. For ML Training all gangs should be alive to execute training. In the future, if we find other use-cases (e.g. HPC), we can discuss it again.

If this Training API is going t take on the challenge of orchestrating the logic for creation of needed PodGroup (varying by the plugin) then this probably works OK

Yeah, that's correct.

Otherwise, I'd still place the responsibility on the user for now for creating the pod group, but just allow the schedulerName and other labels to use it in whatever turns into the actual job or podspec.

Btw, this also will be supported, since users can just ignore PodGroupSpec parameter, and set the .spec.schedulerName field in the PodSpec.

Signed-off-by: Andrey Velichkevich <[email protected]>

tenzen-y

Thanks for the updates!
/lgtm

tenzen-y · 2024-08-06T16:12:26Z

docs/proposals/2170-kubeflow-training-v2/README.md

+spec:
+  scheduleTimeoutSeconds: 100
+  minMember: 5
+```


That is a great PodGroup specification unveil!

andreyvelich · 2024-08-06T16:55:33Z

We should be ready to merge this and start the initial implementation.
Feel free to submit the follow-up PRs if we need any updates to the KEP.

Thanks to everyone for the comprehensive review of this proposal!
I am looking forward to see the adoption of these features 🎉

/hold cancel

google-oss-prow bot added the do-not-merge/hold label Jul 17, 2024

google-oss-prow bot requested review from lowang-bh, diegolovison, helenxie-bit, deepak-muley, yuzisun, a team, Syulin7, franciscojavierarceo, alculquicondor, StefanoFioravanzo, zw0610, Electronic-Waste, kuizhiqing and richardsliu July 17, 2024 01:40

google-oss-prow bot added the size/XXL label Jul 17, 2024

vsoch reviewed Jul 17, 2024

View reviewed changes

andreyvelich force-pushed the training-v2-proposal branch from 14aac85 to 85dee06 Compare July 18, 2024 15:55

andreyvelich mentioned this pull request Jul 25, 2024

JobSetTemplate API kubernetes-sigs/jobset#573

Closed

kannon92 reviewed Jul 25, 2024

View reviewed changes

docs/proposals/2170-kubeflow-training-v2/README.md Outdated Show resolved Hide resolved

kannon92 reviewed Jul 25, 2024

View reviewed changes

docs/proposals/2170-kubeflow-training-v2/README.md Outdated Show resolved Hide resolved

kannon92 reviewed Jul 25, 2024

View reviewed changes

docs/proposals/2170-kubeflow-training-v2/README.md Show resolved Hide resolved

franciscojavierarceo reviewed Jul 25, 2024

View reviewed changes

docs/proposals/2170-kubeflow-training-v2/README.md Show resolved Hide resolved

andreyvelich mentioned this pull request Jul 26, 2024

Add CNAI Patterns/Blueprints cncf/tag-runtime#175

Open

andreyvelich mentioned this pull request Jul 30, 2024

[feature] Support different PS/worker types #1369

Closed

andreyvelich added 18 commits August 6, 2024 12:46

Update diagram

39c23ba

Signed-off-by: Andrey Velichkevich <[email protected]>

Refactor Model and Dataset configs

3b43220

Signed-off-by: Andrey Velichkevich <[email protected]>

Update runtime timelines

edefff2

Signed-off-by: Andrey Velichkevich <[email protected]>

Address readability comments

38ed8f9

Signed-off-by: Andrey Velichkevich <[email protected]>

Explaination for Trainer

a87177b

Signed-off-by: Andrey Velichkevich <[email protected]>

Update LLM Fine-Tuning Diagram

47dbd31

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix Llama model name

57d9591

Signed-off-by: Andrey Velichkevich <[email protected]>

Add goal for integration with Kueue

bc373d9

Signed-off-by: Andrey Velichkevich <[email protected]>

Add links for Job run policies

3f5d7bb

Signed-off-by: Andrey Velichkevich <[email protected]>

Add some alternatives

619d167

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix more API types

1c423ab

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix empty number of nodes

ca23867

Signed-off-by: Andrey Velichkevich <[email protected]>

Rename to Coscheduling

77170da

Signed-off-by: Andrey Velichkevich <[email protected]>

Change parameters to env

9acf8a3

Add runLauncherAsNode parameter Signed-off-by: Andrey Velichkevich <[email protected]>

Update PodSpecOverride with scheduling directives

287a4a4

Signed-off-by: Andrey Velichkevich <[email protected]>

Fix TrainingRuntime field

be8177b

Signed-off-by: Andrey Velichkevich <[email protected]>

Refactor PodGroupSpec APIs

f80e780

Signed-off-by: Andrey Velichkevich <[email protected]>

Add note about scheduler name

08fec42

Signed-off-by: Andrey Velichkevich <[email protected]>

andreyvelich force-pushed the training-v2-proposal branch from 643598f to 08fec42 Compare August 6, 2024 11:47

Add initial TrainJob status field

d1c1994

Signed-off-by: Andrey Velichkevich <[email protected]>

vsoch reviewed Aug 6, 2024

View reviewed changes

Fix pre-commit

87ed153

Signed-off-by: Andrey Velichkevich <[email protected]>

tenzen-y reviewed Aug 6, 2024

View reviewed changes

google-oss-prow bot added the lgtm label Aug 6, 2024

google-oss-prow bot removed the do-not-merge/hold label Aug 6, 2024

google-oss-prow bot merged commit 53341c9 into kubeflow:master Aug 6, 2024
38 checks passed

andreyvelich deleted the training-v2-proposal branch August 6, 2024 16:57

This was referenced Aug 9, 2024

KEP-2170: Kubeflow Trainer V2 API #2170

Open

KEP-2170: Add TrainJob and TrainingRuntime APIs #2223

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KEP-2170: Kubeflow Training V2 API #2171

KEP-2170: Kubeflow Training V2 API #2171

andreyvelich commented Jul 17, 2024

google-oss-prow bot commented Jul 17, 2024

Open Questions:

coveralls commented Jul 17, 2024 •

edited

Loading

vsoch left a comment

vsoch Jul 17, 2024

vsoch Jul 17, 2024

vsoch Jul 17, 2024

andreyvelich Jul 18, 2024

andreyvelich Jul 18, 2024

vsoch Jul 18, 2024

andreyvelich Jul 18, 2024

vsoch Jul 18, 2024

tenzen-y Aug 1, 2024

andreyvelich Aug 2, 2024

franciscojavierarceo commented Jul 25, 2024 •

edited

Loading

Open Questions:

franciscojavierarceo left a comment

andreyvelich commented Jul 26, 2024 •

edited

Loading

vsoch Aug 6, 2024

vsoch Aug 6, 2024

andreyvelich Aug 6, 2024

tenzen-y left a comment

tenzen-y Aug 6, 2024

andreyvelich commented Aug 6, 2024

KEP-2170: Kubeflow Training V2 API #2171

KEP-2170: Kubeflow Training V2 API #2171

Conversation

andreyvelich commented Jul 17, 2024

Open Questions:

google-oss-prow bot commented Jul 17, 2024

Open Questions:

coveralls commented Jul 17, 2024 • edited Loading

Pull Request Test Coverage Report for Build 10268126274

Details

💛 - Coveralls

vsoch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

franciscojavierarceo commented Jul 25, 2024 • edited Loading

Open Questions:

franciscojavierarceo left a comment

Choose a reason for hiding this comment

andreyvelich commented Jul 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tenzen-y left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andreyvelich commented Aug 6, 2024

coveralls commented Jul 17, 2024 •

edited

Loading

franciscojavierarceo commented Jul 25, 2024 •

edited

Loading

andreyvelich commented Jul 26, 2024 •

edited

Loading