# CodeFlare Stack Target Users

[Cluster Admin](#cluster-administrator)

[Data Scientist I](#data-scientist-i)

[Data Scientist II](#data-scientist-ii)

## Cluster Administrator

* Quota Management
* Gang-Scheduling for Distributed Compute
* Job/Infrastructure Queuing

I want to enable a team of data scientists to have self-serve, but limited, access to a shared pool of distributed compute resources such as GPUs for large scale machine learning model training jobs. If the existing pool of resources is insufficient, I want my cluster to scale up (to a defined quota) to meet my users' needs and scale back down automatically when their jobs have completed. I want these features to be made available through simple installation of Operators via the OperatorHub UI. I also want the ability to see the current MCAD queue, active and requested resources on my clusters, and the progress of all current jobs visualized in a simple dashboard.

## Data Scientist I

* Training Mid-Size Models (less than 1,000 nodes)
* Fine-Tuning Existing Models
* Ray/KubeRay

I need temporary access to a reasonably large set of GPU-enabled nodes on my team's shared cluster for short-term experimentation, parallelizing my existing ML workflow, or fine-tuning existing large scale models. I'd prefer to work from a notebook environment with access to a Python SDK that I can use to request the creation of Ray Clusters that I can distribute my workloads across. In addition to interactive experimentation work, I also want the ability to "fire-and-forget" longer running ML jobs onto temporarily deployed Ray Clusters, with the ability to monitor these jobs while they are running and access to all of their artifacts once complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.
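
To make the notebook workflow above concrete, a minimal sketch of requesting a Ray Cluster through the CodeFlare SDK might look like the following. The cluster name, namespace, resource sizes, and authentication values are placeholders, and the exact `ClusterConfiguration` fields can differ between SDK releases:

```python
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Authenticate against the shared OpenShift cluster (placeholder values).
auth = TokenAuthentication(token="sha256~XXXX", server="https://api.example.com:6443")
auth.login()

# Describe the temporary Ray Cluster to request; MCAD queues the request
# until the resources (up to the team's quota) are available.
cluster = Cluster(ClusterConfiguration(
    name="experiment-cluster",   # placeholder name
    namespace="my-team",         # placeholder namespace
    min_worker=2,
    max_worker=2,
    min_cpus=8,
    max_cpus=8,
    min_memory=16,               # GiB per worker
    max_memory=16,
    gpu=1,                       # GPUs per worker
))

cluster.up()          # submit the request to the queue
cluster.wait_ready()  # block until the Ray Cluster is running
print(cluster.details())

# ... connect Ray to cluster.cluster_uri() and run distributed workloads ...

cluster.down()        # release the resources when experimentation is done
```

A "fire-and-forget" job can then be submitted onto such a cluster through the SDK's TorchX integration, sketched under Data Scientist II below.
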
## Data Scientist II

* Training Foundation Models (1,000+ nodes)
* TorchX-MCAD
* Ray/KubeRay

I need temporary (but long-term) access to a massive amount of GPU-enabled infrastructure to train a foundation model. I want to be able to "fire-and-forget" my ML job into this environment, which involves submitting it via TorchX either directly to MCAD with the MCAD-Kubernetes scheduler or to a Ray Cluster with the Ray scheduler. Due to the size and cost associated with this job, it has already been well tested and validated, so access to Jupyter notebooks is unnecessary. I would prefer to write my job as a bash script leveraging the CodeFlare CLI, or as a Python script leveraging the CodeFlare SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.
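
As a rough sketch of that fire-and-forget path through the CodeFlare SDK's TorchX integration (rather than the CLI), a submission might look like the following. The job name, script, sizing, and scheduler arguments are illustrative placeholders, and the available `DDPJobDefinition` fields vary by SDK release:

```python
from codeflare_sdk.job.jobs import DDPJobDefinition

# Define the distributed training job (all values are placeholders).
jobdef = DDPJobDefinition(
    name="fm-pretrain",
    script="pretrain.py",
    j="16x8",                                  # nodes x processes per node
    gpu=8,                                     # GPUs per node
    scheduler_args={"namespace": "my-team"},   # assumed MCAD scheduler args
)

# With no Ray Cluster argument, the job goes straight to MCAD through the
# TorchX MCAD-Kubernetes scheduler; passing a Cluster object would instead
# route it to a deployed Ray Cluster via the Ray scheduler.
job = jobdef.submit()

print(job.status())  # monitor the job while it runs
print(job.logs())    # collect logs once it completes
```

The same job definition can therefore target either backend, which keeps the validated training script unchanged whether it runs against MCAD directly or on a temporarily deployed Ray Cluster.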