make user descriptions technology generic
MichaelClifford committed Jul 12, 2023
1 parent 043a14e commit 80317a4
Showing 1 changed file with 5 additions and 6 deletions.
11 changes: 5 additions & 6 deletions target_users.md
@@ -14,19 +14,18 @@
* Gang-Scheduling for Distributed Compute
* Job/Infrastructure Queuing

- I want to enable a team of data scientists to have self-serve, but limited, access to a shared pool of distributed compute resources such as GPUs for large scale machine learning model training jobs. If the existing pool of resources is insufficient, I want my cluster to scale up (to a defined quota) to meet my users’ needs and scale back down automatically when their jobs have completed. I want these features to be made available through simple installation of Operators via the OperatorHub UI. I also want the ability to see the current MCAD queue, active and requested resources on my clusters, and the progress of all current jobs visualized in a simple dashboard.
+ I want to enable a team of data scientists to have self-serve, but limited, access to a shared pool of distributed compute resources such as GPUs for large-scale machine learning model training jobs. If the existing pool of resources is insufficient, I want my cluster to scale up (to a defined quota) to meet my users’ needs and scale back down automatically when their jobs have completed. I want these features to be made available through simple installation of generic modules via a user-friendly interface. I also want the ability to monitor the current queue of pending tasks, the utilization of active resources, and the progress of all current jobs visualized in a simple dashboard.
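As a generic illustration of the “defined quota” this persona mentions, the sketch below uses the standard Kubernetes Python client to cap a team namespace’s CPU, memory, and GPU requests. The namespace name and limits are made-up values, and the operators described above may manage quota through their own custom resources rather than this core mechanism.

```python
# Hypothetical illustration: a plain Kubernetes ResourceQuota capping what a
# data-science team can request from the shared pool. The namespace name and
# limits are placeholders, not values from the document.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() inside a pod
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="ds-team-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "200",            # total CPU cores the team may request
            "requests.memory": "800Gi",       # total memory across all jobs
            "requests.nvidia.com/gpu": "16",  # total GPUs across all jobs
        }
    ),
)

core.create_namespaced_resource_quota(namespace="data-science-team", body=quota)
```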

## Data Scientist I

* Training Mid-Size Models (less than 1,000 nodes)
* Fine-Tuning Existing Models
- * Ray/KubeRay
+ * Distributed Compute Framework

- I need temporary access to a reasonably large set of GPU enabled nodes on my team’s shared cluster for short term experimentation, parallelizing my existing ML workflow, or fine-tuning existing large scale models. I’d prefer to work from a notebook environment with access to a python sdk that I can use to request the creation of Ray Clusters that I can distribute my workloads across. In addition to interactive experimentation work, I also want the ability to “fire-and-forget” longer running ML jobs onto temporarily deployed Ray Clusters with the ability to monitor these jobs while they are running and access to all of their artifacts once complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.
+ I need temporary access to a reasonably large set of GPU-enabled nodes on my team’s shared cluster for short-term experimentation, parallelizing my existing ML workflow, or fine-tuning existing large-scale models. I’d prefer to work from a notebook environment with access to a Python SDK that I can use to request the creation of Framework Clusters that I can distribute my workloads across. In addition to interactive experimentation work, I also want the ability to “fire-and-forget” longer-running ML jobs onto temporarily deployed Framework Clusters, with the ability to monitor these jobs while they are running and access to all of their artifacts once complete. I also want to see where my jobs are in the current queue and the progress of all my current jobs visualized in a simple dashboard.
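The notebook-driven workflow this persona describes might look roughly like the sketch below. The package, class, and parameter names (`frameworksdk`, `FrameworkCluster`, and so on) are placeholders standing in for whatever Python SDK the platform actually ships, not a real API.

```python
# Hypothetical sketch only: "frameworksdk" and every name it exports are
# placeholders for the platform's real Python SDK.
from frameworksdk import FrameworkCluster, Job  # hypothetical imports

# Request a temporary, GPU-enabled cluster from the team's shared pool.
cluster = FrameworkCluster(
    name="fine-tune-exp-01",
    workers=4,                 # worker nodes to request
    gpus_per_worker=1,
    memory_per_worker="64Gi",
)
cluster.up()                   # the request waits in the queue until admitted
cluster.wait_until_ready()

# Interactive use from the notebook, or "fire-and-forget" a longer training job.
job = cluster.submit(Job(script="finetune.py", args=["--epochs", "3"]))
print(job.status())            # e.g. QUEUED / RUNNING / SUCCEEDED
print(job.logs())              # tail logs while the job runs

job.wait()
job.download_artifacts("./results")   # checkpoints, metrics, etc.
cluster.down()                 # hand the nodes back to the shared pool
```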

## Data Scientist II
* Training Foundation Models (1,000+ nodes)
- * TorchX-MCAD
- * Ray/KubeRay
+ * Distributed Compute Framework

- I need temporary (but long term) access to a massive amount of GPU enabled infrastructure to train a foundation model. I want to be able to “fire-and-forget” my ML Job into this environment, which involves submitting my job directly to MCAD via TorchX, with the MCAD-Kubernetes scheduler or a Ray Cluster via TorchX, with the Ray scheduler. Due to the size and cost associated with this job, it has already been well tested and validated, so access to jupyter notebooks is unnecessary. I would prefer to write my job as a bash script leveraging the CodeFlare CLI, or as a python script leveraging the CodeFlare SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.
+ I need temporary (but long-term) access to a massive amount of GPU-enabled infrastructure to train a foundation model. I want to be able to “fire-and-forget” my ML job into this environment. Due to the size and cost associated with this job, it has already been well tested and validated, so access to Jupyter notebooks is unnecessary. I would prefer to write my job as a bash script leveraging a CLI, or as a Python script leveraging an SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once complete. I also want to see where my jobs are in the current queue and the progress of all my current jobs visualized in a simple dashboard.
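For the batch-only path this persona prefers, a script-driven submission might look like the hedged sketch below; again, the module and function names are invented placeholders rather than a documented CLI or SDK.

```python
# Hypothetical sketch: a fire-and-forget submission script with no notebook in
# the loop. "frameworksdk" and its functions are placeholders, not a real API.
from frameworksdk import BatchJob, submit  # hypothetical imports

job = submit(
    BatchJob(
        script="pretrain_foundation_model.py",
        nodes=1024,                  # very large, long-running allocation
        gpus_per_node=8,
        queue="foundation-models",   # placeholder queue / priority class name
    )
)

# Monitor from the same script (or the dashboard) and collect artifacts at the end.
print(job.queue_position())          # where the job sits among pending work
print(job.status())
job.wait()                           # block until the run finishes
job.download_artifacts("./checkpoints")
```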
