From dcc99cea07081a6399056e5eeb14e40d4fb672f6 Mon Sep 17 00:00:00 2001
From: James Busche
Date: Wed, 6 Sep 2023 14:10:05 -0700
Subject: [PATCH] mention guided demos

Signed-off-by: James Busche
---
 Quick-Start.md | 134 ++-----------------------------------------------
 1 file changed, 4 insertions(+), 130 deletions(-)

diff --git a/Quick-Start.md b/Quick-Start.md
index db4aa2a7..867cb2d1 100644
--- a/Quick-Start.md
+++ b/Quick-Start.md
@@ -49,13 +49,13 @@ NOTE: The above resources are just for the infrastructure pods. To be able to ru
 
 This Quick Start guide assumes that you have administrator access to an OpenShift cluster and an existing Open Data Hub (ODH) installation on your cluster. More information about ODH can be found [here](https://opendatahub.io/docs/quick-installation/). But the quick step to install ODH is as follows:
 
-1. Using the OpenShift UI, navigate to Operators --> OperatorHub and search for `Open Data Hub Operator` and install it with the default settings
+ - Using the OpenShift UI, navigate to Operators --> OperatorHub and search for `Open Data Hub Operator` and install it with the default settings. (It should be version 1.Y.Z which you get from the default `rolling` channel)
 
 ### CodeFlare Operator
 
 The CodeFlare operator must be installed from the OperatorHub on your OpenShift cluster.
 
-1. Using the OpenShift UI, navigate to Operators --> OperatorHub and search for `CodeFlare Operator` and install it with the default settings
+- Using the OpenShift UI, navigate to Operators --> OperatorHub and search for `CodeFlare Operator` and install it with the default settings
 
 ### NFD and GPU Operators
 
@@ -123,135 +123,9 @@ git clone https://github.com/project-codeflare/codeflare-sdk
 cd codeflare-sdk
 ```
 
-We will rely on this demo code to train an mnist model. So feel free to open `codeflare-sdk/demo-notebooks/guided-demos/2_basic_jobs.ipynb` to follow along instead.
+### Run the Guided Demo Notebooks
 
-### Run the demo notebook
-
-First, we will import what we need from the SDK.
-
-```python
-# Import pieces from codeflare-sdk
-from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
-from codeflare_sdk.cluster.auth import TokenAuthentication
-from codeflare_sdk.job.jobs import DDPJobDefinition
-```
-
-Then we will go ahead and create an authentication object to access our cluster.
-
-```python
-# Create authentication object for user permissions
-# IF unused, SDK will automatically check for default kubeconfig, then in-cluster config
-# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
-auth = TokenAuthentication(
-    token = "XXXXX",
-    server = "XXXXX",
-    skip_tls=False
-)
-auth.login()
-```
-
-Next, we will define the configuration we'd like for our Ray cluster. A user can update this as needed for the resource requirements of their job.
-
-_Instascale specific configs:_
-
-The configuration for `machine_types` is only used if you have instascale installed. It defines the machine types for the head node and worker nodes, in that order. You must also have the appropriate `machine_set` templates available on your cluster for instascale to recognize them.
-
-If you are working in an on-prem environment, and for the purposes of following this demo, you can simply set `instascale=False` and ignore the `machine_types` configuration.
-
-```python
-cluster_config = ClusterConfiguration(
-    name='jobtest',
-    namespace="default",
-    num_workers=2,
-    min_cpus=1,
-    max_cpus=1,
-    min_memory=4,
-    max_memory=4,
-    num_gpus=0,
-    instascale=False,
-    machine_types = ["m4.xlarge", "g4dn.xlarge"]
-)
-```
-
-In addition to instantiating our cluster object, this will also write a file, `jobtest.yaml`, to your working directory. This file defines an AppWrapper custom resource; everything MCAD needs to deploy your Ray cluster.
-
-Next, we can apply this YAML file and spin up our Ray cluster.
-
-```python
-cluster.up()
-cluster.wait_ready()
-```
-
-You can also check the cluster details with:
-```python
-cluster.details()
-```
-
-You can check the status of the Ray cluster and see when its ready to use with:
-
-```Python
-cluster.status()
-```
-
-Once the cluster is up, you are ready to submit your first job.
-
-We are going to use the CodeFlare SDK to submit batch jobs via TorchX, either to the Ray cluster we have just brought up, or directly to MCAD.
-
-First, let's begin by submitting to Ray, training a basic NN on the MNIST dataset:
-
-The `mnist.py` file used comes from [here](https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/resources/mnist.py), which is accessed in your jupyter notebook under `codeflare-sdk/demo-notebooks/guided-demos/mnist.py`
-
-```python
-jobdef = DDPJobDefinition(
-    name="mnisttest",
-    script="mnist.py",
-    scheduler_args={"requirements": "requirements.txt"}
-)
-job = jobdef.submit(cluster)
-```
-
-Once the job is submitted you can follow it on the Ray dashboard using the following commands to output the job status directly into your notebook:
-```python
-cluster.cluster_dashboard_uri()
-```
-
-```python
-job.status()
-```
-
-```python
-job.logs()
-```
-
-Finally, once the job is done you can shutdown your Ray nodes.
-```python
-cluster.down()
-```
-
-Great! With these guided demos, you have now submitted your first distributed training job with CodeFlare!
-
-Now, an alternative option for job submission is to submit directly to MCAD, which will schedule pods to run the job with requested resources:
-
-```python
-jobdef = DDPJobDefinition(
-    name="mnistjob",
-    script="mnist.py",
-    scheduler_args={"namespace": "default"},
-    j="1x1",
-    gpu=0,
-    cpu=1,
-    memMB=8000,
-    image="quay.io/project-codeflare/mnist-job-test:v0.0.1"
-)
-job = jobdef.submit()
-```
-Once again, we can look at job status and logs as performed previously.
-
-To finalize, the following commands can also be used to delete jobs early for both Ray and MCAD submission and logout and free up the resources on your cluster:
-```python
-job.cancel()
-auth.logout()
-```
+There are a number of guided demos you can follow to become familiar with the CodeFlare-SDK and the CodeFlare stack. Navigate to the path: `codeflare-sdk/demo-notebooks/guided-demos` to see and run the latest demos.
 
 ## Cleaning up the CodeFlare Install
 To completely clean up all the CodeFlare components after an install, follow these steps: