add new document for v2 quickstart
Signed-off-by: Kevin <[email protected]>
KPostOffice authored and openshift-merge-robot committed Sep 12, 2023
1 parent df7a020 commit b3d5af1
Showing 4 changed files with 172 additions and 3 deletions.
141 changes: 141 additions & 0 deletions Quick-Start-ODH-V2.md
@@ -0,0 +1,141 @@
# Quick Start Guide for Distributed Workloads with the CodeFlare Stack

This quick start guide is intended to walk existing Open Data Hub users through installation of the CodeFlare stack and an initial demo using the CodeFlare-SDK from within a Jupyter notebook environment. This will enable users to run and submit distributed workloads.

The CodeFlare-SDK was built to make managing distributed compute infrastructure in the cloud easy and intuitive for data scientists. However, this means there needs to be some cloud infrastructure on the backend for users to get the benefit of the SDK. Currently, we support the CodeFlare stack, which consists of the open source projects [MCAD](https://github.com/project-codeflare/multi-cluster-app-dispatcher), [InstaScale](https://github.com/project-codeflare/instascale), [Ray](https://www.ray.io/), and [PyTorch](https://pytorch.org/).

This stack integrates well with [Open Data Hub](https://opendatahub.io/), and helps to bring batch workloads, jobs, and queuing to the Data Science platform.

## Prerequisites

### Resources

In addition to the resources required by default ODH deployments, you will need the following to deploy the Distributed
Workloads stack infrastructure pods:

```text
Total:
  CPU: 4100m
  Memory: 4608Mi

# By component
Ray:
  CPU: 100m
  Memory: 512Mi
MCAD:
  CPU: 2000m
  Memory: 2Gi
InstaScale:
  CPU: 2000m
  Memory: 2Gi
```

NOTE: The above resources are just for the infrastructure pods. To be able to run actual workloads on your cluster you
will need additional resources based on the size and type of workload.
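
As a quick sanity check of the totals above, the following sketch converts each component's request to common units (millicores for CPU, Mi for memory, with 1Gi = 1024Mi) and sums them. The values are copied from the per-component list; the helper names `to_millicores` and `to_mebibytes` are illustrative only.

```bash
# Sketch: re-derive the "Total" figures from the per-component requests.
# CPU is summed in millicores; memory is normalized to Mi (1Gi = 1024Mi).

to_millicores() {
  # Accepts values such as "100m" or whole cores such as "2".
  case "$1" in
    *m) echo "${1%m}" ;;
    *) echo $(( $1 * 1000 )) ;;
  esac
}

to_mebibytes() {
  # Accepts values such as "512Mi" or "2Gi".
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *) echo "unsupported unit: $1" >&2; return 1 ;;
  esac
}

total_cpu=0
for cpu in 100m 2000m 2000m; do   # Ray, MCAD, InstaScale
  total_cpu=$(( total_cpu + $(to_millicores "$cpu") ))
done

total_mem=0
for mem in 512Mi 2Gi 2Gi; do      # Ray, MCAD, InstaScale
  total_mem=$(( total_mem + $(to_mebibytes "$mem") ))
done

echo "CPU: ${total_cpu}m"         # CPU: 4100m
echo "Memory: ${total_mem}Mi"     # Memory: 4608Mi
```

Remember that these figures cover only the infrastructure pods; your workloads need capacity on top of them.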

### OpenShift and Open Data Hub

This Quick Start guide assumes that you have administrator access to an OpenShift cluster and that an Open Data Hub (ODH) installation at version **~2.Y** is present on your cluster. More information about ODH can be found [here](https://opendatahub.io/docs/quick-installation/). The quick way to install ODH is as follows:

- Using the OpenShift UI, navigate to Operators --> OperatorHub and search for `Open Data Hub Operator` and install it using the `fast` channel. (It should be version 2.Y.Z)
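
To confirm from the CLI that the installed operator is a 2.x release, a small helper like the one below could check the ClusterServiceVersion (CSV) name. It assumes the CSV follows the usual OLM naming pattern `opendatahub-operator.v<major>.<minor>.<patch>` and that the operator was installed into `openshift-operators`; verify both on your cluster. `is_odh_v2` is a hypothetical helper name.

```bash
# Sketch: check whether an Open Data Hub operator CSV name denotes a 2.x release.
# Assumption: CSV names follow the OLM pattern "opendatahub-operator.v<major>.<minor>.<patch>".
is_odh_v2() {
  # Drop the "clusterserviceversion.operators.coreos.com/" prefix printed by `-o name`.
  csv="${1##*/}"
  case "$csv" in
    opendatahub-operator.v2.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (requires cluster access; assumes the operator lives in openshift-operators):
#   for csv in $(oc get csv -n openshift-operators -o name | grep opendatahub-operator); do
#     if is_odh_v2 "$csv"; then echo "OK: $csv"; else echo "Not a 2.x release: $csv"; fi
#   done
```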

### CodeFlare Operator

The CodeFlare operator must be installed from the OperatorHub on your OpenShift cluster. The default settings will
suffice.
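
If you prefer the CLI to the OperatorHub UI, the operator can in principle be installed with an OLM Subscription. The sketch below prints one. The subscription name `codeflare-operator` and the `openshift-operators` namespace match the cleanup steps later in this guide; the package name, the `openshift-marketplace` source namespace, and the helper name `codeflare_subscription` are assumptions. Rather than hard-coding a channel, the usage comment reads the default channel and catalog source from the operator's PackageManifest.

```bash
# Sketch: print an OLM Subscription for the CodeFlare operator.
# Assumptions: package name "codeflare-operator", catalog in openshift-marketplace.
codeflare_subscription() {
  channel="$1"   # e.g. the PackageManifest's default channel
  catalog="$2"   # e.g. the PackageManifest's catalog source
  cat <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: codeflare-operator
  namespace: openshift-operators
spec:
  name: codeflare-operator
  channel: ${channel}
  source: ${catalog}
  sourceNamespace: openshift-marketplace
EOF
}

# Usage (requires cluster access):
#   channel=$(oc get packagemanifest codeflare-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}')
#   catalog=$(oc get packagemanifest codeflare-operator -n openshift-marketplace -o jsonpath='{.status.catalogSource}')
#   codeflare_subscription "$channel" "$catalog" | oc apply -f -
```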

### NFD and GPU Operators

If you want to run GPU-enabled workloads, you will need to install the [Node Feature Discovery Operator](https://github.com/openshift/cluster-nfd-operator) and the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) from the OperatorHub. For instructions on how to install and configure these operators, we recommend [this guide](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/steps-overview.html#high-level-steps).
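
Once both operators are running, one way to confirm that GPUs are schedulable is to sum the `nvidia.com/gpu` allocatable resource across nodes. The `total_gpus` helper below is a hypothetical sketch that only parses text; the `oc get nodes` command in the usage comment is what actually queries the cluster.

```bash
# Sketch: sum allocatable GPUs from "NAME GPU" lines on stdin.
# Nodes without GPUs report "<none>", which is skipped (counted as 0).
total_gpus() {
  awk '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# Usage (requires cluster access; "nvidia.com/gpu" is the resource name
# advertised by the NVIDIA GPU Operator's device plugin):
#   oc get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | total_gpus
```

A result of 0 after the operators have finished rolling out usually means the GPU nodes are not yet labeled or the driver pods are still starting.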


## Creating K8s resources

1. Create the opendatahub namespace with the following command:

```bash
oc create ns opendatahub
```

1. Create a DataScienceCluster with CodeFlare and Ray enabled:

```bash
oc apply -f https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-dsc.yaml
```
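
Before applying the manifest, you may want to check which components it turns on. The sketch below is a line-based filter, not a YAML parser: it assumes each component is written as `<name>:` followed by an `enabled: true|false` line, as in `codeflare-dsc.yaml`. `enabled_components` is a hypothetical helper name.

```bash
# Sketch: print the names of components marked "enabled: true" in a DSC manifest.
# Remembers the most recent bare "<key>:" line and prints it when the
# following "enabled: true" line is seen.
enabled_components() {
  awk '
    /^[ \t]*[A-Za-z0-9_-]+:[ \t]*$/ { key = $1; sub(/:$/, "", key) }
    /^[ \t]*enabled:[ \t]*true[ \t]*$/ { print key }
  ' "$1"
}

# Usage:
#   curl -sSL -o codeflare-dsc.yaml \
#     https://raw.githubusercontent.com/opendatahub-io/distributed-workloads/main/codeflare-dsc.yaml
#   enabled_components codeflare-dsc.yaml
```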

Applying the above DataScienceCluster will result in the following objects being added to your cluster:

1. MCAD
1. InstaScale
1. KubeRay Operator
1. CodeFlare Notebook Image for the Open Data Hub notebook interface

This image is managed by Project CodeFlare and contains the correct versions of codeflare-sdk, PyTorch, TorchX, and the other packages required to run distributed workloads.

At this point you should be able to go to your notebook spawner page, select "Codeflare Notebook" from your list of notebook images, and start an instance.
You can access the spawner page through the Open Data Hub dashboard. The default route should be `https://odh-dashboard-<your ODH namespace>.apps.<your cluster's uri>`. Once you are on your dashboard, select "Launch application" on the Jupyter application. This takes you to your notebook spawner page.
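
The default dashboard URL can be assembled from the cluster's ingress domain. The sketch below assumes the ingress domain (which on OpenShift already begins with `apps.`) is read from the `ingresses.config.openshift.io/cluster` object, and that the dashboard route is named `odh-dashboard`; `odh_dashboard_url` is a hypothetical helper.

```bash
# Sketch: build the default ODH dashboard URL from a namespace and ingress domain.
odh_dashboard_url() {
  namespace="$1"     # the ODH namespace, e.g. opendatahub
  apps_domain="$2"   # e.g. apps.<your cluster's uri>
  echo "https://odh-dashboard-${namespace}.${apps_domain}"
}

# Usage (requires cluster access):
#   apps_domain=$(oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}')
#   odh_dashboard_url opendatahub "$apps_domain"
# Alternatively, read the host directly from the route (assumes it is named "odh-dashboard"):
#   oc get route odh-dashboard -n opendatahub -o jsonpath='{.spec.host}'
```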

### Using an OpenShift Dedicated or ROSA Cluster

If you are using an OpenShift Dedicated or ROSA cluster, you need to create a secret in the `opendatahub` namespace containing your OCM token. You can find your token [here](https://console.redhat.com/openshift/token). In the OpenShift console, navigate to Workloads -> Secrets, click Create, and choose a key/value secret. Use `instascale-ocm-secret` as the secret name, `token` as the key, and your OCM token as the value, then click Create.
<img src="images/instascale-ocm-secret.png" width="80%" height="80%">
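
The same secret can likely also be created from the CLI. This sketch reuses the secret name, key, and namespace given above; `create_instascale_ocm_secret` and the `OCM_TOKEN` variable are hypothetical names.

```bash
# Sketch: create the InstaScale OCM secret with `oc` instead of the console.
create_instascale_ocm_secret() {
  token="$1"
  oc create secret generic instascale-ocm-secret \
    --namespace opendatahub \
    --from-literal=token="$token"
}

# Usage (requires cluster access):
#   create_instascale_ocm_secret "$OCM_TOKEN"
```

Passing the token as an argument can leave it in your shell history, so reading it from an environment variable, as in the usage line, is the safer habit.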

## Submit your first job

We can now go ahead and submit our first distributed model training job to our cluster.
This can be done from any Python-based environment, including a script or a Jupyter notebook. For this guide, we'll assume you've selected the "Codeflare Notebook" from the list of available images on your notebook spawner page.

### Clone the demo code

Once your notebook environment is ready, we want to test our CodeFlare stack by running through some of the demo notebooks provided by the CodeFlare community. Start by cloning their repo into your working environment:
```bash
git clone https://github.com/project-codeflare/codeflare-sdk
cd codeflare-sdk
```
### Run the Guided Demo Notebooks
There are a number of guided demos you can follow to become familiar with the CodeFlare-SDK and the CodeFlare stack. Navigate to the path: `codeflare-sdk/demo-notebooks/guided-demos` to see and run the latest demos.
## Cleaning up the CodeFlare Install
To completely clean up all the CodeFlare components after an install, follow these steps:
1. Make sure no AppWrappers are left running:
```bash
oc get appwrappers -A
```
If any are left, delete them.

2. Remove the notebook and its PVC (the names below are for the `kube:admin` user; adjust them to match your own notebook):
```bash
oc delete notebook jupyter-nb-kube-3aadmin -n opendatahub
oc delete pvc jupyterhub-nb-kube-3aadmin-pvc -n opendatahub
```
3. Remove the example DataScienceCluster (this removes MCAD, InstaScale, KubeRay, and the notebook image):
```bash
oc delete dsc example-dsc
```
4. Remove the CodeFlare Operator CSV and subscription (this removes the CodeFlare Operator from the OpenShift cluster):
```bash
oc delete sub codeflare-operator -n openshift-operators
oc delete csv "$(oc get csv -n openshift-operators | grep codeflare-operator | awk '{print $1}')" -n openshift-operators
```
5. Remove the CodeFlare CRDs:
```bash
oc delete crd instascales.codeflare.codeflare.dev mcads.codeflare.codeflare.dev schedulingspecs.mcad.ibm.com appwrappers.mcad.ibm.com quotasubtrees.ibm.com
```
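
To confirm that step 5 worked, you could filter the cluster's CRD list for the five names deleted above. `leftover_codeflare_crds` is a hypothetical helper: it prints any remaining matches and prints nothing once cleanup is complete.

```bash
# Sketch: print any CodeFlare-related CRDs still present.
# Reads CRD names on stdin, optionally prefixed with
# "customresourcedefinition.apiextensions.k8s.io/" as printed by `oc get crd -o name`.
leftover_codeflare_crds() {
  grep -E '(^|/)(instascales\.codeflare\.codeflare\.dev|mcads\.codeflare\.codeflare\.dev|schedulingspecs\.mcad\.ibm\.com|appwrappers\.mcad\.ibm\.com|quotasubtrees\.ibm\.com)$' || true
}

# Usage (requires cluster access):
#   oc get crd -o name | leftover_codeflare_crds
```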
## Next Steps
And with that, you have gotten started using the CodeFlare stack alongside your Open Data Hub deployment to add distributed workloads and batch computing to your machine learning platform.
You are now ready to try out the stack with your own machine learning workloads. If you'd like more examples, you can also run through the existing demo code provided by the CodeFlare-SDK community.
* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/2_basic_jobs.ipynb)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/3_basic_interactive.ipynb)
4 changes: 2 additions & 2 deletions Quick-Start.md
@@ -163,5 +163,5 @@ And with that you have gotten started using the CodeFlare stack alongside your O
You are now ready to try out the stack with your own machine learning workloads. If you'd like some more examples, you can also run through the existing demo code provided by the Codeflare-SDK community.
* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/guided-demos)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/tree/main/demo-notebooks/additional-demos)
* [Submit batch jobs](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/2_basic_jobs.ipynb)
* [Run an interactive session](https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/3_basic_interactive.ipynb)
4 changes: 3 additions & 1 deletion README.md
@@ -31,4 +31,6 @@ Integration of this stack into the Open Data Hub is owned by the Distributed Wor

## Quick Start

Follow our quick start guide [here](/Quick-Start.md) to get up and running with Distributed Workloads on Open Data Hub.

If you are using version 2 of the ODH operator, follow [this guide](/Quick-Start-ODH-V2.md) instead.
26 changes: 26 additions & 0 deletions codeflare-dsc.yaml
@@ -0,0 +1,26 @@
apiVersion: datasciencecluster.opendatahub.io/v1alpha1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: opendatahub-operator
    app.kubernetes.io/instance: default
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: opendatahub-operator
  name: example-dsc
spec:
  components:
    codeflare:
      enabled: true
    dashboard:
      enabled: true
    datasciencepipelines:
      enabled: false
    kserve:
      enabled: false
    modelmeshserving:
      enabled: false
    ray:
      enabled: true
    workbenches:
      enabled: true
