Commit dcc99ce: mention guided demos
Signed-off-by: James Busche <[email protected]>
jbusche authored and openshift-merge-robot committed Sep 6, 2023 (1 parent: eb89da4)
Showing 1 changed file: Quick-Start.md (4 additions, 130 deletions)

This Quick Start guide assumes that you have administrator access to an OpenShift cluster and an existing Open Data Hub (ODH) installation on your cluster. More information about ODH can be found [here](https://opendatahub.io/docs/quick-installation/). The quick way to install ODH is as follows:

- Using the OpenShift UI, navigate to Operators --> OperatorHub, search for `Open Data Hub Operator`, and install it with the default settings. (It should be version 1.Y.Z, which you get from the default `rolling` channel.)

### CodeFlare Operator

The CodeFlare operator must be installed from the OperatorHub on your OpenShift cluster.

- Using the OpenShift UI, navigate to Operators --> OperatorHub, search for `CodeFlare Operator`, and install it with the default settings.

### NFD and GPU Operators

```bash
git clone https://github.com/project-codeflare/codeflare-sdk
cd codeflare-sdk
```
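If you are running the notebook in a fresh environment, you will also need the SDK itself installed. A minimal sketch, assuming the package is published on PyPI under the name `codeflare-sdk` (pin a version if you need reproducibility):
```python
# Sketch: install the CodeFlare SDK into the current notebook kernel's environment
import subprocess
import sys

subprocess.check_call([sys.executable, "-m", "pip", "install", "codeflare-sdk"])
```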
We will rely on this demo code to train an MNIST model, so feel free to open `codeflare-sdk/demo-notebooks/guided-demos/2_basic_jobs.ipynb` and follow along there instead.
### Run the Guided Demo Notebooks
First, we will import what we need from the SDK.
```python
# Import pieces from codeflare-sdk
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.job.jobs import DDPJobDefinition
```
Then we will go ahead and create an authentication object to access our cluster.
```python
# Create authentication object for user permissions
# If unused, the SDK will automatically check for a default kubeconfig, then for an in-cluster config
# KubeConfigFileAuthentication can also be used to specify a kubeconfig path manually
auth = TokenAuthentication(
    token = "XXXXX",
    server = "XXXXX",
    skip_tls=False
)
auth.login()
```
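The comments above also mention `KubeConfigFileAuthentication`. As a rough sketch (class and method names assumed from the SDK's auth module; point the path at wherever your kubeconfig lives), authenticating from a local kubeconfig file instead of a token could look like this:
```python
# Sketch: kubeconfig-based login instead of a token (the path below is an assumption)
from codeflare_sdk.cluster.auth import KubeConfigFileAuthentication

auth = KubeConfigFileAuthentication(
    kube_config_path="~/.kube/config"  # replace with your kubeconfig path
)
auth.load_kube_config()
```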
Next, we will define the configuration we'd like for our Ray cluster. Update this as needed for the resource requirements of your job.
_InstaScale-specific configs:_
The `machine_types` configuration is only used if you have InstaScale installed. It defines the machine types for the head node and the worker nodes, in that order, and the corresponding `machine_set` templates must be available on your cluster for InstaScale to recognize them.
If you are working in an on-prem environment, or simply want to follow this demo, you can set `instascale=False` and ignore the `machine_types` configuration.
```python
cluster = Cluster(ClusterConfiguration(
    name='jobtest',
    namespace="default",
    num_workers=2,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    num_gpus=0,
    instascale=False,
    machine_types=["m4.xlarge", "g4dn.xlarge"]
))
```
In addition to instantiating our cluster object, this will also write a file, `jobtest.yaml`, to your working directory. This file defines an AppWrapper custom resource, which contains everything MCAD needs to deploy your Ray cluster.
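If you are curious about what was generated, a quick way to sanity-check the file is to load it and print a few fields. This is only a sketch, assuming PyYAML is available in your notebook environment and that the file was written to the current working directory:
```python
# Sketch: inspect the AppWrapper definition written alongside the cluster object
import yaml

with open("jobtest.yaml") as f:
    appwrapper = yaml.safe_load(f)

print(appwrapper["kind"])                    # expected to be AppWrapper
print(appwrapper["metadata"]["namespace"])   # the namespace set above, e.g. default
```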
Next, we can apply this YAML file and spin up our Ray cluster.
```python
cluster.up()
cluster.wait_ready()
```
You can also check the cluster details with:
```python
cluster.details()
```
You can check the status of the Ray cluster and see when it's ready to use with:
```python
cluster.status()
```
Once the cluster is up, you are ready to submit your first job.
We are going to use the CodeFlare SDK to submit batch jobs via TorchX, either to the Ray cluster we have just brought up, or directly to MCAD.
First, let's begin by submitting to Ray, training a basic NN on the MNIST dataset:
The `mnist.py` file comes from [here](https://github.com/opendatahub-io/distributed-workloads/blob/main/tests/resources/mnist.py) and is available in your Jupyter environment at `codeflare-sdk/demo-notebooks/guided-demos/mnist.py`.
```python
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"}
)
job = jobdef.submit(cluster)
```
Once the job is submitted, you can follow it on the Ray dashboard, or use the following commands to output the job status and logs directly in your notebook:
```python
cluster.cluster_dashboard_uri()
```
```python
job.status()
```
```python
job.logs()
```
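If you would rather block until the job finishes than re-run `job.status()` by hand, a simple polling loop works too. This is only a sketch: it assumes the string form of the status eventually contains a terminal TorchX state such as `SUCCEEDED` or `FAILED`.
```python
# Sketch: poll the job until it reaches a terminal state (state names are assumptions)
import time

while True:
    status = str(job.status())
    print(status)
    if "SUCCEEDED" in status or "FAILED" in status:
        break
    time.sleep(10)
```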
Finally, once the job is done, you can shut down your Ray nodes.
```python
cluster.down()
```
Great! You have now submitted your first distributed training job with CodeFlare!
An alternative option is to submit the job directly to MCAD, which will schedule pods to run it with the requested resources:
```python
jobdef = DDPJobDefinition(
    name="mnistjob",
    script="mnist.py",
    scheduler_args={"namespace": "default"},
    j="1x1",
    gpu=0,
    cpu=1,
    memMB=8000,
    image="quay.io/project-codeflare/mnist-job-test:v0.0.1"
)
job = jobdef.submit()
```
Once again, we can check the job status and logs as before.
To wrap up, the following commands can be used to cancel a job early (for both Ray and MCAD submissions), log out, and free up the resources on your cluster:
```python
job.cancel()
auth.logout()
```
There are a number of guided demos you can follow to become familiar with the CodeFlare SDK and the CodeFlare stack. Navigate to `codeflare-sdk/demo-notebooks/guided-demos` to see and run the latest demos.
## Cleaning up the CodeFlare Install
To completely clean up all the CodeFlare components after an install, follow these steps: