Cleaned up the data processing README
arueth committed Mar 29, 2024
1 parent bbff3bb commit d7330e8
Showing 2 changed files with 99 additions and 79 deletions.
@@ -21,91 +21,113 @@ The preprocessing.py file does the following:

## How to use this repo:

1. Clone the repository and change directory to the guide directory

```
git clone https://github.com/GoogleCloudPlatform/ai-on-gke && \
cd ai-on-gke/best-practices/ml-platform/examples/use-case/ray/dataprocessing
```

1. Set environment variables

```
CLUSTER_NAME=<your_cluster_name>
PROJECT_ID=<your_project_id>
PROCESSING_BUCKET=<your_bucket_name>
DOCKER_IMAGE_URL=us-docker.pkg.dev/${PROJECT_ID}/dataprocessing/dp:v0.0.1
```

1. Create a Cloud Storage bucket to store raw data

```
gcloud storage buckets create gs://${PROCESSING_BUCKET} --project ${PROJECT_ID}
```

1. Download the raw data CSV file from above and store it in the bucket created in the previous step.
The Kaggle CLI can be installed using the following [instructions](https://github.com/Kaggle/kaggle-api#installation).
To use the CLI, you must create an API token (Kaggle > User Profile > API > Create New Token) and store the downloaded file in $HOME/.kaggle/kaggle.json; a minimal setup sketch is shown after the download command below.
Alternatively, the dataset can be [downloaded](https://www.kaggle.com/datasets/atharvjairath/flipkart-ecommerce-dataset) directly from the Kaggle website.

```
kaggle datasets download --unzip atharvjairath/flipkart-ecommerce-dataset && \
gcloud storage cp flipkart_com-ecommerce_sample.csv \
gs://${PROCESSING_BUCKET}/flipkart_raw_dataset/flipkart_com-ecommerce_sample.csv
```
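
If the Kaggle CLI is not already set up, a minimal setup sketch might look like the following; it assumes the CLI is installed with pip and that the token file downloaded from the Kaggle website is sitting in your current directory:

```
# Install the Kaggle CLI (see the installation instructions linked above)
pip3 install --user kaggle

# Move the API token to the location the CLI expects and restrict its permissions
mkdir -p ${HOME}/.kaggle
mv kaggle.json ${HOME}/.kaggle/kaggle.json
chmod 600 ${HOME}/.kaggle/kaggle.json
```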

1. Provide the respective GCS bucket access rights to the GKE Kubernetes service accounts:
the Ray head needs access to read the raw source data in the storage bucket, and
the Ray worker(s) need access to write data to the storage bucket.

```
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member "serviceAccount:wi-ml-team-ray-head@${PROJECT_ID}.iam.gserviceaccount.com" \
--role roles/storage.objectViewer
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member "serviceAccount:wi-ml-team-ray-worker@${PROJECT_ID}.iam.gserviceaccount.com" \
--role roles/storage.objectAdmin
```
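
Optionally, you can verify that the role bindings were applied; this is only a sanity check and assumes the service account names used above:

```
gcloud projects get-iam-policy ${PROJECT_ID} \
  --flatten="bindings[].members" \
  --filter="bindings.members:wi-ml-team-ray-head@${PROJECT_ID}.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```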

1. Create an Artifact Registry repository for your Docker image

```
gcloud artifacts repositories create dataprocessing \
--repository-format=docker \
--location=us \
--project=${PROJECT_ID} \
--async
```
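
Because the repository is created with `--async`, it may be worth confirming that it exists before building the image; for example:

```
gcloud artifacts repositories describe dataprocessing \
  --location=us \
  --project=${PROJECT_ID}
```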

1. Enable the Cloud Build APIs

```
gcloud services enable cloudbuild.googleapis.com --project ${PROJECT_ID}
```

1. Build container image using Cloud Build and push the image to Artifact Registry

```
cd src && \
gcloud builds submit --tag ${DOCKER_IMAGE_URL} . && \
cd ..
```
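
To confirm the image was pushed successfully, you can list the images in the repository created earlier:

```
gcloud artifacts docker images list us-docker.pkg.dev/${PROJECT_ID}/dataprocessing
```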

1. Update the respective variables in the Job submission manifest to reflect your configuration.

- Image: the Docker image that was built in the previous step
- Processing bucket: the GCS bucket where the source data and results will be stored
- Ray Cluster Host: as configured in this example, it should not need to be changed; if your Ray cluster service is named differently or is in a different namespace, update it accordingly

```
sed -i "s|#IMAGE|${DOCKER_IMAGE_URL}|" job.yaml && \
sed -i "s|#PROCESSING_BUCKET|${PROCESSING_BUCKET}|" job.yaml
```
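
To double-check the substitutions before submitting the Job, you can print the image and environment value lines of the manifest:

```
grep -E "image:|value:" job.yaml
```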

1. Get credentials for the GKE cluster

```
gcloud container fleet memberships get-credentials ${CLUSTER_NAME}
```
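
A quick way to confirm that kubectl now points at the cluster and that the target namespace exists:

```
kubectl get nodes
kubectl get namespace ml-team
```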

1. Create the Job in the "ml-team" namespace using the kubectl command

```
kubectl apply -f job.yaml
```
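
You can also follow the Job from the command line; `<job_name>` is a placeholder for the `metadata.name` set in `job.yaml`, and the commands assume the Job runs in the `ml-team` namespace as described above:

```
kubectl --namespace ml-team get jobs
kubectl --namespace ml-team get pods --watch
kubectl --namespace ml-team logs --follow job/<job_name>
```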

1. Monitor the execution in Ray Dashboard

- Jobs -> Running Job ID
  - See the Tasks/actors overview for Running jobs
  - See the Task Table for a detailed view of task and assigned node(s)
- Cluster -> Node List
  - See the Ray actors running on the worker process
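
If the Ray Dashboard is not already exposed outside the cluster, one way to reach it from your workstation is to port-forward the Ray head service; this assumes the head service name used in this example and Ray's default dashboard port 8265:

```
kubectl --namespace ml-team port-forward service/ray-cluster-kuberay-head-svc 8265:8265
```

The dashboard should then be available at http://localhost:8265.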

1. Once the Job has completed, both the prepared dataset (as a CSV) and the images are stored in Google Cloud Storage.

```
gcloud storage ls gs://${PROCESSING_BUCKET}/flipkart_preprocessed_dataset/flipkart.csv
gcloud storage ls gs://${PROCESSING_BUCKET}/flipkart_images
```
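
If you want to inspect the prepared dataset locally, you can copy it out of the bucket:

```
gcloud storage cp gs://${PROCESSING_BUCKET}/flipkart_preprocessed_dataset/flipkart.csv .
```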

For additional information about converting your code from a notebook to run as a Job on GKE, see the [Conversion Guide](CONVERSION.md)
@@ -11,13 +11,11 @@ spec:
    spec:
      containers:
      - name: job
        image: #IMAGE
        env:
        - name: "PROCESSING_BUCKET"
          value: #PROCESSING_BUCKET
        - name: "RAY_CLUSTER_HOST"
          value: "ray-cluster-kuberay-head-svc.ml-team:10001"
      restartPolicy: Never
      serviceAccountName: ray-worker
######################Ray code sample#################################
