[DOCS] Apiserver improve docs readability #3564


Open · wants to merge 13 commits into master from apiserver-improve-docs-readability

Conversation

machichima
Contributor

@machichima machichima commented May 8, 2025

  • Mainly update the following docs to make the examples easier to follow:

    • Autoscaling.md
    • CreatingServe.md
    • HACluster.md
    • JobSubmission.md
    • Monitoring.md
    • SecuringImplementation.md
  • Add dangling docs to README's "Advanced Usage" section

Why are these changes needed?

Some of the documents for the apiserver are a bit hard to follow; we should improve their readability.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@machichima machichima force-pushed the apiserver-improve-docs-readability branch from 9832ea2 to 3d6d9e1 May 9, 2025 15:26
@machichima machichima marked this pull request as ready for review May 10, 2025 05:41
@machichima
Contributor Author

Hi @dentiny, would you mind taking a look at this?
Thanks!

@dentiny dentiny added the apiserver and docs (Improvements or additions to documentation) labels May 10, 2025
@dentiny dentiny self-assigned this May 10, 2025
Contributor

@dentiny dentiny left a comment


Thanks for the effort, looks really great!
Just finished the first iteration, will take a deeper look later tonight.
Left some comments and questions for discussion :)

test-cluster-head-svc ClusterIP 10.96.19.185 <none> 8265/TCP,52365/TCP,10001/TCP,8080/TCP,6379/TCP,8000/TCP
test-cluster-serve-svc ClusterIP 10.96.144.162 <none> 8000/TCP
```
Note that the 52365 port for head node service is for serve configuration.
Contributor


Do you mind explaining a bit?

Contributor Author


Sorry, I think this was not clear. I updated it to:

Note that we set the 52365 port for dashboard agent in the above curl command, which is used internally by Ray Serve.
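
For reference, hitting that port directly looks roughly like this (the service name follows the earlier example, and the Serve REST endpoint path is an assumption on my part):

```sh
# Forward the dashboard agent port (52365) of the head service to localhost.
kubectl port-forward svc/test-cluster-head-svc 52365:52365 &

# Query the Ray Serve REST API exposed by the dashboard agent; it returns the
# currently deployed Serve applications and their configuration.
curl -X GET http://localhost:52365/api/serve/applications/
```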

ensure a high availability Global Control Service (GCS) data. The GCS manages
cluster-level metadata by storing all data in memory, which is lack of fault tolerance. A
single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
we should have a highly available Redis so that when GCS restart, it can resume its
Contributor


Just want to make sure: Ray's HA mode relies on Redis, but not necessarily an HA Redis, right?

Contributor Author


Yes!

Contributor


I prefer using the word "redis" instead of "HA redis" here; these are two different products in cloud vendors' offerings.

Contributor


Redis itself doesn't provide persistence either.

"RAY_gcs_rpc_server_reconnect_timeout_s": "300"
}
},
kubectl exec -it ha-cluster-head -- python3 /home/ray/samples/detached_actor.py
Contributor


Local script?

Contributor Author

@machichima machichima May 11, 2025


This script, detached_actor.py, is defined in a config map and mounted into the head node; it is not a local script. Let me add some description here.
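
Roughly, the flow is as follows (the ConfigMap name here is illustrative; the actual manifest is linked from the doc):

```sh
# The sample code lives in a ConfigMap, not on the local machine. Creating one
# from a local file would look like:
kubectl create configmap ray-example-code --from-file=detached_actor.py

# The RayCluster spec mounts that ConfigMap into the head pod under
# /home/ray/samples/, which is why the exec below finds the script there:
kubectl exec -it ha-cluster-head -- python3 /home/ray/samples/detached_actor.py
```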

@@ -186,8 +204,7 @@ This should return the list of the submissions, that looks as follows:
```json
{
"submissions":[
{
"entrypoint":"python /home/ray/samples/sample_code.py",
{ "entrypoint":"python /home/ray/samples/sample_code.py",
Contributor


Why this change?

Contributor Author


Sorry, I copied the result from my terminal and it did not format well. Just fixed it.

## Monitoring of the API server
In order to ensure a proper functioning of the API server and created RayClusters, it is
typically necessary to monitor them. This document describes how to monitor both API
server and created clusters with Prometheus and Grafana
Contributor


nit: add a "." at the end of the sentence

@@ -17,7 +17,7 @@ You could build and start apiserver from scratch on your local environment in on
make start-local-apiserver
```

apiserver supports HTTP request, so you could easily check whether it's started successfully by issuing two simple curl requests.
Apiserver supports HTTP request, so you could easily check whether it's started successfully by issuing two simple curl requests.
Contributor


L4, I don't think we're going to provide grpc API in V2

Contributor Author


I think V2 is in apiserversdk/ and the one here is V1?
Should we also take gRPC out from here?
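
For reference, the two curl requests the excerpt refers to are along these lines (the endpoints and the 31888 NodePort are assumptions based on the examples elsewhere in these docs):

```sh
# A valid (even if empty) JSON response to both requests means the apiserver
# started successfully.
curl http://localhost:31888/apis/v1/namespaces/default/compute_templates
curl http://localhost:31888/apis/v1/namespaces/default/clusters
```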

@machichima machichima requested a review from dentiny May 16, 2025 12:15
```

Alternatively, you could build and deploy the Operator and API server from local repo for
development purpose.
### IMPORTANT: Change your working directory to `apiserver/`
Contributor


If you think it's something worth noticing, do you think it would be better to use

> [!IMPORTANT]
> Change your working directory to `apiserver/`; All the following guidance require you to switch your working directory to the KubeRay
`apiserver`

Use following command to create a compute template and a RayCluster with RayService support:

```sh
cur
Contributor


What is cur?

Contributor Author


Sorry, this was added by accident.

The RayCluster with high availability can also be created in API server, which aims to
ensure a high availability Global Control Service (GCS) data. The GCS manages
cluster-level metadata by storing all data in memory, which is lack of fault tolerance. A
single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
Contributor


Suggested change
single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
single head node failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,

```json
"environment": {
"values": {
"RAY_REDIS_ADDRESS": "redis.redis.svc.cluster.local:6379"
Contributor


Just curious (I've never used HA Ray before): why redis.redis. instead of redis.?

Contributor Author

@machichima machichima May 19, 2025


I think the first redis is the service name, while the second is the namespace (which is also called redis).
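
In other words, the in-cluster DNS convention is `<service>.<namespace>.svc.cluster.local`. A quick sanity check, assuming Redis really is installed into its own `redis` namespace:

```sh
# A Service named "redis" in namespace "redis" resolves in-cluster as
# redis.redis.svc.cluster.local; RAY_REDIS_ADDRESS just appends the port.
kubectl get svc redis -n redis
```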

@@ -134,7 +101,7 @@ This should return JSON similar to the one below
{
"entrypoint":"python /home/ray/samples/sample_code.py",
"jobId":"02000000",
"submissionId":"raysubmit_KWZLwme56esG3Wcr",
"submissionId":"<submissionID>",
Contributor

@dentiny dentiny May 18, 2025


I think leaving a submission id here is not bad, because users could easily see what a submission id looks like; it's not security-sensitive information either.
No strong opinion though.

Contributor Author


No problem!

Refer to [README](README.md) for setting up KubeRay operator and API server. This will
set the flag `collectMetricsFlag` to `true` which enable the metrics collection.

### IMPORTANT: Change your working directory to project root

Contributor

@dentiny dentiny left a comment


Thanks!

@kevin85421
Member

@rueian can you also take a look at this PR? Thanks!


```shell
```sh
# Create compute tempalte
Contributor


Suggested change
# Create compute tempalte
# Create compute template


Once they are set up, you first need to create a Ray cluster using the following commands:
Before running the example, you need to first deploy a RayCluster with following command.
Contributor


Suggested change
Before running the example, you need to first deploy a RayCluster with following command.
Before running the example, you need to first deploy a RayCluster with the following command.

test-cluster-head-pr25j 2/2 Running 0 2m49s
### Validate that RayCluster is deployed correctly

Run following command to get list of pods running. You should see something like below:
Contributor


Suggested change
Run following command to get list of pods running. You should see something like below:
Run the following command to get a list of pods running. You should see something like below:


Create a detached actor:
Create a detached actor to trigger scale-up with following command:
Contributor


Suggested change
Create a detached actor to trigger scale-up with following command:
Create a detached actor to trigger scale-up with the following command:

@@ -165,16 +144,28 @@ curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \
}'
```

A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s, we specified 30) seconds. Run:
While actor is deleted, we do not need the worker anymore. The worker pod will be deleted
Contributor


Suggested change
While actor is deleted, we do not need the worker anymore. The worker pod will be deleted
While the actor is deleted, we do not need the worker anymore. The worker pod will be deleted
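
One way to observe the scale-down described in this excerpt (assuming the pods carry the standard `ray.io/cluster` label and the cluster name used earlier in the doc):

```sh
# Watch the cluster's pods; once the detached actor is deleted and the worker
# has been idle for idleTimeoutSeconds (30s here), its pod should terminate.
kubectl get pods -l ray.io/cluster=test-cluster --watch
```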


The following environment variable have to be added here:
Run following command for creating a detached actor. Please change `ha-cluster-head` to
Contributor


Suggested change
Run following command for creating a detached actor. Please change `ha-cluster-head` to
Run the following command for creating a detached actor. Please change `ha-cluster-head` to

```

Once this is done, open Ray dashboard (using port-forward). In the cluster tab you should see 2 nodes and in the
Actor's pane you should see created actor.
Note that only head node will be recreated, while the worker node stays as is.
Contributor


Suggested change
Note that only head node will be recreated, while the worker node stays as is.
Note that only the head node will be recreated, while the worker node stays as is.
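
The port-forward this excerpt mentions would be something like the following (the service name assumes the HA example cluster):

```sh
# Expose the Ray dashboard locally, then open http://localhost:8265.
kubectl port-forward svc/ha-cluster-head-svc 8265:8265
```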


### Deploy Ray cluster
We will use this [ConfigMap] which contains code for our example. Please download the
config map and deploy it with following command:
Contributor


Suggested change
config map and deploy it with following command:
config map and deploy it with the following command:
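
The deploy step itself is a single apply (the file name here is illustrative):

```sh
# Create the ConfigMap with the example code before creating the cluster.
kubectl apply -f example-code-configmap.yaml
```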

```

Note that this cluster is mounting a volume from a configmap. This config map should be created
prior to cluster creation using [this YAML].
To check if the RayCluster setup correctly, list all pods with following command. You can
Contributor


Suggested change
To check if the RayCluster setup correctly, list all pods with following command. You can
To check if the RayCluster setup correctly, list all pods with the following command. You can

[command](../install/prometheus/install.sh). The script additionally creates `ray-head-monitor` and
`ray-workers-monitor` in the `prometheus-system` namespace, that we do not need. We can delete them using:
> [!IMPORTANT]
> All the following guidance require you to switch your working directory to the KubeRay project root
Contributor


Suggested change
> All the following guidance require you to switch your working directory to the KubeRay project root
> All the following guidance requires you to switch your working directory to the KubeRay project root
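
For reference, the deletion this excerpt leads into is presumably along these lines (the resource kinds assume the Prometheus operator CRDs used by the install script):

```sh
# Remove the monitors created by install.sh that this guide does not use.
kubectl delete servicemonitor ray-head-monitor -n prometheus-system
kubectl delete podmonitor ray-workers-monitor -n prometheus-system
```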
