[DOCS] Apiserver improve docs readability #3564

machichima · 2025-05-08T15:15:05Z

Mainly update following docs to make the example easy to follow:
- Autoscaling.md
- CreatingServe.md
- HACluster.md
- JobSubmission.md
- Monitoring.md
- SecuringImplementation.md
Add dangling docs to README's "Advanced Usage" section

Why are these changes needed?

Some of the documents for apiserver is a bit hard to follow, we should improve their readability.

Related issue number

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- This PR is not tested :(

-e Signed-off-by: machichima <[email protected]>

machichima · 2025-05-10T05:46:57Z

Hi @dentiny , would you mind taking a look at this?
Thanks!

dentiny

Thanks for the effort, looks really great!
Just finish first iteration, will take a deep look later tonight.
Left some comments and questions for discussion :)

apiserver/Autoscaling.md

dentiny · 2025-05-11T11:04:24Z

apiserver/CreatingServe.md

-test-cluster-head-svc    ClusterIP   10.96.19.185    <none>        8265/TCP,52365/TCP,10001/TCP,8080/TCP,6379/TCP,8000/TCP
-test-cluster-serve-svc   ClusterIP   10.96.144.162   <none>        8000/TCP
-```
+Note that the 52365 port for head node service is for serve configuration.


Do you mind explain a bit?

Sorry I think this is not clear. I updated to this:

Note that we set the 52365 port for dashboard agent in the above curl command, which is used internally by Ray Serve.

dentiny · 2025-05-11T11:05:35Z

apiserver/HACluster.md

+ensure a high availability Global Control Service (GCS) data. The GCS manages
+cluster-level metadata by storing all data in memory, which is lack of fault tolerance. A
+single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
+we should have a highly available Redis so that when GCS restart, it can resume its


Just want to make sure ray's HA mode relies on redis, but not necessarily HA redis, right?

I prefer using the word "redis" instead of "HA redis" here, these are two different products from cloud vendor offering.

redis itself doesn't provide persistence as well.

dentiny · 2025-05-11T11:09:35Z

apiserver/HACluster.md

-             "RAY_gcs_rpc_server_reconnect_timeout_s": "300"
-           }
-        },
+kubectl exec -it ha-cluster-head -- python3 /home/ray/samples/detached_actor.py


Local script?

This script detached_actor.py is defined in config map and mounted to head node, not in local. Let me add some description here

dentiny · 2025-05-11T11:12:50Z

apiserver/JobSubmission.md

@@ -186,8 +204,7 @@ This should return the list of the submissions, that looks as follows:
 ```json
 {
   "submissions":[
-      {
-         "entrypoint":"python /home/ray/samples/sample_code.py",
+      { "entrypoint":"python /home/ray/samples/sample_code.py",


why this change?

Sorry I copied the result from my terminal, and it looks like it does not format well. Just fixed

dentiny · 2025-05-11T11:13:41Z

apiserver/Monitoring.md

-## Monitoring of the API server
+In order to ensure a proper functioning of the API server and created RayClusters, it is
+typically necessary to monitor them. This document describes how to monitor both API
+server and created clusters with Prometheus and Grafana


nit: add a "." at the end of the sentence

Signed-off-by: machichima <[email protected]>

dentiny · 2025-05-12T05:27:05Z

apiserver/README.md

@@ -17,7 +17,7 @@ You could build and start apiserver from scratch on your local environment in on
 make start-local-apiserver
 ```

-apiserver supports HTTP request, so you could easily check whether it's started successfully by issuing two simple curl requests.
+Apiserver supports HTTP request, so you could easily check whether it's started successfully by issuing two simple curl requests.


L4, I don't think we're going to provide grpc API in V2

I think V2 is in apiserversdk/ and the one here is V1?
Should we also take gRPC out from here?

Signed-off-by: machichima <[email protected]>

dentiny · 2025-05-18T23:21:48Z

apiserver/Autoscaling.md

 ```

-Alternatively, you could build and deploy the Operator and API server from local repo for
-development purpose.
+### IMPORTANT: Change your working directory to `apiserver/`


If you think it's something worth noticing, do you think it better to

> [!IMPORTANT] > Change your working directory to `apiserver/`; All the following guidance require you to switch your working directory to the KubeRay `apiserver`

dentiny · 2025-05-18T23:32:32Z

apiserver/CreatingServe.md

+Use following command to create a compute template and a RayCluster with RayService support:
+
+```sh
+cur


what is cur?

Sorry this is added in accident

dentiny · 2025-05-18T23:39:17Z

apiserver/HACluster.md

+The RayCluster with high availability can also be created in API server, which aims to
+ensure a high availability Global Control Service (GCS) data. The GCS manages
+cluster-level metadata by storing all data in memory, which is lack of fault tolerance. A
+single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,


Suggested change

single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,

single head node failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,

dentiny · 2025-05-18T23:40:50Z

apiserver/HACluster.md

+```json
+"environment": {
+    "values": {
+        "RAY_REDIS_ADDRESS": "redis.redis.svc.cluster.local:6379"


just curious (I never used HA ray before), why redis.redis. instead of redis.?

I think the first redis is the service name, while the second is the namespace (which also called redis)

dentiny · 2025-05-18T23:43:25Z

apiserver/JobSubmission.md

@@ -134,7 +101,7 @@ This should return JSON similar to the one below
 {
   "entrypoint":"python /home/ray/samples/sample_code.py",
   "jobId":"02000000",
-   "submissionId":"raysubmit_KWZLwme56esG3Wcr",
+   "submissionId":"<submissionID>",


I think leaving a submission id here is not bad because users could easily see what sub-id looks like; it's not sensitive information related to security as well.
No strong opinion though

No problem!

dentiny · 2025-05-18T23:44:53Z

apiserver/Monitoring.md

+Refer to [README](README.md) for setting up KubeRay operator and API server. This will
+set the flag `collectMetricsFlag` to `true` which enable the metrics collection.
+
+### IMPORTANT: Change your working directory to project root


same comment as https://github.com/ray-project/kuberay/pull/3564/files#r2094653342

dentiny

Thanks!

Signed-off-by: machichima <[email protected]>

kevin85421 · 2025-05-19T17:46:50Z

@rueian can you also take a look at this PR? Thanks!

rueian · 2025-05-19T19:01:37Z

apiserver/Autoscaling.md


-```shell
+```sh
+# Create compute tempalte


Suggested change

# Create compute tempalte

# Create compute template

rueian · 2025-05-19T19:01:47Z

apiserver/Autoscaling.md


-Once they are set up, you first need to create a Ray cluster using the following commands:
+Before running the example, you need to first deploy a RayCluster with following command.


Suggested change

Before running the example, you need to first deploy a RayCluster with following command.

Before running the example, you need to first deploy a RayCluster with the following command.

rueian · 2025-05-19T19:02:26Z

apiserver/Autoscaling.md

-test-cluster-head-pr25j             2/2     Running   0          2m49s
+### Validate that RayCluster is deployed correctly
+
+Run following command to get list of pods running. You should see something like below:


Suggested change

Run following command to get list of pods running. You should see something like below:

Run the following command to get a list of pods running. You should see something like below:

rueian · 2025-05-19T19:02:42Z

apiserver/Autoscaling.md


-Create a detached actor:
+Create a detached actor to trigger scale-up with following command:


Suggested change

Create a detached actor to trigger scale-up with following command:

Create a detached actor to trigger scale-up with the following command:

rueian · 2025-05-19T19:03:25Z

apiserver/Autoscaling.md

@@ -165,16 +144,28 @@ curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \
 }'
 ```

-A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s, we specified 30) seconds. Run:
+While actor is deleted, we do not need the worker anymore. The worker pod will be deleted


Suggested change

While actor is deleted, we do not need the worker anymore. The worker pod will be deleted

While the actor is deleted, we do not need the worker anymore. The worker pod will be deleted

rueian · 2025-05-19T19:12:12Z

apiserver/HACluster.md


-The following environment variable have to be added here:
+Run following command for creating a detached actor. Please change `ha-cluster-head` to


Suggested change

Run following command for creating a detached actor. Please change `ha-cluster-head` to

Run the following command for creating a detached actor. Please change `ha-cluster-head` to

rueian · 2025-05-19T19:12:30Z

apiserver/HACluster.md

 ```

-Once this is done, open Ray dashboard (using port-forward). In the cluster tab you should see 2 nodes and in the
-Actor's pane you should see created actor.
+Note that only head node will be recreated, while the worker node stays as is.


Suggested change

Note that only head node will be recreated, while the worker node stays as is.

Note that only the head node will be recreated, while the worker node stays as is.

rueian · 2025-05-19T19:13:33Z

apiserver/JobSubmission.md


-### Deploy Ray cluster
+We will use this [ConfigMap] which contains code for our example. Please download the
+config map and deploy it with following command:


Suggested change

config map and deploy it with following command:

config map and deploy it with the following command:

rueian · 2025-05-19T19:13:58Z

apiserver/JobSubmission.md

 ```

-Note that this cluster is mounting a volume from a configmap. This config map should be created
-prior to cluster creation using [this YAML].
+To check if the RayCluster setup correctly, list all pods with following command. You can


Suggested change

To check if the RayCluster setup correctly, list all pods with following command. You can

To check if the RayCluster setup correctly, list all pods with the following command. You can

rueian · 2025-05-19T19:15:37Z

apiserver/Monitoring.md

-[command](../install/prometheus/install.sh). The script additionally creates `ray-head-monitor` and
-`ray-workers-monitor` in the `prometheus-system` namespace, that we do not need. We can delete them using:
+> [!IMPORTANT]
+> All the following guidance require you to switch your working directory to the KubeRay project root


Suggested change

> All the following guidance require you to switch your working directory to the KubeRay project root

> All the following guidance requires you to switch your working directory to the KubeRay project root

machichima added 3 commits May 8, 2025 22:10

docs: update apiserver autoscaling docs with example

f6ddc5a

-e Signed-off-by: machichima <[email protected]>

docs: update apiserver create serve docs

f1462dc

-e Signed-off-by: machichima <[email protected]>

docs: update apiserver HACluster docs

3d6d9e1

-e Signed-off-by: machichima <[email protected]>

machichima force-pushed the apiserver-improve-docs-readability branch from 9832ea2 to 3d6d9e1 Compare May 9, 2025 15:26

machichima added 6 commits May 10, 2025 11:09

docs: update apiserver jobsubmission docs

86e7b96

-e Signed-off-by: machichima <[email protected]>

docs: update Monitoring docs

2656969

-e Signed-off-by: machichima <[email protected]>

docs: update securing implementation docs

7273564

-e Signed-off-by: machichima <[email protected]>

docs: update README

3cb8ed5

-e Signed-off-by: machichima <[email protected]>

docs: add cleanup step

7c55999

-e Signed-off-by: machichima <[email protected]>

docs: fix typo

93bc4f4

-e Signed-off-by: machichima <[email protected]>

machichima marked this pull request as ready for review May 10, 2025 05:41

dentiny added apiserver docs Improvements or additions to documentation labels May 10, 2025

dentiny self-assigned this May 10, 2025

dentiny reviewed May 10, 2025

View reviewed changes

apiserver/Autoscaling.md Show resolved Hide resolved

apiserver/Autoscaling.md Show resolved Hide resolved

apiserver/Autoscaling.md Show resolved Hide resolved

apiserver/Autoscaling.md Show resolved Hide resolved

apiserver/Autoscaling.md Show resolved Hide resolved

dentiny reviewed May 11, 2025

View reviewed changes

machichima added 2 commits May 11, 2025 21:38

docs: update unclear description

ac81be2

Signed-off-by: machichima <[email protected]>

docs: add description & fix format

f582078

Signed-off-by: machichima <[email protected]>

dentiny reviewed May 12, 2025

View reviewed changes

docs: move curl json to independent files

a9fb8d4

Signed-off-by: machichima <[email protected]>

dentiny mentioned this pull request May 15, 2025

[DOCS] KubeRay APIServer V2 document #3594

Open

4 tasks

machichima requested a review from dentiny May 16, 2025 12:15

dentiny reviewed May 18, 2025

View reviewed changes

dentiny approved these changes May 18, 2025

View reviewed changes

dentiny requested review from rueian and kevin85421 May 18, 2025 23:49

dentiny assigned rueian and kevin85421 May 18, 2025

docs: fix based on comments

6182e4f

Signed-off-by: machichima <[email protected]>

rueian reviewed May 19, 2025

View reviewed changes

	single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
	single head node failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,


		Once they are set up, you first need to create a Ray cluster using the following commands:
		Before running the example, you need to first deploy a RayCluster with following command.

	Run following command to get list of pods running. You should see something like below:
	Run the following command to get a list of pods running. You should see something like below:


		Create a detached actor:
		Create a detached actor to trigger scale-up with following command:

	While actor is deleted, we do not need the worker anymore. The worker pod will be deleted
	While the actor is deleted, we do not need the worker anymore. The worker pod will be deleted


		The following environment variable have to be added here:
		Run following command for creating a detached actor. Please change `ha-cluster-head` to

	Run following command for creating a detached actor. Please change `ha-cluster-head` to
	Run the following command for creating a detached actor. Please change `ha-cluster-head` to

	Note that only head node will be recreated, while the worker node stays as is.
	Note that only the head node will be recreated, while the worker node stays as is.

	config map and deploy it with following command:
	config map and deploy it with the following command:

	To check if the RayCluster setup correctly, list all pods with following command. You can
	To check if the RayCluster setup correctly, list all pods with the following command. You can

	> All the following guidance require you to switch your working directory to the KubeRay project root
	> All the following guidance requires you to switch your working directory to the KubeRay project root

[DOCS] Apiserver improve docs readability #3564

Are you sure you want to change the base?

[DOCS] Apiserver improve docs readability #3564

Conversation

machichima commented May 8, 2025 • edited Loading

Why are these changes needed?

Related issue number

Checks

machichima commented May 10, 2025

dentiny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

machichima May 11, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

machichima May 19, 2025 • edited Loading

Choose a reason for hiding this comment

dentiny May 18, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dentiny left a comment

Choose a reason for hiding this comment

kevin85421 commented May 19, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

machichima commented May 8, 2025 •

edited

Loading

machichima May 11, 2025 •

edited

Loading

machichima May 19, 2025 •

edited

Loading

dentiny May 18, 2025 •

edited

Loading