[DOCS] Apiserver improve docs readability #3564
base: master
Conversation
Signed-off-by: machichima <[email protected]>
9832ea2 to 3d6d9e1
Signed-off-by: machichima <[email protected]>
Hi @dentiny, would you mind taking a look at this?
Thanks for the effort, looks really great!
Just finished the first iteration; will take a deeper look later tonight.
Left some comments and questions for discussion :)
apiserver/CreatingServe.md
Outdated
test-cluster-head-svc ClusterIP 10.96.19.185 <none> 8265/TCP,52365/TCP,10001/TCP,8080/TCP,6379/TCP,8000/TCP
test-cluster-serve-svc ClusterIP 10.96.144.162 <none> 8000/TCP
```
Note that the 52365 port for head node service is for serve configuration.
Do you mind explaining a bit?
Sorry, I think this was not clear. I updated it to this:
Note that we set the 52365 port for the dashboard agent in the above curl command, which is used internally by Ray Serve.
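For readers skimming the thread, a quick way to see what that port actually serves (a minimal sketch; the service name comes from the listing above, and the `/api/serve/applications/` path assumes Ray's Serve REST API exposed by the dashboard agent):

```sh
# Forward the dashboard agent port of the head service to localhost
kubectl port-forward svc/test-cluster-head-svc 52365:52365 &

# Query the Serve REST API on the dashboard agent port; it returns the
# currently deployed Serve applications as JSON
curl http://localhost:52365/api/serve/applications/
```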
ensure a high availability Global Control Service (GCS) data. The GCS manages
cluster-level metadata by storing all data in memory, which is lack of fault tolerance. A
single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
we should have a highly available Redis so that when GCS restart, it can resume its
Just want to make sure ray's HA mode relies on redis, but not necessarily HA redis, right?
Yes!
I prefer using the word "redis" instead of "HA redis" here; these are two different products from cloud vendor offerings.
Redis itself doesn't provide persistence by default either.
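If persistence is wanted, it has to be switched on in Redis explicitly (for example AOF). A minimal sketch, assuming a Redis pod named `redis-0` in the `redis` namespace (both names are assumptions for illustration):

```sh
# Turn on append-only-file persistence on the running Redis instance
kubectl exec -it redis-0 -n redis -- redis-cli CONFIG SET appendonly yes

# Confirm the setting took effect
kubectl exec -it redis-0 -n redis -- redis-cli CONFIG GET appendonly
```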
"RAY_gcs_rpc_server_reconnect_timeout_s": "300" | ||
} | ||
}, | ||
kubectl exec -it ha-cluster-head -- python3 /home/ray/samples/detached_actor.py |
Local script?
This script, detached_actor.py, is defined in a ConfigMap and mounted into the head node, not taken from a local path. Let me add some description here.
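For reference, one way to confirm the script really comes from the mounted ConfigMap rather than a local file (a sketch; the pod name and mount path follow the command quoted above):

```sh
# List the sample scripts mounted into the head pod
kubectl exec -it ha-cluster-head -- ls /home/ray/samples/

# Show which volumes (including the ConfigMap-backed one) the pod mounts
kubectl get pod ha-cluster-head -o jsonpath='{.spec.volumes}'
```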
apiserver/JobSubmission.md
Outdated
@@ -186,8 +204,7 @@ This should return the list of the submissions, that looks as follows:
```json
{
"submissions":[
{
"entrypoint":"python /home/ray/samples/sample_code.py",
{ "entrypoint":"python /home/ray/samples/sample_code.py",
why this change?
Sorry, I copied the result from my terminal and it did not format well. Just fixed it.
apiserver/Monitoring.md
Outdated
## Monitoring of the API server
In order to ensure a proper functioning of the API server and created RayClusters, it is
typically necessary to monitor them. This document describes how to monitor both API
server and created clusters with Prometheus and Grafana
nit: add a "." at the end of the sentence
Signed-off-by: machichima <[email protected]>
@@ -17,7 +17,7 @@ You could build and start apiserver from scratch on your local environment in on
make start-local-apiserver
```

apiserver supports HTTP request, so you could easily check whether it's started successfully by issuing two simple curl requests.
Apiserver supports HTTP request, so you could easily check whether it's started successfully by issuing two simple curl requests.
L4, I don't think we're going to provide grpc API in V2
I think V2 is in apiserversdk/ and the one here is V1? Should we also take gRPC out from here?
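For context, the two curl checks that sentence refers to presumably look something like this (a sketch, assuming the V1 HTTP API is exposed on `localhost:31888` as in the other examples in this PR; the exact endpoints may differ):

```sh
# A 200 response with a (possibly empty) JSON list means the apiserver
# is up and serving the V1 HTTP API
curl localhost:31888/apis/v1/namespaces/default/compute_templates
curl localhost:31888/apis/v1/namespaces/default/clusters
```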
Signed-off-by: machichima <[email protected]>
apiserver/Autoscaling.md
Outdated
```

Alternatively, you could build and deploy the Operator and API server from local repo for
development purpose.
### IMPORTANT: Change your working directory to `apiserver/`
If you think it's something worth noticing, do you think it would be better to use:
> [!IMPORTANT]
> Change your working directory to `apiserver/`; all the following guidance requires you to switch your working directory to the KubeRay `apiserver` directory.
apiserver/CreatingServe.md
Outdated
Use following command to create a compute template and a RayCluster with RayService support:

```sh
cur
what is `cur`?
Sorry, this was added by accident.
apiserver/HACluster.md
Outdated
The RayCluster with high availability can also be created in API server, which aims to
ensure a high availability Global Control Service (GCS) data. The GCS manages
cluster-level metadata by storing all data in memory, which is lack of fault tolerance. A
single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
single failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
single head node failure can cause the entire RayCluster to fail. To enable GCS's fault tolerance,
```json
"environment": {
"values": {
"RAY_REDIS_ADDRESS": "redis.redis.svc.cluster.local:6379"
just curious (I never used HA ray before), why `redis.redis.` instead of `redis.`?
I think the first `redis` is the service name, while the second is the namespace (which is also called `redis`).
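That matches the standard Kubernetes Service DNS form `<service>.<namespace>.svc.cluster.local`. A quick sketch to verify, assuming a Service named `redis` in the `redis` namespace:

```sh
# Confirm the Service exists in the expected namespace
kubectl get svc redis -n redis

# Resolve the full DNS name from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox -- \
  nslookup redis.redis.svc.cluster.local
```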
apiserver/JobSubmission.md
Outdated
@@ -134,7 +101,7 @@ This should return JSON similar to the one below
{
"entrypoint":"python /home/ray/samples/sample_code.py",
"jobId":"02000000",
"submissionId":"raysubmit_KWZLwme56esG3Wcr",
"submissionId":"<submissionID>",
I think leaving a submission id here is not bad, because users could easily see what a sub-id looks like; it's not security-sensitive information either.
No strong opinion though.
No problem!
apiserver/Monitoring.md
Outdated
Refer to [README](README.md) for setting up KubeRay operator and API server. This will
set the flag `collectMetricsFlag` to `true` which enable the metrics collection.

### IMPORTANT: Change your working directory to project root
Thanks!
Signed-off-by: machichima <[email protected]>
@rueian can you also take a look at this PR? Thanks!
```shell
```sh
# Create compute tempalte
# Create compute tempalte
# Create compute template
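For anyone following along, the compute template creation that comment sits in looks roughly like this against the V1 HTTP API (a sketch; the payload field values are illustrative assumptions, not the exact ones from the doc):

```sh
# Create compute template (illustrative values)
curl -X POST 'localhost:31888/apis/v1/namespaces/default/compute_templates' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "default-template",
    "namespace": "default",
    "cpu": 2,
    "memory": 4
  }'
```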

Once they are set up, you first need to create a Ray cluster using the following commands:
Before running the example, you need to first deploy a RayCluster with following command.
Before running the example, you need to first deploy a RayCluster with following command.
Before running the example, you need to first deploy a RayCluster with the following command.
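And the RayCluster deployment itself would be a POST to the clusters endpoint, roughly like the sketch below (field values such as the image tag and template name are illustrative assumptions; take the real payload from the doc being edited):

```sh
# Deploy a RayCluster through the apiserver (illustrative payload)
curl -X POST 'localhost:31888/apis/v1/namespaces/default/clusters' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "test-cluster",
    "namespace": "default",
    "user": "kuberay",
    "clusterSpec": {
      "headGroupSpec": {
        "computeTemplate": "default-template",
        "image": "rayproject/ray:2.9.0",
        "rayStartParams": {"dashboard-host": "0.0.0.0"}
      },
      "workerGroupSpec": [{
        "groupName": "small-wg",
        "computeTemplate": "default-template",
        "image": "rayproject/ray:2.9.0",
        "replicas": 1,
        "minReplicas": 1,
        "maxReplicas": 1,
        "rayStartParams": {"node-ip-address": "$MY_POD_IP"}
      }]
    }
  }'
```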
test-cluster-head-pr25j 2/2 Running 0 2m49s
### Validate that RayCluster is deployed correctly

Run following command to get list of pods running. You should see something like below:
Run following command to get list of pods running. You should see something like below:
Run the following command to get a list of pods running. You should see something like below:

Create a detached actor:
Create a detached actor to trigger scale-up with following command:
Create a detached actor to trigger scale-up with following command:
Create a detached actor to trigger scale-up with the following command:
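Concretely, triggering the scale-up and watching it happen might look like this (a sketch; the head pod name is taken from the listing above, and the actor-name argument is an assumption about the sample script):

```sh
# Create a detached actor inside the head pod; the autoscaler should then
# bring up an extra worker to host it
kubectl exec -it test-cluster-head-pr25j -- \
  python3 /home/ray/samples/detached_actor.py actor1

# Watch a new worker pod get scheduled
kubectl get pods -w
```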
@@ -165,16 +144,28 @@ curl -X POST 'localhost:31888/apis/v1/namespaces/default/jobs' \
}'
```

A worker Pod will be deleted after `idleTimeoutSeconds` (default 60s, we specified 30) seconds. Run:
While actor is deleted, we do not need the worker anymore. The worker pod will be deleted
While actor is deleted, we do not need the worker anymore. The worker pod will be deleted
While the actor is deleted, we do not need the worker anymore. The worker pod will be deleted
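A sketch of how one might observe that scale-down (the termination script name and actor name are assumptions about what the sample ConfigMap ships; adjust to match):

```sh
# Remove the detached actor so its worker goes idle
kubectl exec -it test-cluster-head-pr25j -- \
  python3 /home/ray/samples/terminate_detached_actor.py actor1

# After roughly idleTimeoutSeconds (30s in this example) the idle worker
# pod should be scaled down and deleted
kubectl get pods -w
```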

The following environment variable have to be added here:
Run following command for creating a detached actor. Please change `ha-cluster-head` to
Run following command for creating a detached actor. Please change `ha-cluster-head` to
Run the following command for creating a detached actor. Please change `ha-cluster-head` to
```

Once this is done, open Ray dashboard (using port-forward). In the cluster tab you should see 2 nodes and in the
Actor's pane you should see created actor.
Note that only head node will be recreated, while the worker node stays as is.
Note that only head node will be recreated, while the worker node stays as is.
Note that only the head node will be recreated, while the worker node stays as is.
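One common way to exercise this is sketched below; `pkill gcs_server` is how the upstream GCS fault-tolerance guide simulates a head failure, and the pod name follows this example:

```sh
# Simulate a GCS failure by killing the gcs_server process on the head pod
kubectl exec -it ha-cluster-head -- pkill gcs_server

# Watch the head pod restart while the worker pod keeps running; after
# recovery the detached actor should still show up in the dashboard
kubectl get pods -w
```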

### Deploy Ray cluster
We will use this [ConfigMap] which contains code for our example. Please download the
config map and deploy it with following command:
config map and deploy it with following command:
config map and deploy it with the following command:
```

Note that this cluster is mounting a volume from a configmap. This config map should be created
prior to cluster creation using [this YAML].
To check if the RayCluster setup correctly, list all pods with following command. You can
To check if the RayCluster setup correctly, list all pods with following command. You can
To check if the RayCluster setup correctly, list all pods with the following command. You can
[command](../install/prometheus/install.sh). The script additionally creates `ray-head-monitor` and
`ray-workers-monitor` in the `prometheus-system` namespace, that we do not need. We can delete them using:
> [!IMPORTANT]
> All the following guidance require you to switch your working directory to the KubeRay project root
> All the following guidance require you to switch your working directory to the KubeRay project root
> All the following guidance requires you to switch your working directory to the KubeRay project root
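For reference, the deletion the quoted text leads into presumably looks like the sketch below (whether each object is a ServiceMonitor or a PodMonitor is an assumption based on the install script; check with `kubectl api-resources` if unsure):

```sh
# Remove the monitors created by the install script that this guide does not need
kubectl delete servicemonitor ray-head-monitor -n prometheus-system
kubectl delete podmonitor ray-workers-monitor -n prometheus-system
```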
Mainly update the following docs to make the examples easier to follow:
Add dangling docs to README's "Advanced Usage" section
Why are these changes needed?
Some of the documents for the apiserver are a bit hard to follow; we should improve their readability.
Related issue number
Checks