Merge pull request #839 from run-ai/v2.18-RUN-17656-Workload-Policy-Yaml-Documentation

V2.18 run 17656 workload policy yaml documentation
run-ai · Jul 4, 2024 · 3303f9f
2 parents 1420f44 + c65453e
commit 3303f9f
Showing 14 changed files with 324 additions and 30 deletions.
diff --git a/docs/Researcher/Walkthroughs/walkthrough-build-ports.md b/docs/Researcher/Walkthroughs/walkthrough-build-ports.md
@@ -26,23 +26,21 @@
 
 ``` bash
 runai config project team-a
-runai submit nginx-test -i zembutsu/docker-sample-nginx --interactive \
-  --service-type portforward --port 8080:80 
+runai submit nginx-test -i zembutsu/docker-sample-nginx --interactive
+runai port-forward nginx-test --port 8080:80
 ```
 
 *   The Job is based on a sample _NGINX_ webserver docker image `zembutsu/docker-sample-nginx`. Once accessed via a browser, the page shows the container name. 
 *   Note the _interactive_ flag which means the Job will not have a start or end. It is the Researcher's responsibility to close the Job.  
-*   In this example, we have chosen the simplest scheme to expose ports which is port forwarding. We temporarily expose port 8080 to localhost as long as the `runai submit` command is not stopped
+*   In this example, we have chosen the simplest scheme to expose ports which is port forwarding. We temporarily expose port 8080 to localhost as long as the `runai port-forward` command is not stopped
 *   It is possible to forward traffic from multiple IP addresses by using the "--address" parameter. Check the CLI reference for further details. 
 
 The result will be:
 
 ``` bash
 The job 'nginx-test-0' has been submitted successfully
 You can run `runai describe job nginx-test-0 -p team-a` to check the job status
-Waiting for pod to start running...
-INFO[0023] Job started
-Open access point(s) to service from localhost:8080
+
 Forwarding from 127.0.0.1:8080 -> 80
 Forwarding from [::1]:8080 -> 80
 ```

diff --git a/docs/Researcher/cli-reference/runai-submit-dist-TF.md b/docs/Researcher/cli-reference/runai-submit-dist-TF.md
@@ -100,6 +100,30 @@ runai submit-dist tf --name distributed-job --workers=2 -g 1 \
 
 > Set labels variables in the container.
 
+#### --master-args `<string>`
+
+>  Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
+
+#### --master-environment `<stringArray>`
+
+>  Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master pod, use the format: `name=-`.
+
+#### --master-extended-resource `<stringArray>`
+
+>  Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
+
+#### --master-gpu `<float>`
+
+>  GPU units to allocate for the master pod.
+
+#### --master-no-pvcs
+
+>  Do not mount any persistent volumes in the master pod.
+
+#### --no-master
+
+>  Do not create a separate pod for the master.
+
 #### --preferred-pod-topology-key `<string>`
 
 > If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

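As an illustrative sketch of how the master-pod flags added above might be combined, the following script only prints a candidate `runai submit-dist tf` command instead of running it. The image placeholder, the `ROLE` and `WORKER_ONLY_VAR` variable names, and the `0.5` GPU value are assumptions for illustration, as are the use of `-i` with `submit-dist tf` and the repetition of `--master-environment` for multiple values.

```bash
#!/bin/sh
# Sketch only: prints the command for review; it does not call the runai CLI.
# IMAGE is a placeholder, and ROLE / WORKER_ONLY_VAR are hypothetical variable names.
IMAGE="<your-image>"

build_submit_cmd() {
  # printf reuses the '%s ' format for every argument, joining them with spaces.
  printf '%s ' \
    runai submit-dist tf --name distributed-job --workers=2 -g 1 \
    -i "$IMAGE" \
    --master-gpu 0.5 \
    --master-environment ROLE=master \
    --master-environment WORKER_ONLY_VAR=- \
    --master-no-pvcs
  printf '\n'
}

build_submit_cmd
```

Here `WORKER_ONLY_VAR=-` uses the documented `name=-` format to keep a worker variable out of the master pod. Check the printed command against the CLI reference before running it.
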
diff --git a/docs/Researcher/cli-reference/runai-submit-dist-mpi.md b/docs/Researcher/cli-reference/runai-submit-dist-mpi.md
@@ -103,6 +103,26 @@ You can start an unattended mpi training Job of name dist1, based on Project *te
 
 > Set labels variables in the container.
 
+#### --master-args `<string>`
+
+>  Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
+
+#### --master-environment `<stringArray>`
+
+>  Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master pod, use the format: `name=-`.
+
+#### --master-extended-resource `<stringArray>`
+
+>  Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
+
+#### --master-gpu `<float>`
+
+>  GPU units to allocate for the master pod.
+
+#### --master-no-pvcs
+
+>  Do not mount any persistent volumes in the master pod.
+
 #### --preferred-pod-topology-key `<string>`
 
 > If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

diff --git a/docs/Researcher/cli-reference/runai-submit-dist-pytorch.md b/docs/Researcher/cli-reference/runai-submit-dist-pytorch.md
@@ -91,7 +91,7 @@ runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \
   
 #### --image `<string>` | -i `<string>`
 
->  Image to use when creating the container for this Job
+>  Image to use when creating the container for this Job.
 
 #### --image-pull-policy `<string>`
 
@@ -107,6 +107,30 @@ runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \
 
 > Set labels variables in the container.
 
+#### --master-args `<string>`
+
+>  Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
+
+#### --master-environment `<stringArray>`
+
+>  Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master pod, use the format: `name=-`.
+
+#### --master-extended-resource `<stringArray>`
+
+>  Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
+
+#### --master-gpu `<float>`
+
+>  GPU units to allocate for the master pod.
+
+#### --master-no-pvcs
+
+>  Do not mount any persistent volumes in the master pod.
+
+#### --no-master
+
+>  Do not create a separate pod for the master.
+
 #### --preferred-pod-topology-key `<string>`
 
 > If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.
@@ -361,9 +385,4 @@ runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \
 
 ## Output
 
-The command will attempt to submit an _mpi_ Job. You can follow up on the Job by running `runai list jobs` or `runai describe job <job-name>`.
-
-## See Also
-
-< please let me know if this is needed, or if additional documentation is needed in the link >
-*   See Quickstart document [Running Distributed Training](../Walkthroughs/walkthrough-distributed-training.md).
+The command will attempt to submit a _distributed pytorch_ workload. You can follow up on the workload by running `runai list jobs` or `runai describe job <job-name>`.
diff --git a/docs/Researcher/cli-reference/runai-submit-dist-xgboost.md b/docs/Researcher/cli-reference/runai-submit-dist-xgboost.md
@@ -95,6 +95,26 @@ runai submit-dist xgboost --name distributed-job --workers=2 -g 1 \
 
 > Set labels variables in the container.
 
+#### --master-args `<string>`
+
+>  Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
+
+#### --master-environment `<stringArray>`
+
+>  Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master pod, use the format: `name=-`.
+
+#### --master-extended-resource `<stringArray>`
+
+>  Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
+
+#### --master-gpu `<float>`
+
+>  GPU units to allocate for the master pod.
+
+#### --master-no-pvcs
+
+>  Do not mount any persistent volumes in the master pod.
+
 #### --preferred-pod-topology-key `<string>`
 
 > If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.

diff --git a/docs/admin/runai-setup/access-control/rbac.md b/docs/admin/runai-setup/access-control/rbac.md
@@ -60,7 +60,7 @@ RBAC uses [rules](#access-rules) to ensure that only authorized users or applica
 
 * Departments
 * Projects
-* Deployments
+* Inference
 * Workspaces
 * Environments
 * Quota management dashboard

diff --git a/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md b/docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
@@ -64,6 +64,8 @@ Following is a Kubernetes support matrix for the latest Run:ai releases:<a name=
 | Run:ai 2.17    | 1.27 through 1.29  | 4.12 through 4.15 |
 | Run:ai 2.18    | 1.28 through 1.30  | 4.12 through 4.15 |
 
+For information on supported versions of managed Kubernetes, consult the release notes provided by your Kubernetes service provider. These notes state the specific version of the underlying Kubernetes platform that the provider supports, which you can use to confirm compatibility with Run:ai.
+
 For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.
 
 !!! Note

diff --git a/docs/admin/runai-setup/self-hosted/k8s/backend.md b/docs/admin/runai-setup/self-hosted/k8s/backend.md
@@ -1,5 +1,5 @@
 
-# Install the Run:ai Control Plane 
+# Install the Run:ai Control Plane
 
 ## Prerequisites and preparations
 
@@ -16,7 +16,7 @@ Run the helm command below:
     helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane --version "~2.17.0" \
         --set global.domain=<DOMAIN>  # (1)
     ```
-    
+
     1. Domain name described [here](prerequisites.md#domain-name). 
 
     !!! Info
@@ -29,18 +29,17 @@ Run the helm command below:
         --set global.customCA.enabled=true \  # (3)
         -n runai-backend -f custom-env.yaml  # (4)
     ```
-       
+
     1. Replace `<VERSION>` with the Run:ai control plane version.
     2. Domain name described [here](prerequisites.md#domain-name). 
     3. See the Local Certificate Authority instructions below
     4. `custom-env.yaml` should have been created by the _prepare installation_ script in the previous section. 
 
 !!! Tip
-    Use the  `--dry-run` flag to gain an understanding of what is being installed before the actual installation. 
-
-
+    Use the  `--dry-run` flag to gain an understanding of what is being installed before the actual installation.
 
 ### Additional configurations (optional)
+
 There may be cases where you need to set additional properties as follows:
 
 |  Key     | Change   | Description |
@@ -62,8 +61,14 @@ There may be cases where you need to set additional properties as follows:
 | `grafana.dbPassword`  | Grafana database password | Password for the Grafana database user |
 | `grafana.adminUser`  | Grafana username  |   Override the Run:ai default user name for accessing Grafana |
 | `grafana.adminPassword`  | Grafana password  |   Override the Run:ai default password for accessing Grafana |
+| `grafana.dbUser`  | Grafana's username for PostgreSQL  |   Override the Run:ai default user name for Grafana to access Run:ai database (PostgreSQL) |
+| `grafana.dbPassword`  | Grafana's password for PostgreSQL |   Override the Run:ai default password for Grafana to access Run:ai database (PostgreSQL) |
+| `grafana.grafana.ini.database.user`  | Reference to Grafana's username for PostgreSQL  |  Don't override this value |
+| `grafana.grafana.ini.database.password`  | Reference to Grafana's password for PostgreSQL |   Don't override this value |
+| `tenantsManager.config.adminUsername`  | Run:ai first admin username |   Override the default user name of the first admin user created with Run:ai |
+| `tenantsManager.config.adminPassword`  | Run:ai first admin user's password |   Override the default password of the first admin user created with Run:ai |
 | `thanos.receive.persistence.storageClass` and `postgresql.primary.persistence.storageClass` | Storage class | The installation to work with a specific storage class rather than the default one |
-| `<component>` <br> &ensp;`resources:` <br> &emsp; `limits:` <br> &emsp; &ensp; `cpu: 500m` <br> &emsp; &ensp; `memory: 512Mi` <br> &emsp; `requests:` <br> &emsp; &ensp; `cpu: 250m` <br> &emsp; &ensp; `memory: 256Mi`  | Pod request and limits  |  `<component>` may be any one of the following: `backend`, `frontend`, `assetsService`, `identityManager`, `tenantsManager`, `keycloakx`, `grafana`, `authorization`, `orgUnitService`, `policyService`  |   
+| `<component>` <br> &ensp;`resources:` <br> &emsp; `limits:` <br> &emsp; &ensp; `cpu: 500m` <br> &emsp; &ensp; `memory: 512Mi` <br> &emsp; `requests:` <br> &emsp; &ensp; `cpu: 250m` <br> &emsp; &ensp; `memory: 256Mi`  | Pod request and limits  |  `<component>` may be any one of the following: `backend`, `frontend`, `assetsService`, `identityManager`, `tenantsManager`, `keycloakx`, `grafana`, `authorization`, `orgUnitService`, `policyService`  |
 |<div style="width:200px"></div>| | |
 
 Use the `--set` syntax in the helm command above.  
@@ -80,24 +85,38 @@ If you have opted to connect to an [external PostgreSQL database](preperations.m
 * `grafana.dbUser`
 * `grafana.dbPassword`
 
+#### External PostgreSQL database
+
+If you have opted to connect to an [external PostgreSQL database](preperations.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:
+
+* `postgresql.enabled` - set to `false`
+* `global.postgresql.auth.password`
+* `global.postgresql.auth.username`
+* `global.postgresql.auth.host`
+* `global.postgresql.auth.port`
+* `grafana.dbUser`
+* `grafana.dbPassword`
+
+!!! Note
+    If you modify one of the usernames or passwords (KeyCloak, PostgreSQL, Grafana) after Run:ai is already installed, perform the following steps to apply the change:
+
+    1. Modify the username/password within the relevant component as well (KeyCloak, PostgreSQL, Grafana).
+    2. Run `helm upgrade` for Run:ai with the right values, and restart the relevant Run:ai pods so they can fetch the new username/password.
+
 ## Next Steps
 
 ### Connect to Run:ai User interface
 
-Go to: `runai.<domain>`. Log in using the default credentials: User: `[email protected]`, Password: `Abcd!234`. Go to the Users area and change the password. 
+Go to: `runai.<domain>`. Log in using the default credentials: User: `[email protected]`, Password: `Abcd!234`. Go to the Users area and change the password.
 
 ### Enable Forgot Password (optional)
 
 To support the *Forgot password* functionality, follow the steps below.
 
-* Go to `runai.<domain>/auth` and Log in. 
+* Go to `runai.<domain>/auth` and Log in.
 * Under `Realm settings`, select the `Login` tab and enable the `Forgot password` feature.
 * Under the `Email` tab, define an SMTP server, as explained [here](https://www.keycloak.org/docs/latest/server_admin/#_email){target=_blank}
 
-
 ### Install Run:ai Cluster
-Continue with installing a [Run:ai Cluster](cluster.md).
-
-
-
 
+Continue with installing a [Run:ai Cluster](cluster.md).
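
As a rough sketch of the external PostgreSQL settings listed in this file's diff, the script below prints a candidate `helm upgrade` command built from the documented `--set` keys, with `--dry-run` added as the Tip suggests. Every value is a placeholder assumption (host, port `5432`, user names), and the `PG_PASSWORD` and `GRAFANA_DB_PASSWORD` environment variable names are hypothetical; they are printed as unexpanded references so no secret appears in the output.

```bash
#!/bin/sh
# Sketch only: prints the command for review; it does not call helm.
# All values below are placeholders chosen for illustration.
DOMAIN="<DOMAIN>"
PG_HOST="<postgres-host>"
PG_PORT="5432"
PG_USER="<postgres-user>"
GRAFANA_DB_USER="<grafana-db-user>"

build_helm_cmd() {
  # Single-quoted '$PG_PASSWORD' style arguments are printed literally, not expanded.
  printf '%s ' \
    helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
    --version '"~2.17.0"' \
    --set "global.domain=$DOMAIN" \
    --set postgresql.enabled=false \
    --set "global.postgresql.auth.host=$PG_HOST" \
    --set "global.postgresql.auth.port=$PG_PORT" \
    --set "global.postgresql.auth.username=$PG_USER" \
    --set 'global.postgresql.auth.password=$PG_PASSWORD' \
    --set "grafana.dbUser=$GRAFANA_DB_USER" \
    --set 'grafana.dbPassword=$GRAFANA_DB_PASSWORD' \
    --dry-run
  printf '\n'
}

build_helm_cmd
```

Once the printed command looks right, it can be run without `--dry-run` and with the password variables exported in the shell.
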
diff --git a/docs/admin/runai-setup/self-hosted/k8s/preparations.md b/docs/admin/runai-setup/self-hosted/k8s/preparations.md
@@ -48,7 +48,7 @@ Follow the prerequisites as explained in [Self-Hosted installation over Kubernet
     Run the following script (you must have dockerd installed and at least 20GB of free disk space to run): 
 
     ```  
-    ./setup.sh
+    sudo -E ./prepare_installation.sh
     ```
 
     If Docker is configured to [run as non-root](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user){target=_blank} then `sudo` is not required.