Commit

Merge pull request #839 from run-ai/v2.18-RUN-17656-Workload-Policy-Yaml-Documentation

V2.18 run 17656 workload policy yaml documentation
jasonnovichRunAI authored Jul 4, 2024
2 parents 1420f44 + c65453e commit 3303f9f
Showing 14 changed files with 324 additions and 30 deletions.
10 changes: 4 additions & 6 deletions docs/Researcher/Walkthroughs/walkthrough-build-ports.md
@@ -26,23 +26,21 @@

``` bash
runai config project team-a
runai submit nginx-test -i zembutsu/docker-sample-nginx --interactive \
--service-type portforward --port 8080:80
runai submit nginx-test -i zembutsu/docker-sample-nginx --interactive
runai port-forward nginx-test --port 8080:80
```

* The Job is based on a sample _NGINX_ webserver docker image `zembutsu/docker-sample-nginx`. Once accessed via a browser, the page shows the container name.
* Note the _interactive_ flag which means the Job will not have a start or end. It is the Researcher's responsibility to close the Job.
* In this example, we have chosen the simplest scheme to expose ports which is port forwarding. We temporarily expose port 8080 to localhost as long as the `runai submit` command is not stopped
* In this example, we have chosen the simplest scheme to expose ports, which is port forwarding. Port 8080 is temporarily exposed on localhost for as long as the `runai port-forward` command keeps running.
* It is possible to forward traffic from multiple IP addresses by using the "--address" parameter. Check the CLI reference for further details.

The result will be:

``` bash
The job 'nginx-test-0' has been submitted successfully
You can run `runai describe job nginx-test-0 -p team-a` to check the job status
Waiting for pod to start running...
INFO[0023] Job started
Open access point(s) to service from localhost:8080

Forwarding from 127.0.0.1:8080 -> 80
Forwarding from [::1]:8080 -> 80
```
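
Once forwarding is active, the NGINX page should be reachable on the local side of the `--port` mapping. The following is a minimal sketch for checking that from a second terminal, assuming `curl` is installed; `local_url` is a hypothetical helper written for this example and is not part of the `runai` CLI.

``` bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the runai CLI): derive the local URL
# from a LOCAL:REMOTE port mapping such as the 8080:80 used above.
local_url() {
  local mapping="$1"
  local local_port="${mapping%%:*}"   # text before the first colon
  echo "http://localhost:${local_port}"
}

url="$(local_url 8080:80)"
echo "Checking ${url}"
# Run this while `runai port-forward` is still active in another terminal.
# A failed request is reported but does not abort the script.
curl --silent --show-error --max-time 5 "$url" \
  || echo "Not reachable: is the port-forward still running?"
```

If the request fails, the most likely cause is that the `runai port-forward` command has been stopped, since the port is only exposed while it runs.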
24 changes: 24 additions & 0 deletions docs/Researcher/cli-reference/runai-submit-dist-TF.md
@@ -100,6 +100,30 @@ runai submit-dist tf --name distributed-job --workers=2 -g 1 \

> Set labels variables in the container.
#### --master-args `<string>`

> Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
#### --master-environment `<stringArray>`

> Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master, use the format: `name=-`.
#### --master-extended-resource `<stringArray>`

> Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
#### --master-gpu `<float>`

> GPU units to allocate for the master pod.
#### --master-no-pvcs

> Do not mount any persistent volumes in the master pod.
#### --no-master

> Do not create a separate pod for the master.
#### --preferred-pod-topology-key `<string>`

> If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.
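
The master-specific flags above can be combined in a single submission. The sketch below only assembles and prints such a command rather than running it. The image placeholder, the variable name `LOG_LEVEL`, and the use of `-i` and `-e` for the image and the worker environment are assumptions made for illustration; `build_submit_cmd` is a hypothetical helper.

``` bash
#!/usr/bin/env bash
# Sketch only: assembles a submit-dist command and prints it instead of running it.
# IMAGE and LOG_LEVEL are placeholders chosen for this example.
IMAGE="${IMAGE:-<image>}"

build_submit_cmd() {
  local cmd=(
    runai submit-dist tf --name distributed-job --workers=2 -g 1
    -i "$IMAGE"                        # assumed: image flag
    -e LOG_LEVEL=debug                 # assumed: environment variable for the workers
    --master-gpu 0.5                   # GPU units for the master pod (float)
    --master-environment LOG_LEVEL=-   # name=- keeps the worker variable out of the master
    --master-no-pvcs                   # no persistent volumes in the master pod
  )
  printf '%s\n' "${cmd[*]}"
}

build_submit_cmd
```

Printing the command first makes it easy to review the master and worker settings side by side before submitting.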
20 changes: 20 additions & 0 deletions docs/Researcher/cli-reference/runai-submit-dist-mpi.md
@@ -103,6 +103,26 @@ You can start an unattended mpi training Job of name dist1, based on Project *te

> Set labels variables in the container.
#### --master-args `<string>`

> Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
#### --master-environment `<stringArray>`

> Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master, use the format: `name=-`.
#### --master-extended-resource `<stringArray>`

> Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
#### --master-gpu `<float>`

> GPU units to allocate for the master pod.
#### --master-no-pvcs

> Do not mount any persistent volumes in the master pod.
#### --preferred-pod-topology-key `<string>`

> If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.
33 changes: 26 additions & 7 deletions docs/Researcher/cli-reference/runai-submit-dist-pytorch.md
@@ -91,7 +91,7 @@ runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \
#### --image `<string>` | -i `<string>`

> Image to use when creating the container for this Job
> Image to use when creating the container for this Job.
#### --image-pull-policy `<string>`

@@ -107,6 +107,30 @@ runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \

> Set labels variables in the container.
#### --master-args `<string>`

> Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
#### --master-environment `<stringArray>`

> Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master, use the format: `name=-`.
#### --master-extended-resource `<stringArray>`

> Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
#### --master-gpu `<float>`

> GPU units to allocate for the master pod.
#### --master-no-pvcs

> Do not mount any persistent volumes in the master pod.
#### --no-master

> Do not create a separate pod for the master.
#### --preferred-pod-topology-key `<string>`

> If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.
@@ -361,9 +385,4 @@ runai submit-dist pytorch --name distributed-job --workers=2 -g 1 \
## Output
The command will attempt to submit an _mpi_ Job. You can follow up on the Job by running `runai list jobs` or `runai describe job <job-name>`.
## See Also
< please let me know if this is needed, or if additional documentation is needed in the link >
* See Quickstart document [Running Distributed Training](../Walkthroughs/walkthrough-distributed-training.md).
The command will attempt to submit a _distributed pytorch_ workload. You can follow up on the workload by running `runai list jobs` or `runai describe job <job-name>`.
20 changes: 20 additions & 0 deletions docs/Researcher/cli-reference/runai-submit-dist-xgboost.md
@@ -95,6 +95,26 @@ runai submit-dist xgboost --name distributed-job --workers=2 -g 1 \

> Set labels variables in the container.
#### --master-args `<string>`

> Arguments to pass to the master pod container command. If used together with `--command`, overrides the image's entrypoint of the master pod container with the given command.
#### --master-environment `<stringArray>`

> Set environment variables in the master pod container. To prevent a worker environment variable from being set in the master, use the format: `name=-`.
#### --master-extended-resource `<stringArray>`

> Request access to an extended resource in the master pod. Use the format: `resource_name=quantity`.
#### --master-gpu `<float>`

> GPU units to allocate for the master pod.
#### --master-no-pvcs

> Do not mount any persistent volumes in the master pod.
#### --preferred-pod-topology-key `<string>`

> If possible, all pods of this job will be scheduled onto nodes that have a label with this key and identical values.
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/access-control/rbac.md
@@ -60,7 +60,7 @@ RBAC uses [rules](#access-rules) to ensure that only authorized users or applica

* Departments
* Projects
* Deployments
* Inference
* Workspaces
* Environments
* Quota management dashboard
2 changes: 2 additions & 0 deletions docs/admin/runai-setup/cluster-setup/cluster-prerequisites.md
@@ -64,6 +64,8 @@ Following is a Kubernetes support matrix for the latest Run:ai releases:<a name=
| Run:ai 2.17 | 1.27 through 1.29 | 4.12 through 4.15 |
| Run:ai 2.18 | 1.28 through 1.30 | 4.12 through 4.15 |

For supported versions of managed Kubernetes, consult the release notes of your Kubernetes service provider. These notes state the version of the underlying Kubernetes platform, which you can check against the matrix above to confirm compatibility with Run:ai.

For an up-to-date end-of-life statement of Kubernetes see [Kubernetes Release History](https://kubernetes.io/releases/){target=_blank}.
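
The range check against the matrix can be scripted. The following is a minimal sketch, assuming the version string has the usual `v1.29.3` shape (managed offerings often append a provider suffix); `supports_runai_2_18` is a hypothetical helper that hard-codes the 1.28 through 1.30 range listed for Run:ai 2.18.

``` bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds if a Kubernetes version string falls within
# the 1.28 through 1.30 range listed for Run:ai 2.18 in the matrix.
supports_runai_2_18() {
  local version="${1#v}"          # drop a leading "v" (v1.29.3 -> 1.29.3)
  local major="${version%%.*}"    # text before the first dot
  local rest="${version#*.}"      # text after the first dot
  local minor="${rest%%.*}"       # minor version, possibly with a suffix
  minor="${minor%%[!0-9]*}"       # strip any non-numeric suffix such as "+"
  [ "$major" = "1" ] && [ -n "$minor" ] && [ "$minor" -ge 28 ] && [ "$minor" -le 30 ]
}

for v in v1.27.9 v1.29.3 1.30 v1.31.0; do
  if supports_runai_2_18 "$v"; then
    echo "$v: within the listed range"
  else
    echo "$v: outside the listed range"
  fi
done
```

The version string to test would typically come from the provider's release notes or from the cluster itself.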

!!! Note
47 changes: 33 additions & 14 deletions docs/admin/runai-setup/self-hosted/k8s/backend.md
@@ -1,5 +1,5 @@

# Install the Run:ai Control Plane
# Install the Run:ai Control Plane

## Prerequisites and preparations

@@ -16,7 +16,7 @@ Run the helm command below:
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane --version "~2.17.0" \
--set global.domain=<DOMAIN> # (1)
```

1. Domain name described [here](prerequisites.md#domain-name).

!!! Info
@@ -29,18 +29,17 @@ Run the helm command below:
--set global.customCA.enabled=true \ # (3)
-n runai-backend -f custom-env.yaml # (4)
```

1. Replace `<VERSION>` with the Run:ai control plane version.
2. Domain name described [here](prerequisites.md#domain-name).
3. See the Local Certificate Authority instructions below
4. `custom-env.yaml` should have been created by the _prepare installation_ script in the previous section.

!!! Tip
Use the `--dry-run` flag to gain an understanding of what is being installed before the actual installation.


Use the `--dry-run` flag to gain an understanding of what is being installed before the actual installation.

### Additional configurations (optional)

There may be cases where you need to set additional properties as follows:

| Key | Change | Description |
@@ -62,8 +61,14 @@ There may be cases where you need to set additional properties as follows:
| `grafana.dbPassword` | Grafana database password | Password for the Grafana database user |
| `grafana.adminUser` | Grafana username | Override the Run:ai default user name for accessing Grafana |
| `grafana.adminPassword` | Grafana password | Override the Run:ai default password for accessing Grafana |
| `grafana.dbUser` | Grafana's username for PostgreSQL | Override the Run:ai default user name for Grafana to access Run:ai database (PostgreSQL) |
| `grafana.dbPassword` | Grafana's password for PostgreSQL | Override the Run:ai default password for Grafana to access Run:ai database (PostgreSQL) |
| `grafana.grafana.ini.database.user` | Reference to Grafana's username for PostgreSQL | Don't override this value |
| `grafana.grafana.ini.database.password` | Reference to Grafana's password for PostgreSQL | Don't override this value |
| `tenantsManager.config.adminUsername` | Run:ai first admin username | Override the default user name of the first admin user created with Run:ai |
| `tenantsManager.config.adminPassword` | Run:ai first admin user's password | Override the default password of the first admin user created with Run:ai |
| `thanos.receive.persistence.storageClass` and `postgresql.primary.persistence.storageClass` | Storage class | Set the installation to work with a specific storage class rather than the default one |
| `<component>` <br> &ensp;`resources:` <br> &emsp; `limits:` <br> &emsp; &ensp; `cpu: 500m` <br> &emsp; &ensp; `memory: 512Mi` <br> &emsp; `requests:` <br> &emsp; &ensp; `cpu: 250m` <br> &emsp; &ensp; `memory: 256Mi` | Pod request and limits | `<component>` may be anyone of the following: `backend`, `frontend`, `assetsService`, `identityManager`, `tenantsManager`, `keycloakx`, `grafana`, `authorization`, `orgUnitService`,`policyService` |
| `<component>` <br> &ensp;`resources:` <br> &emsp; `limits:` <br> &emsp; &ensp; `cpu: 500m` <br> &emsp; &ensp; `memory: 512Mi` <br> &emsp; `requests:` <br> &emsp; &ensp; `cpu: 250m` <br> &emsp; &ensp; `memory: 256Mi` | Pod request and limits | `<component>` may be any one of the following: `backend`, `frontend`, `assetsService`, `identityManager`, `tenantsManager`, `keycloakx`, `grafana`, `authorization`, `orgUnitService`, `policyService` |
|<div style="width:200px"></div>| | |

Use the `--set` syntax in the helm command above.
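
As an illustration of the `--set` syntax, the sketch below appends two keys from the table to the helm command shown earlier. It only prints the resulting command. The values `<DOMAIN>`, `<ADMIN_USER>` and `<STORAGE_CLASS>` are placeholders, and `build_helm_cmd` is a hypothetical helper written for this example.

``` bash
#!/usr/bin/env bash
# Sketch only: prints the helm command instead of running it.
# DOMAIN, ADMIN_USER and STORAGE_CLASS are placeholder values.
DOMAIN="${DOMAIN:-<DOMAIN>}"
ADMIN_USER="${ADMIN_USER:-<ADMIN_USER>}"
STORAGE_CLASS="${STORAGE_CLASS:-<STORAGE_CLASS>}"

build_helm_cmd() {
  local cmd=(
    helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane
    --version "~2.17.0"                # version as written in the command above
    --set "global.domain=${DOMAIN}"
    --set "tenantsManager.config.adminUsername=${ADMIN_USER}"
    --set "postgresql.primary.persistence.storageClass=${STORAGE_CLASS}"
    --set "thanos.receive.persistence.storageClass=${STORAGE_CLASS}"
  )
  printf '%s\n' "${cmd[*]}"
}

build_helm_cmd
```

Each additional property from the table becomes one more `--set key=value` pair on the same command.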
@@ -80,24 +85,38 @@ If you have opted to connect to an [external PostgreSQL database](preperations.m
* `grafana.dbUser`
* `grafana.dbPassword`

#### External PostgreSQL database

If you have opted to connect to an [external PostgreSQL database](preperations.md#external-postgres-database-optional), refer to the additional configurations table below. Adjust the following parameters based on your connection details:

* `postgresql.enabled` - set to `false`
* `global.postgresql.auth.password`
* `global.postgresql.auth.username`
* `global.postgresql.auth.host`
* `global.postgresql.auth.port`
* `grafana.dbUser`
* `grafana.dbPassword`

!!! Note
If you modify one of the usernames or passwords (KeyCloak, PostgreSQL, Grafana) after Run:ai is already installed, perform the following steps to apply the change:

1. Modify the username/password within the relevant component as well (KeyCloak, PostgreSQL, Grafana).
2. Run `helm upgrade` for Run:ai with the right values, and restart the relevant Run:ai pods so they can fetch the new username/password.

## Next Steps

### Connect to Run:ai User interface

Go to: `runai.<domain>`. Log in using the default credentials: User: `test@run.ai`, Password: `Abcd!234`. Go to the Users area and change the password.
Go to: `runai.<domain>`. Log in using the default credentials: User: `test@run.ai`, Password: `Abcd!234`. Go to the Users area and change the password.

### Enable Forgot Password (optional)

To support the *Forgot password* functionality, follow the steps below.

* Go to `runai.<domain>/auth` and log in.
* Go to `runai.<domain>/auth` and log in.
* Under `Realm settings`, select the `Login` tab and enable the `Forgot password` feature.
* Under the `Email` tab, define an SMTP server, as explained [here](https://www.keycloak.org/docs/latest/server_admin/#_email){target=_blank}


### Install Run:ai Cluster
Continue with installing a [Run:ai Cluster](cluster.md).




Continue with installing a [Run:ai Cluster](cluster.md).
2 changes: 1 addition & 1 deletion docs/admin/runai-setup/self-hosted/k8s/preparations.md
@@ -48,7 +48,7 @@ Follow the prerequisites as explained in [Self-Hosted installation over Kubernet
Run the following script (you must have `dockerd` installed and at least 20GB of free disk space to run it):

```
./setup.sh
sudo -E ./prepare_installation.sh
```

If Docker is configured to [run as non-root](https://docs.docker.com/engine/install/linux-postinstall/#manage-docker-as-a-non-root-user){target=_blank} then `sudo` is not required.
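
Both requirements can be checked before running the script. The following is a hypothetical pre-flight sketch and is not part of the Run:ai scripts; it assumes the free space is needed on the filesystem of the current directory, which the text does not specify.

``` bash
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of the Run:ai scripts.
# Succeeds if the filesystem holding PATH has at least MIN_GB gigabytes free.
has_free_gb() {
  local path="$1" min_gb="$2"
  local avail_kb
  # -P gives POSIX single-line output, -k reports 1 KiB blocks;
  # column 4 of the second line is the available space.
  avail_kb="$(df -Pk "$path" | awk 'NR==2 {print $4}')"
  [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

command -v dockerd >/dev/null || echo "dockerd not found on PATH"
# Assumption: the space is needed where the script is run from.
has_free_gb . 20 || echo "less than 20GB free in $(pwd)"
```

If either message is printed, resolve it before running `prepare_installation.sh`.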