Skip to content

Commit

Permalink
Release v1.0.0 (#287)
Browse files Browse the repository at this point in the history
  • Loading branch information
stevenBorisko authored Aug 18, 2023
2 parents 813a8db + 9ce1f70 commit c60acf8
Show file tree
Hide file tree
Showing 211 changed files with 2,149 additions and 4,317 deletions.
2 changes: 1 addition & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
/*.list
**/.terraform*
**/terraform.tf*
!samples/**/terraform.tfvars
!**/samples/**/terraform.tfvars
**/tfplan
**/backend.tf
6 changes: 3 additions & 3 deletions Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@ RUN curl -s "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terra
&& unzip -uq ./terraform.zip \
&& rm -f ./terraform.zip \
&& mv ./terraform /root/.local/bin/terraform
COPY terraform ./terraform
COPY ./a3/terraform ./a3/terraform


FROM base as test
Expand All @@ -31,7 +31,7 @@ ENTRYPOINT ["./test/continuous/run.sh"]


FROM base as deploy
RUN for cluster in gke gke-beta mig mig-with-container slurm; do \
terraform -chdir="./terraform/modules/cluster/${cluster}" init; done
RUN for cluster in gke gke-beta mig mig-cos slurm; do \
terraform -chdir="./a3/terraform/modules/cluster/${cluster}" init; done
COPY scripts ./scripts
ENTRYPOINT ["./scripts/entrypoint.sh"]
139 changes: 0 additions & 139 deletions Quickstart.md

This file was deleted.

127 changes: 34 additions & 93 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,88 +3,26 @@
## Purpose

The purpose of this tool is to provide a very quick and simple way to provision
a Google Cloud Platform (GCP) A3 compute cluster.
Google Cloud Platform (GCP) compute clusters of specifically
[accelerator optimized machines](https://cloud.google.com/compute/docs/accelerator-optimized-machines).

## What is an A3 cluster?
## Machine Type Comparison

An A3 cluster provides the following resources:

- one or many `a3-highgpu-8g` VM instances (documentation not available yet,
description
[here](https://cloud.google.com/blog/products/compute/introducing-a3-supercomputers-with-nvidia-h100-gpus))
- five virtual network interface cards (vNIC) -- two Nvidia H100 GPUs connected to each vNIC
- TCPX (documentation not available yet)

## Control plane options

A3 clusters may be created through either [GKE](https://cloud.google.com/kubernetes-engine) or a [MIG](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups) via the modules found [here](./terraform/modules/cluster). Due to the recency of A3's release, features are limited in each control plane, and those limitations are listed below.

| Feature \ Module | `gke` | `mig-with-container` |
| Feature \ Machine | A2 | [A3](./a3) |
| --- | --- | --- |
| [VM Image](https://cloud.google.com/compute/docs/images) | [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) | [COS-Cloud](https://cloud.google.com/container-optimized-os/docs) |
| [Compact placement policy](https://cloud.google.com/compute/docs/instances/define-instance-placement) | Yes | Yes |
| [Kubernetes](https://kubernetes.io/) support | Yes | No |

## Quickstart with `gke`

An A3 cluster of eight nodes (two node pools with four nodes per node pool) booting with a COS-Cloud image can be created via GKE by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
project_id = "my-project"
resource_prefix = "my-cluster"
region = "us-central1"
gke_version = "1.27.4-gke.900"
disk_type = "pd-ssd"
node_pools = [
{
zone = "us-central1-c"
node_count = 4
machine_type = "a3-highgpu-8g"
},
{
zone = "us-central1-c"
node_count = 4
machine_type = "a3-highgpu-8g"
},
]
EOF

docker run --rm -v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
create gke-beta
```

A deeper dive into how to use this tool can be found [below](#how-to-provision-a-cluster).
| Nvidia GPU Type | [A100](https://www.nvidia.com/en-us/data-center/a100/) -- 40GB and 80GB | [H100 80GB](https://www.nvidia.com/en-us/data-center/h100/) |
| VM Shapes | [Several](https://cloud.google.com/compute/docs/gpus#a100-gpus) | 8 GPUs |
| GPUDirect-TCPX | Unsupported | Supported |
| Multi-NIC | Unsupported | 5 vNICS -- 1 for CPU and 4 for GPUs (one per pair of GPUs) |

## Quickstart with `mig-with-container`

An A3 cluster of eight nodes booting with a COS-Cloud image can be created via a managed instance group by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
project_id = "my-project"
resource_prefix = "my-cluster"
target_size = 8
zone = "us-central1-c"
EOF

docker run --rm -v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
create mig-with-container
```

A deeper dive into how to use this tool can be found [below](#how-to-provision-a-cluster).

## Repository content summary
## Repository Content Summary

This repository contains:

- a set of terraform modules that creates GCP resources geared toward running
AI/ML workloads on [A3 VMs](#what-is-an-a3-cluster).
- an [entrypoint script](./scripts/entrypoint.sh) that will find or create a
- sets of terraform modules that create GCP resources, each tailored toward
running AI/ML workloads on a specific
[accelerator optimized machine type](https://cloud.google.com/compute/docs/accelerator-optimized-machines).
- an [entrypoint script](../scripts/entrypoint.sh) that will find or create a
terraform backend in a Google Cloud Storage (GCS) bucket, call the
appropriate terraform commands using the terraform modules and a user
provided terraform variables (`tfvars`) file, and upload all logs to the GCS
Expand All @@ -94,7 +32,9 @@ This repository contains:
-- that has all necessary tools installed which calls the entrypoint script
and creates a cluster for you.

# Prerequisites
# How to provision a cluster

## Prerequisites

In order to provision a cluster, the following are required:

Expand All @@ -103,7 +43,7 @@ In order to provision a cluster, the following are required:
[`roles/editor`](https://cloud.google.com/iam/docs/understanding-roles#basic).
- [`gcloud` authorization](https://cloud.google.com/sdk/docs/authorizing): explained below.

## Google Cloud Authentication
### Google Cloud Authentication

The command to authorize tools to create resources on your behalf is:

Expand All @@ -122,7 +62,7 @@ The above command is:
- necessary when using this repository in an existing terraform module or
HPC-Toolkit blueprint.

# How to provision a cluster
## Methods

After running through the [prerequisites above](#prerequisites), there are a
few ways to provision a cluster:
Expand All @@ -137,12 +77,12 @@ few ways to provision a cluster:
(or plan to have) an HPC Toolkit Blueprint and would like to have the same
`ghpc deploy` create this cluster along with all your other infrastructure.

## Run the docker image
### Run the docker image

For this method, all you need (in addition to the above requirements) is a
`terraform.tfvars` file (user generated or copied from an [example](./samples)) in your
current directory and the ability to run [docker](https://www.docker.com/). In
a terminal, change your current working directory to this one and run the command:
`terraform.tfvars` file (user generated or copied from an example --
[a3](./a3/examples)) in your current directory and the ability to run
[docker](https://www.docker.com/). In a terminal, run:

```bash
# create/update the cluster
Expand All @@ -151,15 +91,15 @@ docker run \
-v "${HOME}/.config/gcloud:/root/.config/gcloud" \
-v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
create mig-with-container
create a3 mig-cos

# destroy the cluster
docker run \
--rm \
-v "${HOME}/.config/gcloud:/root/.config/gcloud" \
-v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
destroy mig-with-container
destroy a3 mig-cos
```

Quick explanation of the `docker run` flags (in same order as above):
Expand All @@ -170,16 +110,17 @@ Quick explanation of the `docker run` flags (in same order as above):
the container so the tool can read the `terraform.tfvars` file.
- `create/destroy` tells the tool whether it should create or destroy the whole
cluster.
- `mig-with-container` tells the tool to create a Managed Instance Group and
- `a3` specifies which type of cluster to provision -- this will influence mainly machine type, networking, and startup scripts.
- `mig-cos` tells the tool to create a Managed Instance Group and
start a container at boot.

## Integrate into an existing terraform project
### Integrate into an existing terraform project

For this method, you need to
[install terraform](https://developer.hashicorp.com/terraform/downloads).
Examples of usage as a terraform module can be found in the `main.tf` files in
any of the directories found [here](./samples/a3). Cluster provisioning then
happens the same as any other terraform:
any of the examples -- [a3](./a3/examples). Cluster provisioning then happens
the same as any other terraform:

```bash
# assuming the directory containing main.tf is the current working directory
Expand All @@ -191,21 +132,21 @@ terraform init && terraform validate && terraform apply
terraform init && terraform validate && terraform apply -destroy
```

## Integrate into an existing HPC Toolkit Blueprint
### Integrate into an existing HPC Toolkit Blueprint

For this method, you need to
[build ghpc](https://github.com/GoogleCloudPlatform/hpc-toolkit#quickstart).
The `a3-mig-with-container` deployment group in the `blueprint.yaml` shows how
to use the `mig-with-container` module in your HPC Toolkit Blueprint. Cluster
Examples of usage as an HPC Toolkit Blueprint can be found in the
`blueprint.yaml` files in any of the examples -- [a3](./a3/examples). Cluster
provisioning then happens the same as any blueprint:

```bash
# assuming the ghpc binary and blueprint.yaml are both in
# the current working directory

# create/update the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc deploy a3-mig-with-container
./ghpc create -w ./blueprint.yaml && ./ghpc deploy a3-mig-cos

# destroy the cluster
./ghpc create -w ./blueprint.yaml && ./ghpc destroy a3-mig-with-container
./ghpc create -w ./blueprint.yaml && ./ghpc destroy a3-mig-cos
```
Loading

0 comments on commit c60acf8

Please sign in to comment.