Release v1.3.0 (#317)
soumyapani authored Oct 6, 2023
2 parents 64a5d8a + 5314dee commit d694121
Showing 54 changed files with 2,429 additions and 3 deletions.
3 changes: 3 additions & 0 deletions Dockerfile
@@ -15,6 +15,7 @@ RUN curl -s "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terra
&& rm -f ./terraform.zip \
&& mv ./terraform /root/.local/bin/terraform
COPY ./a3/terraform ./a3/terraform
COPY ./a2/terraform ./a2/terraform


FROM base as test
@@ -33,5 +34,7 @@ ENTRYPOINT ["./test/continuous/run.sh"]
FROM base as deploy
RUN for cluster in gke gke-beta mig mig-cos slurm; do \
terraform -chdir="./a3/terraform/modules/cluster/${cluster}" init; done
RUN for cluster in mig; do \
terraform -chdir="./a2/terraform/modules/cluster/${cluster}" init; done
COPY scripts ./scripts
ENTRYPOINT ["./scripts/entrypoint.sh"]
34 changes: 34 additions & 0 deletions a2/README.md
@@ -0,0 +1,34 @@
# Overview

## Control Plane Options

A2 clusters are created through a [MIG](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups) via the modules found [here](./terraform/modules/cluster).

## Quickstart with `mig`

An A2 cluster of 4 `a2-highgpu-1g` nodes (2 instance groups with 2 instances each) booting with a DLVM image can be created via a managed instance group by running the following two commands:

```bash
cat >./terraform.tfvars <<EOF
instance_groups = [
  {
    target_size = 2
    zone        = "us-central1-c"
  },
  {
    target_size = 2
    zone        = "us-central1-c"
  },
]
project_id = "my-project"
region = "us-central1"
resource_prefix = "my-cluster"
EOF

docker run --rm -v "${HOME}/.config/gcloud:/root/.config/gcloud" \
-v "${PWD}:/root/aiinfra/input" \
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \
create a2 mig
```
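
Before starting the container, it can help to confirm that `terraform.tfvars` assigns every variable the `mig` module marks as required (`instance_groups`, `project_id`, `region`, `resource_prefix`). The following is a minimal sketch; `check_required_tfvars` is a hypothetical helper and is not part of this repository.

```bash
# Hypothetical helper (not part of this repository): checks that a tfvars
# file assigns every variable the a2 mig module marks as required.
check_required_tfvars() {
  local tfvars_file="$1"
  local missing=0
  local name
  for name in instance_groups project_id region resource_prefix; do
    # Match "name =" at the start of a line, allowing leading whitespace.
    if ! grep -Eq "^[[:space:]]*${name}[[:space:]]*=" "${tfvars_file}"; then
      echo "missing required variable: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}
```

Calling `check_required_tfvars ./terraform.tfvars` returns a non-zero status and names each missing variable on stderr, so it can gate the `docker run` command above. It only checks that an assignment exists, not that the value is valid.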

A deeper dive into how to use this tool can be found at the [top-level README](../README.md#how-to-provision-a-cluster).
60 changes: 60 additions & 0 deletions a2/examples/mig/README.md
@@ -0,0 +1,60 @@
# The cluster

This configuration creates two Managed Instance Groups of four
[`a2-highgpu-1g`](https://cloud.google.com/compute/docs/gpus#a100-40gb)
VM instances each (eight instances in total). Each instance has:
- one [NVIDIA A100 GPU](https://www.nvidia.com/en-us/data-center/a100/),
- one NIC,
- a [DLVM](https://cloud.google.com/deep-learning-vm) machine
image,
- NVIDIA GPU drivers.

# The tfvars file

The `terraform.tfvars` file is what configures the cluster. Detailed
descriptions of each variable can be found in
[this `README`](../../terraform/modules/cluster/mig/README.md).
All optional variables may be omitted to use their default values.

Required variables:
- `instance_groups`
- `project_id`
- `region`
- `resource_prefix`
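
As a sketch of how optional variables sit alongside the required ones, the file below sets a few inputs documented in the module `README`. All values are placeholders, and `a2-highgpu-2g` assumes the chosen zone offers that machine type. It also assumes the file is consumed by the module directly, as in the quickstart; the `main.tf` in this example directory forwards only the four required variables.

```bash
# Sketch only: every value below is a placeholder.
cat >./terraform.tfvars <<'EOF'
instance_groups = [
  {
    target_size  = 4
    zone         = "us-east4-a"
    machine_type = "a2-highgpu-2g" # optional; defaults to "a2-highgpu-1g"
  },
]
project_id      = "my-project-id"
region          = "us-east4"
resource_prefix = "my-cluster-name"

# Optional variables; omit any of these to keep the module default.
disk_size_gb = 256
enable_ray   = true
labels       = { team = "ml-research" }
EOF
```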

# How to create this cluster

Refer to [this section](../../../a2/README.md#quickstart-with-mig).

<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
## Requirements

No requirements.

## Providers

No providers.

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_a2-mig"></a> [a2-mig](#module\_a2-mig) | github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig | n/a |

## Resources

No resources.

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | n/a | `any` | n/a | yes |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | n/a | `any` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | n/a | `any` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | n/a | `any` | n/a | yes |

## Outputs

No outputs.
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
35 changes: 35 additions & 0 deletions a2/examples/mig/blueprint.yaml
@@ -0,0 +1,35 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---

blueprint_name: a2-mig

vars:
  deployment_name: a2-mig

  instance_groups:
    - target_size: 4
      zone: us-east4-a
    - target_size: 4
      zone: us-east4-a
  project_id: my-project-id
  region: us-east4
  resource_prefix: my-cluster-name

deployment_groups:
  - group: primary
    modules:
      - id: a2-mig
        source: "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig"
13 changes: 13 additions & 0 deletions a2/examples/mig/main.tf
@@ -0,0 +1,13 @@
variable "instance_groups" {}
variable "project_id" {}
variable "region" {}
variable "resource_prefix" {}

module "a2-mig" {
  source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig"

  instance_groups = var.instance_groups
  project_id      = var.project_id
  region          = var.region
  resource_prefix = var.resource_prefix
}
73 changes: 73 additions & 0 deletions a2/terraform/modules/cluster/mig/README.md
@@ -0,0 +1,73 @@
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Requirements

No requirements.

## Providers

| Name | Version |
|------|---------|
| <a name="provider_google-beta"></a> [google-beta](#provider\_google-beta) | n/a |

## Modules

| Name | Source | Version |
|------|--------|---------|
| <a name="module_compute_instance_template"></a> [compute\_instance\_template](#module\_compute\_instance\_template) | ../../common/instance_template | n/a |
| <a name="module_dashboard"></a> [dashboard](#module\_dashboard) | ../../common/dashboard | n/a |
| <a name="module_filestore"></a> [filestore](#module\_filestore) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/filestore// | v1.17.0 |
| <a name="module_gcsfuse"></a> [gcsfuse](#module\_gcsfuse) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/pre-existing-network-storage// | v1.17.0 |
| <a name="module_network"></a> [network](#module\_network) | ../../common/network | n/a |
| <a name="module_startup"></a> [startup](#module\_startup) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script/ | v1.17.0 |

## Resources

| Name | Type |
|------|------|
| [google-beta_google_compute_instance_group_manager.mig](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_compute_instance_group_manager) | resource |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | The size of the image in gigabytes for the boot disk of each instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--boot-disk-size). | `number` | `128` | no |
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | The GCE disk type for the boot disk of each instance.<br><br>Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--boot-disk-type). | `string` | `"pd-ssd"` | no |
| <a name="input_enable_ops_agent"></a> [enable\_ops\_agent](#input\_enable\_ops\_agent) | Install [Google Cloud Ops Agent](https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent). | `bool` | `true` | no |
| <a name="input_enable_ray"></a> [enable\_ray](#input\_enable\_ray) | Install [Ray](https://docs.ray.io/en/latest/cluster/getting-started.html). | `bool` | `false` | no |
| <a name="input_filestore_new"></a> [filestore\_new](#input\_filestore\_new) | Configurations to mount newly created network storage. Each object describes NFS file-servers to be hosted in Filestore.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#inputs).<br><br> ------------<br> `filestore_new.filestore_tier`<br><br> The service tier of the instance.<br><br> Possible values: `["BASIC_HDD", "BASIC_SSD", "HIGH_SCALE_SSD", "ENTERPRISE"]`.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_filestore_tier), [gcloud](https://cloud.google.com/sdk/gcloud/reference/filestore/instances/create#--tier).<br><br> ------------<br> `filestore_new.local_mount`<br><br> Mountpoint for this filestore instance.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_local_mount).<br><br> ------------<br> `filestore_new.size_gb`<br><br> Storage size of the filestore instance in GB.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_local_mount), [gcloud](https://cloud.google.com/sdk/gcloud/reference/filestore/instances/create#--file-share).<br>-<br> `filestore_new.zone`<br><br> Location for filestore instance.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/f | <pre>list(object({<br> filestore_tier = string<br> local_mount = string<br> size_gb = number<br> zone = string<br> }))</pre> | `[]` | no |
| <a name="input_gcsfuse_existing"></a> [gcsfuse\_existing](#input\_gcsfuse\_existing) | Configurations to mount existing network storage. Each object describes Cloud Storage Buckets to be mounted with Cloud Storage FUSE.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#inputs).<br><br>------------<br>`gcsfuse_existing.local_mount`<br><br>The mount point where the contents of the device may be accessed after mounting.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#input_local_mount).<br><br>------------<br>`gcsfuse_existing.remote_mount`<br><br>Bucket name without “gs://”.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#input_remote_mount). | <pre>list(object({<br> local_mount = string<br> remote_mount = string<br> }))</pre> | `[]` | no |
| <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | Required Fields:<br>- `target_size`: The number of running instances for this managed instance group. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#target_size), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--size).<br>- `zone`: The zone that instances in this group should be created in. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#zone), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--zone).<br>- `machine_type`: (Optional) The name of a Google Compute Engine machine type. There are [many possible values](https://cloud.google.com/compute/docs/machine-resource). Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#machine_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--machine-type). | <pre>list(object({<br> zone = string<br> target_size = number<br> machine_type = optional(string, "a2-highgpu-1g")<br> }))</pre> | n/a | yes |
| <a name="input_labels"></a> [labels](#input\_labels) | The resource labels (a map of key/value pairs) to be applied to the GPU cluster.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#labels), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--labels). | `map(string)` | `{}` | no |
| <a name="input_machine_image"></a> [machine\_image](#input\_machine\_image) | The image with which this disk will initialize.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#source_image).<br><br>------------<br>`machine_image.family`<br><br>The family of images from which the latest non-deprecated image will be selected. Conflicts with `machine_image.name`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-family).<br><br>------------<br>`machine_image.name`<br><br>The name of a specific image. Conflicts with `machine_image.family`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image).<br><br>------------<br>`machine_image.project`<br><br>The project\_id to which this image belongs.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#project), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-project). | <pre>object({<br> family = string<br> name = string<br> project = string<br> })</pre> | <pre>{<br> "family": "pytorch-latest-gpu-debian-11-py310",<br> "name": null,<br> "project": "deeplearning-platform-release"<br>}</pre> | no |
| <a name="input_metadata"></a> [metadata](#input\_metadata) | GCE metadata to attach to each instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#metadata), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--metadata). | `map(string)` | `{}` | no |
| <a name="input_network_existing"></a> [network\_existing](#input\_network\_existing) | Existing network to attach to nic0. Setting to null will create a new network for it. | <pre>object({<br> network_name = string<br> subnetwork_name = string<br> })</pre> | `null` | no |
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | GCP Project ID to which the cluster will be deployed. | `string` | n/a | yes |
| <a name="input_region"></a> [region](#input\_region) | The region in which all instances will reside. | `string` | n/a | yes |
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | Arbitrary string with which all names of newly created resources will be prefixed. | `string` | n/a | yes |
| <a name="input_service_account"></a> [service\_account](#input\_service\_account) | Service account to attach to the instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#service_account).<br><br>------------<br>`service_account.email`<br><br>The service account e-mail address. If not given, the default Google Compute Engine service account is used.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#email), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--service-account).<br><br>------------<br>`service_account.scopes`<br><br>A list of service scopes. Both OAuth2 URLs and gcloud short names are supported. To allow full access to all Cloud APIs, use the `"cloud-platform"` scope. See a complete list of scopes [here](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/instances/set-scopes#--scopes).<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#scopes), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--scopes). | <pre>object({<br> email = string,<br> scopes = set(string)<br> })</pre> | `null` | no |
| <a name="input_startup_script"></a> [startup\_script](#input\_startup\_script) | Shell script -- the actual script (not the filename). | `string` | `null` | no |
| <a name="input_startup_script_file"></a> [startup\_script\_file](#input\_startup\_script\_file) | The full path in the VM to the shell script to be executed at VM startup. | `string` | `null` | no |
| <a name="input_startup_script_gcs_bucket_path"></a> [startup\_script\_gcs\_bucket\_path](#input\_startup\_script\_gcs\_bucket\_path) | The storage bucket full path to be used for storing the startup script.<br>Example: `gs://bucketName/dirName`<br><br>If the value is not provided, then a default storage bucket will be created for the script execution.<br>`storage.buckets.create` IAM permission is needed for creating the default storage bucket. | `string` | `null` | no |
| <a name="input_use_compact_placement_policy"></a> [use\_compact\_placement\_policy](#input\_use\_compact\_placement\_policy) | The flag to create and use a superblock level compact placement policy for the instances. Currently GCE supports using only 1 placement policy. | `bool` | `false` | no |
| <a name="input_wait_for_instances"></a> [wait\_for\_instances](#input\_wait\_for\_instances) | Whether to wait for all instances to be created/updated before returning. Note that if this is set to true and the operation does not succeed, Terraform will continue trying until it times out.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_region_instance_group_manager#wait_for_instances). | `bool` | `true` | no |

## Outputs

| Name | Description |
|------|-------------|
| <a name="output_instructions"></a> [instructions](#output\_instructions) | Instructions for accessing the dashboard |
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK -->
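
The storage inputs above take lists of objects, which can be easier to read in a concrete form. The sketch below appends one `filestore_new` entry and one `gcsfuse_existing` entry to a `terraform.tfvars` file, using only the fields listed in the Inputs table. The mount points, size, zone, and bucket name are hypothetical placeholders.

```bash
# Sketch only: appends storage settings to ./terraform.tfvars.
# All values are placeholders chosen for illustration.
cat >>./terraform.tfvars <<'EOF'

# Create a new Filestore instance and mount it on each instance.
filestore_new = [
  {
    filestore_tier = "BASIC_HDD"
    local_mount    = "/mnt/filestore"
    size_gb        = 1024
    zone           = "us-central1-c"
  },
]

# Mount an existing Cloud Storage bucket with Cloud Storage FUSE.
gcsfuse_existing = [
  {
    local_mount  = "/mnt/gcs"
    remote_mount = "my-bucket" # hypothetical bucket name, without "gs://"
  },
]
EOF
```

Both variables default to `[]`, so either block can be dropped independently.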