Showing 54 changed files with 2,429 additions and 3 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Overview | ||
|
||
## Control Plane Options | ||
|
||
A2 clusters are created through a managed instance group ([MIG](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups)) using the modules in [`terraform/modules/cluster`](./terraform/modules/cluster). | ||
|
||
## Quickstart with `mig` | ||
|
||
An A2 cluster of 4 `a2-highgpu-1g` nodes (2 instance groups with 2 instances each) booting with a DLVM image can be created via a managed instance group by running the following two commands: | ||
|
||
```bash | ||
cat >./terraform.tfvars <<EOF | ||
instance_groups = [ | ||
{ | ||
target_size = 2 | ||
zone = "us-central1-c" | ||
}, | ||
{ | ||
target_size = 2 | ||
zone = "us-central1-c" | ||
}, | ||
] | ||
project_id = "my-project" | ||
region = "us-central1" | ||
resource_prefix = "my-cluster" | ||
EOF | ||
|
||
docker run --rm -v "${HOME}/.config/gcloud:/root/.config/gcloud" \ | ||
-v "${PWD}:/root/aiinfra/input" \ | ||
us-docker.pkg.dev/gce-ai-infra/cluster-provision-dev/cluster-provision-image:latest \ | ||
create a2 mig | ||
``` | ||
|
||
A deeper dive into how to use this tool can be found at the [top-level README](../README.md#how-to-provision-a-cluster). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
# The cluster | ||
|
||
This configuration creates two Managed Instance Groups of four | ||
[`a2-highgpu-1g`](https://cloud.google.com/compute/docs/gpus#a100-40gb) | ||
VM instances each (eight instances in total). Each instance has: | ||
- one [NVIDIA A100 GPU](https://www.nvidia.com/en-us/data-center/a100/), | ||
- one NIC, | ||
- a [DLVM](https://cloud.google.com/deep-learning-vm) machine | ||
image, | ||
- NVIDIA GPU drivers. | ||
|
||
# The tfvars file | ||
|
||
The `terraform.tfvars` file is what configures the cluster. Detailed | ||
descriptions of each variable can be found in | ||
[this `README`](../../terraform/modules/cluster/mig/README.md). | ||
All optional variables may be omitted to use their default values. | ||
|
||
Required variables: | ||
- `instance_groups` | ||
- `project_id` | ||
- `region` | ||
- `resource_prefix` | ||
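
As a rough sketch, a `terraform.tfvars` matching the cluster described above (two groups of four instances) might look like the following. The project ID, region, zone, and prefix are placeholder values borrowed from the accompanying blueprint, and the `machine_type` field is optional; it appears in the first group only to make the module's documented default explicit.

```hcl
# Illustrative values only; replace with your own project, region, and prefix.
instance_groups = [
  {
    target_size  = 4
    zone         = "us-east4-a"
    machine_type = "a2-highgpu-1g" # optional; this is the documented default
  },
  {
    target_size = 4
    zone        = "us-east4-a"
  },
]
project_id      = "my-project-id"
region          = "us-east4"
resource_prefix = "my-cluster-name"
```

Because the example's `main.tf` declares only these four variables, any other module input would need to be added to `main.tf` and passed through before it could be set from this file.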
|
||
# How to create this cluster | ||
|
||
Refer to [this section](../../../a2/README.md#quickstart-with-mig). | ||
|
||
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK --> | ||
## Requirements | ||
|
||
No requirements. | ||
|
||
## Providers | ||
|
||
No providers. | ||
|
||
## Modules | ||
|
||
| Name | Source | Version | | ||
|------|--------|---------| | ||
| <a name="module_a2-mig"></a> [a2-mig](#module\_a2-mig) | github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig | n/a | | ||
|
||
## Resources | ||
|
||
No resources. | ||
|
||
## Inputs | ||
|
||
| Name | Description | Type | Default | Required | | ||
|------|-------------|------|---------|:--------:| | ||
| <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | n/a | `any` | n/a | yes | | ||
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | n/a | `any` | n/a | yes | | ||
| <a name="input_region"></a> [region](#input\_region) | n/a | `any` | n/a | yes | | ||
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | n/a | `any` | n/a | yes | | ||
|
||
## Outputs | ||
|
||
No outputs. | ||
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK --> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
# Copyright 2022 Google LLC | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
--- | ||
|
||
blueprint_name: a2-mig | ||
|
||
vars: | ||
deployment_name: a2-mig | ||
|
||
instance_groups: | ||
- target_size: 4 | ||
zone: us-east4-a | ||
- target_size: 4 | ||
zone: us-east4-a | ||
project_id: my-project-id | ||
region: us-east4 | ||
resource_prefix: my-cluster-name | ||
|
||
deployment_groups: | ||
- group: primary | ||
modules: | ||
- id: a2-mig | ||
source: "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
variable "instance_groups" {} | ||
variable "project_id" {} | ||
variable "region" {} | ||
variable "resource_prefix" {} | ||
|
||
module "a2-mig" { | ||
source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig" | ||
|
||
instance_groups = var.instance_groups | ||
project_id = var.project_id | ||
region = var.region | ||
resource_prefix = var.resource_prefix | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
<!-- BEGINNING OF PRE-COMMIT-TERRAFORM DOCS HOOK --> | ||
Copyright 2022 Google LLC | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
|
||
## Requirements | ||
|
||
No requirements. | ||
|
||
## Providers | ||
|
||
| Name | Version | | ||
|------|---------| | ||
| <a name="provider_google-beta"></a> [google-beta](#provider\_google-beta) | n/a | | ||
|
||
## Modules | ||
|
||
| Name | Source | Version | | ||
|------|--------|---------| | ||
| <a name="module_compute_instance_template"></a> [compute\_instance\_template](#module\_compute\_instance\_template) | ../../common/instance_template | n/a | | ||
| <a name="module_dashboard"></a> [dashboard](#module\_dashboard) | ../../common/dashboard | n/a | | ||
| <a name="module_filestore"></a> [filestore](#module\_filestore) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/filestore// | v1.17.0 | | ||
| <a name="module_gcsfuse"></a> [gcsfuse](#module\_gcsfuse) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/file-system/pre-existing-network-storage// | v1.17.0 | | ||
| <a name="module_network"></a> [network](#module\_network) | ../../common/network | n/a | | ||
| <a name="module_startup"></a> [startup](#module\_startup) | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script/ | v1.17.0 | | ||
|
||
## Resources | ||
|
||
| Name | Type | | ||
|------|------| | ||
| [google-beta_google_compute_instance_group_manager.mig](https://registry.terraform.io/providers/hashicorp/google-beta/latest/docs/resources/google_compute_instance_group_manager) | resource | | ||
|
||
## Inputs | ||
|
||
| Name | Description | Type | Default | Required | | ||
|------|-------------|------|---------|:--------:| | ||
| <a name="input_disk_size_gb"></a> [disk\_size\_gb](#input\_disk\_size\_gb) | The size of the image in gigabytes for the boot disk of each instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_size_gb), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--boot-disk-size). | `number` | `128` | no | | ||
| <a name="input_disk_type"></a> [disk\_type](#input\_disk\_type) | The GCE disk type for the boot disk of each instance.<br><br>Possible values: `["pd-ssd", "local-ssd", "pd-balanced", "pd-standard"]`<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--boot-disk-type). | `string` | `"pd-ssd"` | no | | ||
| <a name="input_enable_ops_agent"></a> [enable\_ops\_agent](#input\_enable\_ops\_agent) | Install [Google Cloud Ops Agent](https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent). | `bool` | `true` | no | | ||
| <a name="input_enable_ray"></a> [enable\_ray](#input\_enable\_ray) | Install [Ray](https://docs.ray.io/en/latest/cluster/getting-started.html). | `bool` | `false` | no | | ||
| <a name="input_filestore_new"></a> [filestore\_new](#input\_filestore\_new) | Configurations to mount newly created network storage. Each object describes NFS file-servers to be hosted in Filestore.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#inputs).<br><br> ------------<br> `filestore_new.filestore_tier`<br><br> The service tier of the instance.<br><br> Possible values: `["BASIC_HDD", "BASIC_SSD", "HIGH_SCALE_SSD", "ENTERPRISE"]`.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_filestore_tier), [gcloud](https://cloud.google.com/sdk/gcloud/reference/filestore/instances/create#--tier).<br><br> ------------<br> `filestore_new.local_mount`<br><br> Mountpoint for this filestore instance.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_local_mount).<br><br> ------------<br> `filestore_new.size_gb`<br><br> Storage size of the filestore instance in GB.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/filestore#input_local_mount), [gcloud](https://cloud.google.com/sdk/gcloud/reference/filestore/instances/create#--file-share).<br>-<br> `filestore_new.zone`<br><br> Location for filestore instance.<br><br> Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/f | <pre>list(object({<br> filestore_tier = string<br> local_mount = string<br> size_gb = number<br> zone = string<br> }))</pre> | `[]` | no | | ||
| <a name="input_gcsfuse_existing"></a> [gcsfuse\_existing](#input\_gcsfuse\_existing) | Configurations to mount existing network storage. Each object describes Cloud Storage Buckets to be mounted with Cloud Storage FUSE.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#inputs).<br><br>------------<br>`gcsfuse_existing.local_mount`<br><br>The mount point where the contents of the device may be accessed after mounting.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#input_local_mount).<br><br>------------<br>`gcsfuse_existing.remote_mount`<br><br>Bucket name without “gs://”.<br><br>Related docs: [hpc-toolkit](https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/main/modules/file-system/pre-existing-network-storage#input_remote_mount). | <pre>list(object({<br> local_mount = string<br> remote_mount = string<br> }))</pre> | `[]` | no | | ||
| <a name="input_instance_groups"></a> [instance\_groups](#input\_instance\_groups) | Fields (`target_size` and `zone` are required):<br>- `target_size`: The number of running instances for this managed instance group. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#target_size), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--size).<br>- `zone`: The zone that instances in this group should be created in. Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_group_manager#zone), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-groups/managed/create#--zone).<br>- `machine_type`: (Optional) The name of a Google Compute Engine machine type. There are [many possible values](https://cloud.google.com/compute/docs/machine-resource). Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#machine_type), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--machine-type). | <pre>list(object({<br> zone = string<br> target_size = number<br> machine_type = optional(string, "a2-highgpu-1g")<br> }))</pre> | n/a | yes | | ||
| <a name="input_labels"></a> [labels](#input\_labels) | The resource labels (a map of key/value pairs) to be applied to the GPU cluster.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#labels), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--labels). | `map(string)` | `{}` | no | | ||
| <a name="input_machine_image"></a> [machine\_image](#input\_machine\_image) | The image with which this disk will initialize.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#source_image).<br><br>------------<br>`machine_image.family`<br><br>The family of images from which the latest non-deprecated image will be selected. Conflicts with `machine_image.name`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-family).<br><br>------------<br>`machine_image.name`<br><br>The name of a specific image. Conflicts with `machine_image.family`.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#name), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image).<br><br>------------<br>`machine_image.project`<br><br>The project\_id to which this image belongs.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/data-sources/compute_image#project), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--image-project). | <pre>object({<br> family = string<br> name = string<br> project = string<br> })</pre> | <pre>{<br> "family": "pytorch-latest-gpu-debian-11-py310",<br> "name": null,<br> "project": "deeplearning-platform-release"<br>}</pre> | no | | ||
| <a name="input_metadata"></a> [metadata](#input\_metadata) | GCE metadata to attach to each instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#metadata), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--metadata). | `map(string)` | `{}` | no | | ||
| <a name="input_network_existing"></a> [network\_existing](#input\_network\_existing) | Existing network to attach to nic0. Setting to null will create a new network for it. | <pre>object({<br> network_name = string<br> subnetwork_name = string<br> })</pre> | `null` | no | | ||
| <a name="input_project_id"></a> [project\_id](#input\_project\_id) | GCP Project ID to which the cluster will be deployed. | `string` | n/a | yes | | ||
| <a name="input_region"></a> [region](#input\_region) | The region in which all instances will reside. | `string` | n/a | yes | | ||
| <a name="input_resource_prefix"></a> [resource\_prefix](#input\_resource\_prefix) | Arbitrary string with which all names of newly created resources will be prefixed. | `string` | n/a | yes | | ||
| <a name="input_service_account"></a> [service\_account](#input\_service\_account) | Service account to attach to the instance.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#service_account).<br><br>------------<br>`service_account.email`<br><br>The service account e-mail address. If not given, the default Google Compute Engine service account is used.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#email), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--service-account).<br><br>------------<br>`service_account.scopes`<br><br>A list of service scopes. Both OAuth2 URLs and gcloud short names are supported. To allow full access to all Cloud APIs, use the `"cloud-platform"` scope. See a complete list of scopes [here](https://cloud.google.com/sdk/gcloud/reference/alpha/compute/instances/set-scopes#--scopes).<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#scopes), [gcloud](https://cloud.google.com/sdk/gcloud/reference/compute/instance-templates/create#--scopes). | <pre>object({<br> email = string,<br> scopes = set(string)<br> })</pre> | `null` | no | | ||
| <a name="input_startup_script"></a> [startup\_script](#input\_startup\_script) | Shell script -- the actual script (not the filename). | `string` | `null` | no | | ||
| <a name="input_startup_script_file"></a> [startup\_script\_file](#input\_startup\_script\_file) | The full path in the VM to the shell script to be executed at VM startup. | `string` | `null` | no | | ||
| <a name="input_startup_script_gcs_bucket_path"></a> [startup\_script\_gcs\_bucket\_path](#input\_startup\_script\_gcs\_bucket\_path) | The storage bucket full path to be used for storing the startup script.<br>Example: `gs://bucketName/dirName`<br><br>If the value is not provided, then a default storage bucket will be created for the script execution.<br>`storage.buckets.create` IAM permission is needed for creating the default storage bucket. | `string` | `null` | no | | ||
| <a name="input_use_compact_placement_policy"></a> [use\_compact\_placement\_policy](#input\_use\_compact\_placement\_policy) | The flag to create and use a superblock level compact placement policy for the instances. Currently GCE supports using only 1 placement policy. | `bool` | `false` | no | | ||
| <a name="input_wait_for_instances"></a> [wait\_for\_instances](#input\_wait\_for\_instances) | Whether to wait for all instances to be created/updated before returning. Note that if this is set to true and the operation does not succeed, Terraform will continue trying until it times out.<br><br>Related docs: [terraform](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_region_instance_group_manager#wait_for_instances). | `bool` | `true` | no | | ||
|
||
## Outputs | ||
|
||
| Name | Description | | ||
|------|-------------| | ||
| <a name="output_instructions"></a> [instructions](#output\_instructions) | Instructions for accessing the dashboard | | ||
<!-- END OF PRE-COMMIT-TERRAFORM DOCS HOOK --> |
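
A minimal sketch of calling this module directly with a few optional inputs set, assuming the input names and types listed in the table above. The module label `a2_cluster`, the bucket name, the mount path, and the label values are hypothetical placeholders, not values from this repository.

```hcl
module "a2_cluster" {
  source = "github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning//a2/terraform/modules/cluster/mig"

  # Required inputs.
  instance_groups = [
    { target_size = 2, zone = "us-central1-c" },
  ]
  project_id      = "my-project"
  region          = "us-central1"
  resource_prefix = "my-cluster"

  # Optional inputs; anything omitted keeps its documented default.
  disk_size_gb = 256
  enable_ray   = true
  labels       = { team = "ml-research" }

  # Mount an existing Cloud Storage bucket (hypothetical name, given without "gs://").
  gcsfuse_existing = [
    {
      local_mount  = "/mnt/training-data"
      remote_mount = "my-training-data-bucket"
    },
  ]
}
```

Leaving `machine_type` out of an `instance_groups` entry falls back to `a2-highgpu-1g`, as the type signature in the inputs table indicates.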