Releases: GoogleCloudPlatform/ai-infra-cluster-provisioning
v0.5.0: GKE cluster and Node Pool support
What's Changed
- Merging main with develop. by @soumyapani in #97
- Adding support for creating a GKE cluster using provisioning tool. by @soumyapani in #83
- fix #99 - coalescelist error with node_pools by @stevenBorisko in #100
- Disable dashboard creation when ops agent is not enabled. by @soumyapani in #107
- Adding node pool expansion for GKE cluster. by @soumyapani in #103
- address #89 - create metric descriptors before dashboard by @stevenBorisko in #101
- Remove Docker credential helper step by @sdlin in #110
- Main to develop by @stevenBorisko in #112
- Main to develop by @stevenBorisko in #114
- add terraform tests by @stevenBorisko in #91
- Using latest startup-script module to support replace startup script for cluster recreate. by @soumyapani in #123
- env var fixes by @stevenBorisko in #124
- fix typo in orchestrator validation by @stevenBorisko in #125
- 1. Integrating GKE GPU utilization dashboard 2. GPU dashboards from cloud-monitoring-dashboards enhancement by @soumyapani in #126
- 1. Making Disk size and disk type configurable 2. Making GKE CIDR configurable. by @soumyapani in #130
- Borisko/131 change fileshare to filestore by @stevenBorisko in #134
- Borisko/122 remove metric descriptor hack by @stevenBorisko in #133
- Use GKE version 1.25. Remove PSP and istio from GKE cluster creation. by @soumyapani in #136
- Making gke version configurable. by @soumyapani in #137
- Creating regional GKE cluster with zonal node pool. by @soumyapani in #140
- 1. Increasing Terraform parallelization for GKE cluster creation 2. Adding support for COS images. by @soumyapani in #142
- Release v0.5.0 by @soumyapani in #144
- Fixing continuous test. by @soumyapani in #146
New Contributors
Full Changelog: v0.4.0...v0.5.0
v0.4.0: Ops agent for GPU metric, DLVM bug fix and other bug fixes
What's Changed
- Adding HPC toolkit blueprint to use aiinfra cluster provisioning tool. by @soumyapani in #74
- add scopes to default service account in startup script by @stevenBorisko in #77
- Enabling internet access only for primary network in multi-NIC VPC. by @soumyapani in #78
- New Ops agent installation for GPU metric and corresponding Cloud Monitoring Dashboard by @stevenBorisko in #82
- Temporarily disable GVNIC since DLVM images do not support GVNIC. by @soumyapani in #90
- Making orchestrator configurable. Adding disable_notebook flag. by @soumyapani in #94
- Release v0.4.0 by @soumyapani in #96
New Contributors
- @stevenBorisko made their first contribution in #77
Full Changelog: v0.3.0...v0.4.0
v0.3.0: migration to pure terraform, support for minimal verbosity and disable ops agent installation
What's Changed
- Merging main with develop by @soumyapani in #63
- Creating full terraform module for aiinfra-cluster by @soumyapani in #65
- Temporarily disabling OPS agent installation. by @soumyapani in #68
- Release v0.3.0 by @soumyapani in #71
Full Changelog: v0.2.0...v0.3.0
v0.2.0: converged networking module and minimal terraform verbosity support
What's Changed
- Converging network creation to a single module. by @soumyapani in #55
- Update README.md by @soumyapani in #56
- Supporting minimal terraform verbosity for running with LLM pipeline. by @soumyapani in #61
Full Changelog: v0.1.0...v0.2.0
v0.1.0: MIG with Multi-NIC, GCSFuse and Fileshare support
What's Changed
- 1. Adding GCB file for PR validation. 2. Bug: 260149974: Adding support for passing Action via command line. by @soumyapani in #2
- Adding roadmap file for cluster provisioning. by @soumyapani in #13
- Prompt for cluster connection and other enhancements. by @soumyapani in #24
- 1. Removing CLEANUP_ON_EXIT behavior 2. Use single gcs bucket per project for storing state by @soumyapani in #25
- Fixing copy directory usage and updating README.md by @soumyapani in #29
- Update README.md by @DmitryKakurin in #32
- Adding example training script. by @soumyapani in #30
- 1. Fixing continuous GCB config. 2. Adding documentation for storage object owner access. by @soumyapani in #36
- Fixing GCB config for PR. by @soumyapani in #37
- Fixing datamodel for tensorflow script. Adding resnet training example with ray for pytorch. by @soumyapani in #38
- GCB and Debugging Improvements. by @soumyapani in #39
- Updating test env.list file for PR. by @soumyapani in #41
- Using HPC toolkit modules. by @soumyapani in #43
- Adding support for GCS mount. by @soumyapani in #45
- Fixing Dockerfile to use the right base image for gcloud. by @soumyapani in #47
- Removing local copy of startup-script module and using HPC module. by @soumyapani in #48
- Adding multi-nic support for MIG by @soumyapani in #50
- Adding support for NFS fileshare. by @soumyapani in #51
- Release v0.1.0 by @soumyapani in #52
New Contributors
- @soumyapani made their first contribution in #2
- @DmitryKakurin made their first contribution in #32
Full Changelog: https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/commits/v0.1.0