Releases: GoogleCloudPlatform/ai-infra-cluster-provisioning
v1.5.0
What's Changed
- Fix NCCL_SOCKET_IFNAME typo in values.yaml under nccltest/gke by @hmhv1222 in #357
- Replace hardcoded parameters with environment variables in litgpt_container.sh by @samcmho in #359
- Update setup_and_launch_training.sh by @samcmho in #361
- Update README.md for all customers to cover all-to-all by @samcmho in #362
- remove default node pool deletion by @stevenBorisko in #351
- Pirillo/litgpt nvtx by @Chris113113 in #354
- Update lit_gpt commit to PyTorch 2.2 by @Chris113113 in #364
- Update setup_and_launch_training.sh by @samcmho in #363
- Merging Develop -> Main for sample_workloads changes by @Chris113113 in #366
- Add A3-Megagpu-8g SKU to tool. by @Chris113113 in #367
- Update NCCL link and rename a3-mega GKE in terraform module by @samcmho in #370
- Add A3-Mega support by @Chris113113 in #371
- Bump version to 1.5.0 by @Chris113113 in #373
Full Changelog: v1.4.2...v1.5.0
v1.4.2
What's Changed
- Merging main to develop by @soumyapani in #327
- staging changes for litgpt by @Chris113113 in #330
- Add litgpt readme by @Chris113113 in #332
- Small fixes to Lit-GPT demo by @gkroiz in #334
- remove profiling setup (currently not used) by @gkroiz in #335
- Adding details to explain MFU calculation by @parambole in #339
- Update rxdm image version by @Chris113113 in #341
- Fix unsupported env vars being set for SLURM cluster (#343) by @parambole in #344
- Adding SLURM scripts to setup and launch lit-gpt training by @parambole in #342
- Adding a simple Multi-Node Pingpong PyTorch Workload by @parambole in #347
- Update litgpt LKG, more params for injection by @Chris113113 in #348
- Add a nccl-test sample workload by @Chris113113 in #345
New Contributors
- @parambole made their first contribution in #339
Full Changelog: v1.4.1...v1.4.2
v1.4.1
What's Changed
- Specify tcpx image versions in mig-cos by @soumyapani in #326
Full Changelog: v1.4.0...v1.4.1
v1.4.0
What's Changed
- litgpt demo by @Chris113113 in #319
- litgpt demo with tcpx lkg by @gkroiz in #320
- Add env var injection of num_nodes, datapath, batchsize, mbs by @Chris113113 in #321
- Adding support for existing resource policy. by @soumyapani in #322
- updates to fix issues with sb cluster creation by @stevenBorisko in #323
New Contributors
- @Chris113113 made their first contribution in #319
- @gkroiz made their first contribution in #320
Full Changelog: v1.3.0...v1.4.0
Release V1.3.0
What's Changed
- Borisko/update mtu by @stevenBorisko in #314
- Borisko/disable deletion protection by @stevenBorisko in #315
- Adding support for A2 MIGs in CPT by @soumyapani in #316
Full Changelog: v1.2.1...v1.3.0
Release v1.2.1
What's Changed
- Enabling continuous test integration with GCB by @soumyapani in #297
- Using fixup daemonset for TCPX support and secondary IP range in gke-beta by @soumyapani in #304
- Fixing test failure for secondary IP range. by @soumyapani in #309
- Fixing continuous test break. by @soumyapani in #311
Full Changelog: v1.1.0...v1.2.1
Release v1.1.0
What's Changed
- Adding support for using stable fleet in GKE node pools. by @soumyapani in #292
- Adding capability to resize node pool in GKE beta. by @soumyapani in #293
Full Changelog: v1.0.1...v1.1.0
v1.0.1
v1.0.0
What's Changed
- Adding support for GKE BYOPP. by @soumyapani in #283
- split repo by machine type by @stevenBorisko in #284
- update gke to have installer daemonsets by @stevenBorisko in #286
Full Changelog: v0.15.0...v1.0.0
v0.15.0
What's Changed
- Reducing GKE nodepool name by removing prefix. by @soumyapani in #278
- Supporting multi-NIC network for GKE cluster by @soumyapani in #279
- Borisko/a3 docs by @stevenBorisko in #280
Full Changelog: v0.14.0...v0.15.0