What's Changed
- Fix NCCL_SOCKET_IFNAME typo in values.yaml under nccltest/gke by @hmhv1222 in #357
- Replace hardcoded parameters with environment variables in litgpt_container.sh by @samcmho in #359
- Update setup_and_launch_training.sh by @samcmho in #361
- Update README.md for all customers to cover all-to-all by @samcmho in #362
- Remove default node pool deletion by @stevenBorisko in #351
- Pirillo/litgpt nvtx by @Chris113113 in #354
- Update lit_gpt commit to PyTorch 2.2 by @Chris113113 in #364
- Update setup_and_launch_training.sh by @samcmho in #363
- Merging Develop -> Main for sample_workloads changes by @Chris113113 in #366
- Add A3-Megagpu-8g SKU to tool by @Chris113113 in #367
- Update NCCL link and rename a3-mega GKE in terraform module by @samcmho in #370
- Add A3-Mega support by @Chris113113 in #371
- Bump version to 1.5.0 by @Chris113113 in #373
Full Changelog: v1.4.2...v1.5.0