
Commit a2f0e5a

Update ci.md and fix typo
We always run TPU tests on all PRs now after capacity expansion
1 parent 259b438 commit a2f0e5a

.github/ci.md

Lines changed: 10 additions & 9 deletions
@@ -6,9 +6,9 @@ PyTorch and PyTorch/XLA use CI to lint, build, and test each PR that is submitte

### Pinning PyTorch PR in PyTorch/XLA PR

-Sometimes a PyTorch/XLA PR needs to be pinned to a specific PyTorch PR to test new featurues, fix breaking changes, etc. Since PyTorch/XLA CI pulls from PyTorch master by default, we need to manually provided a PyTorch pin. In a PyTorch/XLA PR, PyTorch an be manually pinned by creating a `.torch_pin` file at the root of the repository. The `.torch_pin` should have the corresponding PyTorch PR number prefixed by "#". Take a look at [example here](https://github.com/pytorch/xla/pull/7313). Before the PyTorch/XLA PR gets merged, the `.torch_pin` must be deleted.
+Sometimes a PyTorch/XLA PR needs to be pinned to a specific PyTorch PR to test new features, fix breaking changes, etc. Since PyTorch/XLA CI pulls from PyTorch master by default, we need to manually provide a PyTorch pin. In a PyTorch/XLA PR, PyTorch can be manually pinned by creating a `.torch_pin` file at the root of the repository. The `.torch_pin` should contain the corresponding PyTorch PR number prefixed by "#". Take a look at the [example here](https://github.com/pytorch/xla/pull/7313). Before the PyTorch/XLA PR gets merged, the `.torch_pin` must be deleted.
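A minimal sketch of what that looks like in practice, using the PR number from the linked example rather than a value to reuse:

```bash
# From the root of the PyTorch/XLA checkout: pin CI to the PyTorch PR
# you need (#7313 below is the example linked above).
echo '#7313' > .torch_pin
git add .torch_pin
git commit -m "Pin PyTorch to PR #7313 for CI"
# Remember: delete .torch_pin before the PyTorch/XLA PR is merged.
```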

-### Coodinating merges for breaking PyTorch PRs
+### Coordinating merges for breaking PyTorch PRs

When a PyTorch PR introduces a breaking change, its PyTorch/XLA CI tests will fail. The steps for fixing and merging such a breaking PyTorch change are as follows:
1. Create a PyTorch/XLA PR to fix this issue with `.torch_pin` and rebase with master to ensure the PR is up-to-date with the latest commit on PyTorch/XLA. Once this PR is created, it'll create a commit hash that will be used in step 2. If you have multiple commits in the PR, use the last one's hash. **Important note: When you rebase this PR, it'll create a new commit hash and make the old hash obsolete. Be cautious about rebasing, and if you rebase, make sure you inform the PyTorch PR's author.**
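A quick way to grab the hash mentioned in step 1 (plain git, nothing repository-specific):

```bash
# On the rebased PyTorch/XLA PR branch: the hash of the last commit is
# what gets referenced in step 2. Note that rebasing changes this hash.
git log -1 --format=%H
```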
@@ -19,22 +19,23 @@ When PyTorch PR introduces a breaking change, its PyTorch/XLA CI tests will fail

### Running TPU tests on PRs

-By default, we only run TPU tests on a postsubmit basis to save capacity. If you are making a sensitive change, add the `tpuci` label to your PR. Note that the label must be present before `build_and_test.yml` triggers. If it has already run, make a new commit or rebase to trigger the CI again.
+The `build_and_test.yml` workflow runs tests on the TPU in addition to CPU and
+GPU. The set of tests run on the TPU is defined in `test/tpu/run_tests.sh`.
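If you want to run that same test set yourself, a sketch under the assumption that you are on a TPU VM with `torch` and `torch_xla` already installed:

```bash
# Assumption: run from the repository root on a machine with TPU access;
# this script is where the TPU test set is defined.
bash test/tpu/run_tests.sh
```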

## CI Environment

Before the CI in this repository runs, we build the base dev image. These are the same images we recommend in our VSCode `.devcontainer` setup and nightly build to ensure consistency between environments. We produce variants with and without CUDA, configured in `infra/ansible` (build config) and `infra/tpu-pytorch-releases/dev_images.tf` (build triggers).

The CI runs in two environments:

-1. Organization self-hosted runners for CPU and GPU: used for amost every step of the CI. These runners are managed by PyTorch and have access to the shared ECR repository.
-2. TPU self-hosted runners: these are managed by us and are only availabe in the `pytorch/xla` repository. See the [_TPU CI_](#tpu-ci) section for more details.
+1. Organization self-hosted runners for CPU and GPU: used for almost every step of the CI. These runners are managed by PyTorch and have access to the shared ECR repository.
+2. TPU self-hosted runners: these are managed by us and are only available in the `pytorch/xla` repository. See the [_TPU CI_](#tpu-ci) section for more details.

## Build and test (`build_and_test.yml`)

We have two build paths for each CI run:

-- `torch_xla`: we build the main package to support for both TPU and GPU[^1], along with a CPU bild of `torch` from HEAD. This build step exports the `torch-xla-wheels` artifact for downstream use in tests.
+- `torch_xla`: we build the main package with support for both TPU and GPU[^1], along with a CPU build of `torch` from HEAD. This build step exports the `torch-xla-wheels` artifact for downstream use in tests.
- Some CI tests also require `torchvision`. To reduce flakiness, we compile `torchvision` from [`torch`'s CI pin](https://github.com/pytorch/pytorch/blob/main/.github/ci_commit_pins/vision.txt).
- C++ tests are piggybacked onto the same build and uploaded in the `cpp-test-bin` artifact.
- `torch_xla_cuda_plugin`: the XLA CUDA runtime can be built independently of either `torch` or `torch_xla` -- it depends only on our pinned OpenXLA. Thus, this build should be almost entirely cached, unless your PR changes the XLA pin or adds a patch.
@@ -55,9 +56,9 @@ For the C++ test groups in either case, the test binaries are pre-built during t

### TPU CI

-The TPU CI runs only a subset of our tests due to capacity constraints, defined in `_tpu_ci.yml` `test/tpu/run_tests.sh`. The runners themselves are containers in GKE managed by [ARC](https://github.com/actions/actions-runner-controller). The container image is also based on our dev images, with some changes for ARC compatibility. The Dockerfile for this image lives in `test/tpu/Dockerfile`.
+The TPU CI workflow is defined in `_tpu_ci.yml`. It runs only a subset of our tests due to capacity constraints, defined in `test/tpu/run_tests.sh`. The runners themselves are containers in GKE managed by [ARC](https://github.com/actions/actions-runner-controller). The container image is also based on our dev images, with some changes for ARC compatibility. The Dockerfile for this image lives in `test/tpu/Dockerfile`.

-The actual ARC cluster is defined in Terraform at `infra/tpu-pytorch/tpu_ci.yml`.
+The actual ARC cluster is defined in Terraform at `infra/tpu-pytorch/tpu_ci.tf`.
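To inspect the runner image locally, a sketch assuming `test/tpu/Dockerfile` builds with the repository root as its context (not verified here):

```bash
# Build the ARC-compatible TPU CI image locally; the tag name is arbitrary.
docker build -f test/tpu/Dockerfile -t tpu-ci-runner:local .
```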

### Reproducing test failures

@@ -70,7 +71,7 @@ If you cannot reproduce the failure or need to inspect the package built in a CI
Our API documentation is generated automatically from the `torch_xla` package with `sphinx`. The workflow to update our static site is defined in `_docs.yml`. The workflow is roughly the following:

- Changes to `master` update the docs at `/master` on the `gh-pages` branch.
-- Changes to a release brance update the docs under `/releases/rX.Y`.
+- Changes to a release branch update the docs under `/releases/rX.Y`.

By default, we redirect to the latest stable version, defined in [`index.md`](https://github.com/pytorch/xla/blob/gh-pages/index.md).
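For a local preview of the generated API docs, a rough sketch; the source directory and dependencies below are assumptions, not taken from this page:

```bash
# Assumptions: Sphinx sources live under docs/ and torch_xla is importable
# in the current Python environment; adjust paths to the actual layout.
pip install sphinx
sphinx-build -b html docs/ /tmp/xla-docs
```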