PyTorch and PyTorch/XLA use CI to lint, build, and test each PR that is submitted.

### Pinning a PyTorch PR in a PyTorch/XLA PR

Sometimes a PyTorch/XLA PR needs to be pinned to a specific PyTorch PR to test new features, fix breaking changes, etc. Since PyTorch/XLA CI pulls from PyTorch master by default, we need to manually provide a PyTorch pin. In a PyTorch/XLA PR, PyTorch can be manually pinned by creating a `.torch_pin` file at the root of the repository. The `.torch_pin` should have the corresponding PyTorch PR number prefixed by "#". Take a look at the [example here](https://github.com/pytorch/xla/pull/7313). Before the PyTorch/XLA PR gets merged, the `.torch_pin` must be deleted.
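
As a minimal sketch (the PR number below is hypothetical):

```sh
# From the root of the pytorch/xla checkout: pin CI to PyTorch PR #12345 (hypothetical).
echo '#12345' > .torch_pin
git add .torch_pin
git commit -m "Pin PyTorch PR #12345 for CI testing"
# Remember: delete .torch_pin before this PR is merged.
```
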
### Coordinating merges for breaking PyTorch PRs

When a PyTorch PR introduces a breaking change, its PyTorch/XLA CI tests will fail. The steps for fixing and merging such a breaking PyTorch change are as follows:

1. Create a PyTorch/XLA PR to fix this issue with `.torch_pin` and rebase it on master to ensure the PR is up-to-date with the latest commit on PyTorch/XLA. Once this PR is created, it'll have a head commit hash that will be used in step 2. If you have multiple commits in the PR, use the last one's hash; a quick way to print it is shown below. **Important note: When you rebase this PR, it'll create a new commit hash and make the old hash obsolete. Be cautious about rebasing, and if you rebase, make sure you inform the PyTorch PR's author.**
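
A quick way to print that hash from your checked-out PR branch:

```sh
# Full hash of the branch's most recent commit (the one step 2 needs).
git log -1 --format=%H
```
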
### Running TPU tests on PRs

The `build_and_test.yml` workflow runs tests on the TPU in addition to CPU and GPU. The set of tests run on the TPU is defined in `test/tpu/run_tests.sh`.
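
If you have access to a TPU VM, a minimal sketch of running the same test set locally (assuming `torch` and `torch_xla` wheels are already installed):

```sh
# From the root of a pytorch/xla checkout on a TPU VM:
./test/tpu/run_tests.sh
```
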
## CI Environment

Before the CI in this repository runs, we build the base dev image. These are the same images we recommend in our VSCode `.devcontainer` setup and nightly build to ensure consistency between environments. We produce variants with and without CUDA, configured in `infra/ansible` (build config) and `infra/tpu-pytorch-releases/dev_images.tf` (build triggers).

The CI runs in two environments:

1. Organization self-hosted runners for CPU and GPU: used for almost every step of the CI. These runners are managed by PyTorch and have access to the shared ECR repository.
2. TPU self-hosted runners: these are managed by us and are only available in the `pytorch/xla` repository. See the [_TPU CI_](#tpu-ci) section for more details.

## Build and test (`build_and_test.yml`)

We have two build paths for each CI run:

- `torch_xla`: we build the main package to support both TPU and GPU[^1], along with a CPU build of `torch` from HEAD. This build step exports the `torch-xla-wheels` artifact for downstream use in tests.
  - Some CI tests also require `torchvision`. To reduce flakiness, we compile `torchvision` from [`torch`'s CI pin](https://github.com/pytorch/pytorch/blob/main/.github/ci_commit_pins/vision.txt); a sketch of this is shown after the list.
  - C++ tests are piggybacked onto the same build and uploaded in the `cpp-test-bin` artifact.
- `torch_xla_cuda_plugin`: the XLA CUDA runtime can be built independently of either `torch` or `torch_xla` -- it depends only on our pinned OpenXLA. Thus, this build should be almost entirely cached, unless your PR changes the XLA pin or adds a patch.
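
As an illustration, a rough sketch of installing `torchvision` at that pinned commit (CI's actual build steps may differ):

```sh
# Resolve the pinned commit, then build torchvision from source at that revision.
pin=$(curl -fsSL https://raw.githubusercontent.com/pytorch/pytorch/main/.github/ci_commit_pins/vision.txt)
pip install "git+https://github.com/pytorch/vision.git@${pin}"
```
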
### TPU CI

The TPU CI workflow is defined in `_tpu_ci.yml`. It runs only a subset of our tests due to capacity constraints; that subset is defined in `test/tpu/run_tests.sh`. The runners themselves are containers in GKE managed by [ARC](https://github.com/actions/actions-runner-controller). The container image is also based on our dev images, with some changes for ARC compatibility. The Dockerfile for this image lives in `test/tpu/Dockerfile`.
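
A minimal sketch of building that image locally (the tag and build context are assumptions; CI's actual build arguments may differ):

```sh
# Build the TPU CI runner image; build context assumed to be the repository root.
docker build -t tpu-ci-runner -f test/tpu/Dockerfile .
```
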
The actual ARC cluster is defined in Terraform at `infra/tpu-pytorch/tpu_ci.tf`.
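
To preview edits to that definition without applying them, a sketch (assuming you have credentials for the underlying GCP project):

```sh
cd infra/tpu-pytorch
terraform init   # download providers and configure the state backend
terraform plan   # show what would change, without applying anything
```
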
### Reproducing test failures

### Documentation

Our API documentation is generated automatically from the `torch_xla` package with `sphinx`. The workflow to update our static site is defined in `_docs.yml`. The workflow is roughly as follows:

- Changes to `master` update the docs at `/master` on the `gh-pages` branch.
- Changes to a release branch update the docs under `/releases/rX.Y`.

By default, we redirect to the latest stable version, defined in [`index.md`](https://github.com/pytorch/xla/blob/gh-pages/index.md).
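
To browse the published sources locally, a sketch using a detached worktree:

```sh
# Check out the gh-pages branch next to your working copy without switching branches.
git fetch origin gh-pages
git worktree add --detach ../xla-gh-pages origin/gh-pages
ls ../xla-gh-pages/releases   # one subdirectory per release, e.g. r2.4 (hypothetical)
```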