[RFC + PR] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commit Sync (#11)

## Why
When bumping LLVM, it is crucial to test all downstream repos that
depend on it to ensure they work **in tandem** (and not just in
isolation).

In the past, LLVM upgrades were simpler because torch-mlir took a hard
dependency on mhlo/stablehlo and, in doing so, ensured that the llvm
"green commit" (sha1) that torch-mlir and stablehlo were built+tested
against was pre-identified. During this time mlir-tcp was developed on a
branch of torch-mlir.

This meant that when upgrades were needed downstream, we’d simply point
to torch-mlir@HEAD (sha4) and pick the llvm-project (sha1) and
mhlo/stablehlo (sha3) hashes it referred to, since these were already
tested to work together. This became our set of green commits
(llvm@sha1, stablehlo@sha3, torch-mlir@sha4) for downstream integrations
(e.g. the Cruise monorepo).

<img width="500" alt="image"
src="https://github.com/cruise-automation/mlir-tcp/assets/19234106/42078522-466c-449f-8d7e-496facc1447c">

At present the situation is complicated because torch-mlir no longer
takes a hard dependency on stablehlo (stablehlo e2e tests
[disabled](llvm/torch-mlir#2460)).

Here are the details from a recent upgrade scenario that motivated this RFC.

We picked torch-mlir@HEAD which was right after the llvm bump in
llvm/torch-mlir#2511 pointing to
llvm/llvm-project@b44b349,
but soon realized (when we started building torch-mlir) that the llvm
bazel build upstream was broken:

```
ERROR: /root/.cache/bazel/_bazel_root/b89349c08f7224396763d14fe35cba11/external/llvm-project/mlir/BUILD.bazel:5837:18: TdGenerate
external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/NVVMOpsInterface.h.inc failed: (Exit 1): mlir-tblgen failed: error executing command ...

external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td:20:9: error: Could not find include file 'mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td'
include "mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td"
        ^
```

The bazel fixes followed in a subsequent commit at
llvm/llvm-project@28b27c1.
Hence llvm had to be re-bumped in torch-mlir
(llvm/torch-mlir#2517). However, after a bit
more work we hit the failing stablehlo tests below, which showed that
the stablehlo commit pointed to by torch-mlir could no longer be used,
and we had to separately identify the sha3 of stablehlo that would build
cleanly against sha1 of llvm.

```
@stablehlo//stablehlo/conversions/tosa/tests:binary.mlir.test            FAILED in 0.7s
@stablehlo//stablehlo/tests:print_stablehlo.mlir.test                    FAILED in 4.7s
```


This means the burden of identifying the llvm green commit (one that
works across the board) has shifted further downstream from torch-mlir.
Incidentally, we are in a great position to leverage mlir-tcp to identify
the set of green commits, since it already depends directly on each of
these repos.

<img width="500" alt="image"
src="https://github.com/cruise-automation/mlir-tcp/assets/19234106/cadd38c4-71ec-45b0-8888-85ac0bfd4e99">


## What
This PR is an attempt to leverage the mlir-tcp repo as our "proxy" for
such downstream integrations, and _I think_ contains everything needed
to be able to do that.

## How
Specifically, we should now be able to run these from the comfort of
`mlir-tcp`:

```shell
bazel test --config=clang_linux @llvm-project//mlir/...
bazel test --config=clang_linux @stablehlo//...
bazel test --config=clang_linux @torch-mlir//...
```
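
As a rough sketch, the three suites could be wrapped in a small loop that keeps going after a failure and reports which suites were red, which helps when narrowing down a set of green commits. The function name `run_green_commit_checks` and the `BAZEL` override variable are hypothetical and not part of this PR; only the targets and `--config=clang_linux` come from the commands above.

```shell
# Hypothetical helper (not part of this PR): run each third-party test suite
# and report which ones failed, instead of stopping at the first failure.
# BAZEL is an assumed override hook (e.g. to point at a wrapper); it
# defaults to plain `bazel`.
run_green_commit_checks() {
  failed=""
  for target in "@llvm-project//mlir/..." "@stablehlo//..." "@torch-mlir//..."; do
    echo "=== ${target}"
    if ! "${BAZEL:-bazel}" test --config=clang_linux "${target}"; then
      # Remember the failing suite and continue with the next one.
      failed="${failed} ${target}"
    fi
  done

  if [ -n "${failed}" ]; then
    echo "FAILED:${failed}"
    return 1
  fi
  echo "All third-party suites passed"
}
```

Calling `run_green_commit_checks` from the `mlir-tcp` root would then return non-zero if any of the three suites fails.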

We provide `local_repos.bzl`, which makes it easier to locally test
patches that later need to be upstreamed; while they're being upstreamed,
we could land them as patches to our `http_archive` targets.

Note: I include a `stablehlo.patch` that allows testing stablehlo from
`mlir-tcp`. This is temporary and can be removed once
openxla/stablehlo#1810 lands.

This PR also enables each of the 3p test suites as GHA workflows
(not merge-gating for now; we can change this). These workflows are
automatically skipped unless a change is made to `deps.bzl` (which
usually means bumping 3p deps), since it would be unnecessary to run
them for every PR and every post-merge commit on `main`.
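
To predict locally whether a branch would trigger these workflows, one could check whether the top-level `deps.bzl` is among the changed files. This is a minimal sketch: the function name `touches_deps_bzl` and the `origin/main` base ref in the usage comment are assumptions, and GitHub's own evaluation of the `paths` filter may differ in edge cases.

```shell
# Hypothetical helper (not part of this PR): reads changed file paths on
# stdin, one per line, and succeeds only if the top-level deps.bzl is listed.
# -x matches whole lines and -F treats the pattern as a fixed string, so
# nested paths such as third_party/deps.bzl do not match.
touches_deps_bzl() {
  grep -qxF 'deps.bzl'
}

# Example usage (assumes the base branch is available as origin/main):
#   if git diff --name-only origin/main...HEAD | touches_deps_bzl; then
#     echo "3p workflows would run"
#   else
#     echo "3p workflows would be skipped"
#   fi
```

A manual `workflow_dispatch` run is independent of this check, since it does not go through the `paths` filter.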

Here's a snapshot from this PR's workflows, after bumping the stablehlo
commit.

<img width="747" alt="image"
src="https://github.com/cruise-automation/mlir-tcp/assets/19234106/e535ed39-33f7-4941-958c-3a5d0c0adef6">
sjain-stanford authored Oct 20, 2023
1 parent cb22a7c commit 1852bea
Showing 10 changed files with 527 additions and 36 deletions.
6 changes: 6 additions & 0 deletions .bazelignore
@@ -0,0 +1,6 @@
# Licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
# Also available under a BSD-style license. See LICENSE.

third_party/
82 changes: 82 additions & 0 deletions .github/workflows/bazelBuildAndTestLlvm.yml
@@ -0,0 +1,82 @@
# Licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
# Also available under a BSD-style license. See LICENSE.

name: Bazel Build and Test (llvm-project)

# Only run when llvm-project hash changes (deps.bzl)
on:
  pull_request:
    branches:
      - main
    paths:
      - 'deps.bzl'
  push:
    branches:
      - main
    paths:
      - 'deps.bzl'
  workflow_dispatch:

# Ensure that only a single job or workflow using the same
# concurrency group will run at a time. This would cancel
# any in-progress jobs in the same github workflow and github
# ref (e.g. refs/heads/main or refs/pull/<pr_number>/merge).
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true


jobs:
  ubuntu-build:
    name: ubuntu-x86_64 / llvm-project
    runs-on: ubuntu-latest

    steps:
      - name: Checkout mlir-tcp
        uses: actions/checkout@v3

      # Continually update cache even if there's a "hit" during
      # restore to avoid the cache going stale over time
      # https://github.com/actions/cache/blob/main/tips-and-workarounds.md#update-a-cache
      - name: Setup cache for bazel
        uses: actions/cache@v3
        with:
          path: ~/.cache/bazel
          key: llvm-project-bazel-build-cache-${{ runner.os }}-${{ github.sha }}
          restore-keys: |
            llvm-project-bazel-build-cache-${{ runner.os }}

      # Change bazel cache directory to root ownership
      # to allow writing to it from within the docker container.
      # If no cache hits, this directory is not present
      # so don't run chown (will error otherwise).
      - name: Set bazel cache permissions
        run: |
          if [ -d "${HOME}/.cache/bazel" ]; then
            sudo chown -R root:root "${HOME}/.cache/bazel"
          fi

      - name: Build docker image
        run: |
          docker build -f docker/Dockerfile \
                       -t mlir-tcp:ci \
                       .

      - name: Bazel build and test llvm-project
        run: |
          docker run --rm \
                     -v "$(pwd)":"/opt/src/mlir-tcp" \
                     -v "${HOME}/.cache/bazel":"/root/.cache/bazel" \
                     mlir-tcp:ci \
                     bazel test --config=clang_linux @llvm-project//mlir/...

      # Switch back bazel cache directory to user ownership
      # to allow GHA post-cache step to save cache without
      # permissions issue.
      - name: Switch bazel cache permissions
        run: |
          if [ -d "${HOME}/.cache/bazel" ]; then
            sudo chown -R "$USER":"$USER" "${HOME}/.cache/bazel"
          fi
82 changes: 82 additions & 0 deletions .github/workflows/bazelBuildAndTestStablehlo.yml
@@ -0,0 +1,82 @@
# Licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
# Also available under a BSD-style license. See LICENSE.

name: Bazel Build and Test (stablehlo)

# Only run when stablehlo hash changes (deps.bzl)
on:
  pull_request:
    branches:
      - main
    paths:
      - 'deps.bzl'
  push:
    branches:
      - main
    paths:
      - 'deps.bzl'
  workflow_dispatch:

# Ensure that only a single job or workflow using the same
# concurrency group will run at a time. This would cancel
# any in-progress jobs in the same github workflow and github
# ref (e.g. refs/heads/main or refs/pull/<pr_number>/merge).
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true


jobs:
  ubuntu-build:
    name: ubuntu-x86_64 / stablehlo
    runs-on: ubuntu-latest

    steps:
      - name: Checkout mlir-tcp
        uses: actions/checkout@v3

      # Continually update cache even if there's a "hit" during
      # restore to avoid the cache going stale over time
      # https://github.com/actions/cache/blob/main/tips-and-workarounds.md#update-a-cache
      - name: Setup cache for bazel
        uses: actions/cache@v3
        with:
          path: ~/.cache/bazel
          key: stablehlo-bazel-build-cache-${{ runner.os }}-${{ github.sha }}
          restore-keys: |
            stablehlo-bazel-build-cache-${{ runner.os }}

      # Change bazel cache directory to root ownership
      # to allow writing to it from within the docker container.
      # If no cache hits, this directory is not present
      # so don't run chown (will error otherwise).
      - name: Set bazel cache permissions
        run: |
          if [ -d "${HOME}/.cache/bazel" ]; then
            sudo chown -R root:root "${HOME}/.cache/bazel"
          fi

      - name: Build docker image
        run: |
          docker build -f docker/Dockerfile \
                       -t mlir-tcp:ci \
                       .

      - name: Bazel build and test stablehlo
        run: |
          docker run --rm \
                     -v "$(pwd)":"/opt/src/mlir-tcp" \
                     -v "${HOME}/.cache/bazel":"/root/.cache/bazel" \
                     mlir-tcp:ci \
                     bazel test --config=clang_linux @stablehlo//...

      # Switch back bazel cache directory to user ownership
      # to allow GHA post-cache step to save cache without
      # permissions issue.
      - name: Switch bazel cache permissions
        run: |
          if [ -d "${HOME}/.cache/bazel" ]; then
            sudo chown -R "$USER":"$USER" "${HOME}/.cache/bazel"
          fi
12 changes: 7 additions & 5 deletions .github/workflows/bazelBuildAndTestTcp.yml
@@ -3,13 +3,15 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 # Also available under a BSD-style license. See LICENSE.

-name: Bazel Build and Test
+name: Bazel Build and Test (mlir-tcp)

 on:
   pull_request:
-    branches: [ main ]
+    branches:
+      - main
   push:
-    branches: [ main ]
+    branches:
+      - main
   workflow_dispatch:

 # Ensure that only a single job or workflow using the same
@@ -23,7 +25,7 @@ concurrency:

 jobs:
   ubuntu-build:
-    name: ubuntu-x86_64
+    name: ubuntu-x86_64 / mlir-tcp
     runs-on: ubuntu-latest

     steps:
@@ -32,7 +34,7 @@ jobs:

       # Continually update cache even if there's a "hit" during
       # restore to avoid the cache going stale over time
-      # https://github.com/actions/cache/blob/main/workarounds.md#update-a-cache
+      # https://github.com/actions/cache/blob/main/tips-and-workarounds.md#update-a-cache
       - name: Setup cache for bazel
         uses: actions/cache@v3
         with:
82 changes: 82 additions & 0 deletions .github/workflows/bazelBuildAndTestTorchmlir.yml
@@ -0,0 +1,82 @@
# Licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
# Also available under a BSD-style license. See LICENSE.

name: Bazel Build and Test (torch-mlir)

# Only run when torch-mlir hash changes (deps.bzl)
on:
  pull_request:
    branches:
      - main
    paths:
      - 'deps.bzl'
  push:
    branches:
      - main
    paths:
      - 'deps.bzl'
  workflow_dispatch:

# Ensure that only a single job or workflow using the same
# concurrency group will run at a time. This would cancel
# any in-progress jobs in the same github workflow and github
# ref (e.g. refs/heads/main or refs/pull/<pr_number>/merge).
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true


jobs:
  ubuntu-build:
    name: ubuntu-x86_64 / torch-mlir
    runs-on: ubuntu-latest

    steps:
      - name: Checkout mlir-tcp
        uses: actions/checkout@v3

      # Continually update cache even if there's a "hit" during
      # restore to avoid the cache going stale over time
      # https://github.com/actions/cache/blob/main/tips-and-workarounds.md#update-a-cache
      - name: Setup cache for bazel
        uses: actions/cache@v3
        with:
          path: ~/.cache/bazel
          key: torch-mlir-bazel-build-cache-${{ runner.os }}-${{ github.sha }}
          restore-keys: |
            torch-mlir-bazel-build-cache-${{ runner.os }}

      # Change bazel cache directory to root ownership
      # to allow writing to it from within the docker container.
      # If no cache hits, this directory is not present
      # so don't run chown (will error otherwise).
      - name: Set bazel cache permissions
        run: |
          if [ -d "${HOME}/.cache/bazel" ]; then
            sudo chown -R root:root "${HOME}/.cache/bazel"
          fi

      - name: Build docker image
        run: |
          docker build -f docker/Dockerfile \
                       -t mlir-tcp:ci \
                       .

      - name: Bazel build and test torch-mlir
        run: |
          docker run --rm \
                     -v "$(pwd)":"/opt/src/mlir-tcp" \
                     -v "${HOME}/.cache/bazel":"/root/.cache/bazel" \
                     mlir-tcp:ci \
                     bazel test --config=clang_linux @torch-mlir//...

      # Switch back bazel cache directory to user ownership
      # to allow GHA post-cache step to save cache without
      # permissions issue.
      - name: Switch bazel cache permissions
        run: |
          if [ -d "${HOME}/.cache/bazel" ]; then
            sudo chown -R "$USER":"$USER" "${HOME}/.cache/bazel"
          fi
14 changes: 10 additions & 4 deletions .gitignore
@@ -1,4 +1,10 @@
-/bazel-bin
-/bazel-out
-/bazel-mlir-tcp
-/bazel-testlogs
+# Licensed under the Apache License v2.0 with LLVM Exceptions.
+# See https://llvm.org/LICENSE.txt for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+# Also available under a BSD-style license. See LICENSE.
+
+bazel-bin
+bazel-out
+bazel-mlir-tcp
+bazel-testlogs
+third_party/
14 changes: 13 additions & 1 deletion README.md
@@ -3,7 +3,7 @@ Tensor Compute Primitives

 Mid-level intermediate representation for machine learning programs.

-![Bazel Build](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestTcp.yml/badge.svg)
+[![Bazel Build and Test (mlir-tcp)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestTcp.yml/badge.svg)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestTcp.yml)

 :construction: **This project is under active development (WIP).**

@@ -40,3 +40,15 @@ find . -type f -name "*.cpp" -o -name "*.h" | xargs clang-format -i
 # buildifer
 bazel run --config=clang_linux //:buildifier
 ```
+
+When bumping upstream dependencies (LLVM, Torch-MLIR, StableHLO), you may validate the set of "green commits" by running the corresponding third-party tests:
+```shell
+bazel test --config=clang_linux @llvm-project//mlir/...
+bazel test --config=clang_linux @torch-mlir//...
+bazel test --config=clang_linux @stablehlo//...
+```
+
+The following CI workflows are automatically triggered anytime upstream dependencies (`deps.bzl`) are updated:
+- [![Bazel Build and Test (llvm-project)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestLlvm.yml/badge.svg)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestLlvm.yml)
+- [![Bazel Build and Test (torch-mlir)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestTorchmlir.yml/badge.svg)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestTorchmlir.yml)
+- [![Bazel Build and Test (stablehlo)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestStablehlo.yml/badge.svg)](https://github.com/cruise-automation/mlir-tcp/actions/workflows/bazelBuildAndTestStablehlo.yml)