Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC + PR] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commit Sync #11

Merged

Conversation

sjain-stanford
Copy link
Collaborator

@sjain-stanford sjain-stanford commented Oct 18, 2023

Why

When bumping LLVM up, it is crucial to be able to test all downstream repos depending on it to ensure they work in tandem (and not just in isolation).

In the past, LLVM upgrades were simpler because torch-mlir took a hard dependency on mhlo/stablehlo and, in doing so, ensured that the llvm "green commit" (sha1) that torch-mlir and stablehlo were built+tested against was pre-identified. During this time mlir-tcp was developed on a branch of torch-mlir.

This meant when upgrades were needed downstream, we’d simply point to torch-mlir@HEAD (sha4) and pick the llvm-project (sha1) and mhlo/stablehlo (sha3) hashes it’d refer to, since these are already tested to work together. This became our set of green commits (llvm@sha1, stablehlo@sha3, torch-mlir@sha4) for downstream integrations (e.g cruise monorepo).

image

At present the situation is complicated because torch-mlir no longer takes a hard dependency on stablehlo (stablehlo e2e tests disabled).

Here's details from a recent upgrade scenario that motivated this RFC.

We picked torch-mlir@HEAD which was right after the llvm bump in llvm/torch-mlir#2511 pointing to llvm/llvm-project@b44b349, but soon realized (when we started building torch-mlir) that the llvm bazel build upstream was broken:

ERROR: /root/.cache/bazel/_bazel_root/b89349c08f7224396763d14fe35cba11/external/llvm-project/mlir/BUILD.bazel:5837:18: TdGenerate
external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/NVVMOpsInterface.h.inc failed: (Exit 1): mlir-tblgen failed: error executing command ...
                                                                                                                                                    
external/llvm-project/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td:20:9: error: Could not find include file 'mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td'                                                                                                           
include "mlir/Dialect/LLVMIR/BasicPtxBuilderInterface.td"                                                                                                                                                                                                              
        ^                                                                                                                                                                                                                                                              

The bazel fixes followed in a subsequent commit at llvm/llvm-project@28b27c1. Hence llvm had to be re-bumped in torch-mlir (llvm/torch-mlir#2517). However, after a bit more work we hit these failing stablehlo tests, which surfaced the fact that stablehlo pointed to by torch-mlir could no longer be used, and we had to separately identify the sha3 of stablehlo that would build cleanly against sha1 of llvm.

@stablehlo//stablehlo/conversions/tosa/tests:binary.mlir.test            FAILED in 0.7s                                                       
@stablehlo//stablehlo/tests:print_stablehlo.mlir.test                    FAILED in 4.7s

This meant the burden of identifying the llvm green commit (that works across the board) is shifted further downstream from torch-mlir. Incidentally we are in a great position to leverage mlir-tcp to identify the set of green commits, given it already directly depends on each of these repos.

image

What

This PR is an attempt to leverage the mlir-tcp repo as our "proxy" for such downstream integrations, and I think contains everything needed to be able to do that.

How

Specifically, we should now be able to run these from the comfort of mlir-tcp:

bazel test --config=clang_linux @llvm-project//mlir/...
bazel test --config=clang_linux @stablehlo//...
bazel test --config=clang_linux @torch-mlir//...

We provide local_repos.bzl that allows easier local testing of patches that later need to be upstreamed, and while they're being upstreamed we could land them as patches to our http_archive targets.

Note: I include a stablehlo.patch that allows testing stablehlo from mlir-tcp. This is temporary and can be removed once openxla/stablehlo#1810 lands.

This PR also enables each of the 3p test suites as GHA workflows (non-merge gating for now, we can change this). These workflows are automatically skipped unless a change is made to deps.bzl (which usually means bumping 3p deps), as it would be unnecessary to run them for every PR and main commit post-merge.

Here's a snapshot from this PR's workflows, having bumped stablehlo commit.

image

@sjain-stanford sjain-stanford changed the title Patch stablehlo to allow testing it from mlir-tcp LLVM / Torch-MLIR / StableHLO "Green Commits" Syncing Oct 19, 2023
@sjain-stanford sjain-stanford changed the title LLVM / Torch-MLIR / StableHLO "Green Commits" Syncing LLVM / Torch-MLIR / StableHLO "Green Commits" Syncing RFC Oct 19, 2023
@sjain-stanford sjain-stanford changed the title LLVM / Torch-MLIR / StableHLO "Green Commits" Syncing RFC [RFC] {LLVM / Torch-MLIR / StableHLO} Green Commits Syncing Oct 19, 2023
@sjain-stanford sjain-stanford force-pushed the sambhav/stablehlo_patch branch from b95abe0 to 7a20a4b Compare October 19, 2023 13:57
@sjain-stanford sjain-stanford changed the title [RFC] {LLVM / Torch-MLIR / StableHLO} Green Commits Syncing Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commits Sync Oct 19, 2023
@sjain-stanford sjain-stanford changed the title Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commits Sync [RFC] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commits Sync Oct 19, 2023
@sjain-stanford sjain-stanford changed the title [RFC] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commits Sync [RFC] {LLVM / Torch-MLIR / StableHLO} Green Commits Sync Oct 19, 2023
@sjain-stanford sjain-stanford changed the title [RFC] {LLVM / Torch-MLIR / StableHLO} Green Commits Sync [RFC] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commits Sync Oct 19, 2023
@sjain-stanford sjain-stanford changed the title [RFC] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commits Sync [RFC] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commit Sync Oct 19, 2023
@sjain-stanford sjain-stanford changed the title [RFC] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commit Sync [RFC + PR] Use TCP for {LLVM / Torch-MLIR / StableHLO} Green Commit Sync Oct 19, 2023
path = local_stablehlo_repo_path(),
)
else:
STABLEHLO_COMMIT = "5a8bb985f50a679721292b14f97f270344ac64a3"
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: we're bumping stablehlo commit here

# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
# Also available under a BSD-style license. See LICENSE.

third_party/
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where local repos may be cloned

branches:
- main
paths:
- 'deps.bzl'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is so the workflow is triggered only when change is made to deps.bzl, as otherwise we shouldn't have to re-run a workflow (3p test status shouldn't change)

@@ -23,7 +25,7 @@ concurrency:

jobs:
ubuntu-build:
name: ubuntu-x86_64
name: ubuntu-x86_64 / mlir-tcp
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name change allows selectively making certain workflows merge gating

@sjain-stanford sjain-stanford marked this pull request as ready for review October 19, 2023 15:39
@sjain-stanford sjain-stanford requested review from sanjoy, a team and navahgar October 19, 2023 15:39
@@ -0,0 +1,145 @@
diff --git a/BUILD.bazel b/BUILD.bazel
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is needed to be able to bazel test @stablehlo//... from mlir-tcp (can be removed once openxla/stablehlo#1810 lands)

Copy link
Collaborator

@navahgar navahgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sjain-stanford sjain-stanford merged commit 1852bea into cruise-automation:main Oct 20, 2023
4 checks passed
@sjain-stanford sjain-stanford deleted the sambhav/stablehlo_patch branch October 20, 2023 01:09
srinathava pushed a commit to srinathava/mlir-tcp that referenced this pull request Jun 24, 2024
Updates mlir-tcp to `90768ec2801ed9959144e8a1ca800e34ee2c7f54` resolves
some minor merge conflicts around testing changes in [this
PR](cruise-automation#28).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants