[sharktank] Update shark-ai CIs with latest install #609

Open
wants to merge 24 commits into
base: main
from

Conversation

archana-ramalingam
Collaborator

@archana-ramalingam archana-ramalingam commented Nov 26, 2024

  • Pin iree-turbine version for pre-submit CIs
  • Pull nightly releases of sharktank, shortfin and iree-turbine for nightly CIs

@archana-ramalingam
Collaborator Author

We want to pull in the IREE pre-release rather than stable so that we catch regressions earlier and fix them.

@archana-ramalingam archana-ramalingam deleted the update-perplexity-ci-install branch November 26, 2024 17:31
@ScottTodd
Member

> We want to pull in the IREE pre-release rather than stable so that we catch regressions earlier and fix them.

Most jobs running on pull_request and push triggers should use pinned versions, so they are predictable. Scheduled jobs can use unpinned / latest versions for early signal.
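One way to picture that split is a small helper that picks pip arguments from the workflow trigger. This is only a sketch under stated assumptions: the function name `pip_install_args` is hypothetical, the pinned version `3.0.0` is borrowed from the pip log later in this thread, and the real workflows may select versions differently. `GITHUB_EVENT_NAME` is the variable GitHub Actions sets for the triggering event, and `--pre` and `--upgrade` are standard pip options.

```python
import os

# Assumption: the pin is taken from the pip log in this thread, not from the workflows.
PINNED_IREE_TURBINE = "3.0.0"


def pip_install_args(event_name: str) -> list[str]:
    """Return pip arguments for installing iree-turbine on a given trigger."""
    if event_name == "schedule":
        # Scheduled (nightly) runs: allow pre-releases and take the newest one.
        return ["install", "--upgrade", "--pre", "iree-turbine"]
    # pull_request, push, and anything else: stay on the pinned version.
    return ["install", f"iree-turbine=={PINNED_IREE_TURBINE}"]


if __name__ == "__main__":
    # Default to the pinned path when run outside GitHub Actions.
    event = os.environ.get("GITHUB_EVENT_NAME", "pull_request")
    print("pip " + " ".join(pip_install_args(event)))
```

Keeping the pin in one constant means a version bump is a single-line change that goes through review like any other PR.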

@archana-ramalingam archana-ramalingam restored the update-perplexity-ci-install branch November 26, 2024 17:43
@archana-ramalingam
Collaborator Author

Makes sense. But I think we can at least use the latest stable release instead of a pinned version (which needs to be updated every time). That way, pre-submits use stable to unblock PRs and track shark-ai regressions, and nightly runs test the pre-release. Reopening this.

@archana-ramalingam
Collaborator Author

@ScottTodd As you can see, the pre-submit CIs seem to be using different iree-turbine/compiler versions even though the workflow YAML uses the same command, possibly because of a cached environment.
What's a reliable way to pull the latest stable IREE release? Or do we just pin to the latest known-good version? If the latter, we need a schedule for updating the known-good IREE version across all pre-submits.

@ScottTodd
Member

> @ScottTodd As you can see, the pre-submit CIs seem to be using different iree-turbine/compiler versions even though the workflow YAML uses the same command, possibly because of a cached environment. What's a reliable way to pull the latest stable IREE release? Or do we just pin to the latest known-good version? If the latter, we need a schedule for updating the known-good IREE version across all pre-submits.

Can you link specific logs? This PR now is changing much more than originally stated.

I'd like to highlight two things:

@archana-ramalingam
Collaborator Author

archana-ramalingam commented Nov 27, 2024

> > @ScottTodd As you can see, the pre-submit CIs seem to be using different iree-turbine/compiler versions even though the workflow YAML uses the same command, possibly because of a cached environment. What's a reliable way to pull the latest stable IREE release? Or do we just pin to the latest known-good version? If the latter, we need a schedule for updating the known-good IREE version across all pre-submits.
>
> Can you link specific logs? This PR now is changing much more than originally stated.
>
> I'd like to highlight two things:

Agreed, the PR has grown; the intention was to align all pre-submits.
For context, though, here are the logs for the inconsistent installation (probably linked to PR 19305). The perplexity CI installs pre-release versions but does so consistently, whereas the benchmark CI installs inconsistent versions, probably because of cached installs.

@archana-ramalingam archana-ramalingam changed the title from "[sharktank] Update perplexity CI install" to "[sharktank] Update shark-ai CIs with latest install" Nov 28, 2024
Member

@ScottTodd ScottTodd left a comment

Parts of this are a step in the right direction, but I'm hesitant to approve: there are several outstanding issues, and I have a much deeper refactoring of the package setup steps in these workflows in progress that will get at the root causes of those issues:

  • https://github.com/nod-ai/shark-ai/actions/runs/12130121459/job/33819823958?pr=609#step:5:175 attempts to install iree-turbine-3.0.0, but because the runners are persistent and the workflows neither clean up their working directories nor use virtual environments, packages that are already installed conflict with it:
    Downloading iree_turbine-3.0.0-py3-none-any.whl (274 kB)
    Installing collected packages: iree-turbine
      Attempting uninstall: iree-turbine
        Found existing installation: iree-turbine 3.1.0
        Uninstalling iree-turbine-3.1.0:
          Successfully uninstalled iree-turbine-3.1.0
    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    shark-ai 3.0.0 requires iree-base-runtime==3.0.*, but you have iree-base-runtime 3.1.0rc20241202 which is incompatible.
    shark-ai 3.0.0 requires shortfin==3.0.0, but you have shortfin 3.1.0.dev0 which is incompatible.
    Successfully installed iree-turbine-3.0.0
    
  • The iree-turbine source install should be replaced with nightly packages (Start publishing nightly Python packages iree-org/iree-turbine#305).
  • The workflows that install sharktank/shortfin from source builds to run integration tests should use prebuilt packages instead, whether dev, nightly, or stable (Rework GitHub Actions workflows to build packages --> test packages #584)
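To make the conflict in the pip output above concrete, here is a minimal sketch that compares `==` requirements against what is already on a runner. The helper `find_conflicts` is hypothetical, and it uses shell-style wildcard matching from `fnmatch` as a stand-in for pip's real version matching, so it only handles exact pins and trailing `.*` patterns. The package names and versions are copied from the log.

```python
from fnmatch import fnmatchcase


def find_conflicts(requirements, installed):
    """Return (package, wanted, found) for each '==' requirement that is not met.

    Simplification: 'wanted' is matched as a wildcard pattern, which covers
    exact pins ("3.0.0") and prefix pins ("3.0.*") but not full PEP 440 rules.
    """
    conflicts = []
    for package, wanted in requirements.items():
        found = installed.get(package)
        if found is None or not fnmatchcase(found, wanted):
            conflicts.append((package, wanted, found))
    return conflicts


# Requirements of shark-ai 3.0.0 as reported in the pip error.
shark_ai_requires = {"iree-base-runtime": "3.0.*", "shortfin": "3.0.0"}

# Leftover packages on the persistent runner, plus the fresh iree-turbine install.
installed = {
    "iree-base-runtime": "3.1.0rc20241202",
    "shortfin": "3.1.0.dev0",
    "iree-turbine": "3.0.0",
}

for package, wanted, found in find_conflicts(shark_ai_requires, installed):
    print(f"{package}: requires =={wanted}, found {found}")
```

A check like this only reports the problem; starting each job from a clean virtual environment removes the stale packages in the first place.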

@ScottTodd
Member

My first bullet point there should be addressed with #640. We could rebase this on top once that lands. I don't have a very clear timeline yet for the two other items.

@archana-ramalingam
Collaborator Author

> My first bullet point there should be addressed with #640. We could rebase this on top once that lands. I don't have a very clear timeline yet for the two other items.

Great. I believe the only thing this PR depends on to work as expected is #640, which resolves the unstable behavior. So when we install iree-turbine==3.0.0, it will also get us iree-base-runtime==3.0.0 and iree-base-compiler==3.0.0. Points 2 and 3 can be addressed in a separate PR.
The reason for urgency is that every time an IREE regression happens, unpinned IREE breaks pre-submits and blocks PR merges, like today's regression. Pre-submits should only catch sharktank regressions, and nightly should catch IREE and sharktank+IREE ones.
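A pre-submit job could confirm that expectation right after the install step. The sketch below assumes the three versions named in this comment are the intended set; `check_versions` and `installed_version` are hypothetical helpers built on the standard library's `importlib.metadata`, not part of any existing workflow.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumption: the expected set mirrors the versions named in this comment.
EXPECTED = {
    "iree-turbine": "3.0.0",
    "iree-base-runtime": "3.0.0",
    "iree-base-compiler": "3.0.0",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package that does not match."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Demo with a stand-in lookup so the sketch runs without IREE installed.
stale_runner = {"iree-turbine": "3.0.0", "iree-base-runtime": "3.1.0rc20241202"}
print(check_versions(EXPECTED, lookup=stale_runner.get))
```

In a workflow, calling `check_versions(EXPECTED)` with the default lookup and failing the step on a non-empty result would surface a stale environment before any tests run.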

@ScottTodd
Member

> The reason for urgency is that every time an IREE regression happens, unpinned IREE breaks pre-submits and blocks PR merges, like today's regression. Pre-submits should only catch sharktank regressions, and nightly should catch IREE and sharktank+IREE ones.

I'm more than aware - re-architecting the workflows so this class of issues is removed has been my main priority these past few weeks.


3 participants