
Rework GitHub Actions workflows to build packages --> test packages #584

Open
ScottTodd opened this issue Nov 21, 2024 · 10 comments

@ScottTodd (Member)
These workflows all currently build shortfin from source, duplicating the boilerplate needed to fetch dependencies in a carefully balanced order.

For workflows that run on pull_request and push triggers, we can add a build_dev_packages job similar to https://github.com/nod-ai/shark-ai/blob/main/.github/workflows/build_packages.yml that builds the packages and then have those workflows install artifacts from that job. For workflows that run on schedule, we can either do the same thing, or we can use the already built nightly packages (docs: https://github.com/nod-ai/shark-ai/blob/main/docs/nightly_releases.md).

In both cases, the complexity of package building will be isolated to a few package-oriented workflows and we'll gain confidence that the test jobs are compatible with our releases, so users will be able to use them without needing to build from source either.
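A minimal sketch of that split, as a hedged illustration rather than the repo's actual workflow — the job and artifact names (`build_dev_packages`, `dev_packages`) and the assumption that `build_linux_package.sh` honors an `OUTPUT_DIR` environment variable are all mine:

```yaml
# Build the dev wheels once, then let test jobs download and install
# the artifacts instead of rebuilding shortfin from source.
jobs:
  build_dev_packages:
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - name: Build wheels
        env:
          OUTPUT_DIR: ${{ github.workspace }}/wheelhouse
        run: ./shortfin/build_tools/build_linux_package.sh
      - uses: actions/upload-artifact@v4
        with:
          name: dev_packages
          path: wheelhouse/*.whl

  test_shortfin:
    needs: build_dev_packages
    runs-on: ubuntu-24.04
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: dev_packages
          path: wheelhouse
      - name: Install packages and run tests
        run: |
          python -m pip install wheelhouse/*.whl
          pytest shortfin/
```

Test jobs then need no build toolchain at all, only Python and the downloaded wheels.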

Once we have something working, we can optimize the package build to improve CI turnaround times:

  • cache pip dependencies
  • cache CMake builds (or the entire build - see what IREE does)
  • cache Dockerfiles
  • skip the tracy build variant (see the sketch after this list):

    ```bash
    function build_shortfin() {
      export SHORTFIN_ENABLE_TRACING=ON
      python -m pip wheel --disable-pip-version-check -v -w "${OUTPUT_DIR}" "${REPO_ROOT}/shortfin"
    }
    ```

    In shark-ai/shortfin/setup.py, the tracing variant is controlled by an environment variable:

    ```python
    ENABLE_TRACY = get_env_boolean("SHORTFIN_ENABLE_TRACING", False)
    ```

    shark-ai/shortfin/setup.py, lines 260 to 263 in 06599e9:

    ```python
    try:
        self.build_default_configuration()
        if ENABLE_TRACY:
            self.build_tracy_configuration()
    ```
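A minimal sketch of skipping that variant for dev builds, assuming `get_env_boolean` treats "OFF" as false (the step name and wheel output directory are illustrative):

```yaml
# Dev builds export SHORTFIN_ENABLE_TRACING=OFF so setup.py only
# builds the default configuration, skipping the tracy variant.
- name: Build shortfin wheel (default configuration only)
  env:
    SHORTFIN_ENABLE_TRACING: "OFF"
  run: python -m pip wheel --disable-pip-version-check -v -w wheelhouse shortfin/
```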

See https://github.com/iree-org/iree/blob/main/.github/workflows/pkgci.yml for the shape of this sort of setup in IREE.

@ScottTodd (Member, Author)

Proof of concept migration of one workflow: #625. This added 1 minute to the total workflow time but has a few scaling benefits. Going to let that sit for a bit and run some more experiments.

The main time sink is installing Python packages (even if already downloaded/cached). Workflows that use persistent self-hosted runners currently don't use venvs, so they risk having packages left over from previous jobs and either installing conflicting versions of packages or failing to install the requested versions entirely. The new setup_venv.py code (forked from IREE) installs the dev packages and requirements sequentially, but we might be able to optimize that a bit while still retaining predictability.

@stellaraccident (Contributor)

You may want to look at using uv as a pip replacement when latency is a concern. I dislike forked tool flows, but it seems like a lot of folks are having a good experience there.

@ScottTodd (Member, Author)

Recipes for using uv: https://github.com/astral-sh/uv?tab=readme-ov-file#a-pip-compatible-interface. Definitely worth trying out.

@marbre (Collaborator) commented Nov 28, 2024

If you want to build a package, you want `uv build`, not `uv pip`. The equivalent of `python -m pip wheel -v -w wheeldir .` would be `uv build --wheel -v -o wheeldir .`. I would say uv is definitely an alternative, especially as `uv venv` is a really nice replacement for plain venvs. Furthermore, uv can install different Python versions, so it also replaces pyenv. However, I faced some issues with it in the past; it's certainly worth giving it another try.
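As a workflow step, that command might look like the following (the step name and output directory are illustrative, adapted from the command above):

```yaml
# Build the shortfin wheel with uv instead of pip.
- name: Build shortfin wheel with uv
  run: uv build --wheel -v -o wheelhouse shortfin/
```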

@ScottTodd (Member, Author)

The bottleneck I'd like to optimize is the 2m30s spent installing packages (including deps), not the 1m30s building the shortfin/sharktank/shark-ai packages. See logs at https://github.com/nod-ai/shark-ai/actions/runs/12059301876/job/33628235219?pr=625#step:5:35 :

```
Wed, 27 Nov 2024 23:02:34 GMT
Installing collected packages: mpmath, typing-extensions, sympy, networkx, MarkupSafe, fsspec, filelock, jinja2, torch
Wed, 27 Nov 2024 23:03:02 GMT
Successfully installed MarkupSafe-2.1.5 filelock-3.13.1 fsspec-2024.6.1 jinja2-3.1.4 mpmath-1.3.0 networkx-3.3 sympy-1.13.1 torch-2.3.0+cpu typing-extensions-4.12.2
...
Wed, 27 Nov 2024 23:03:13 GMT
Installing collected packages: pytz, xxhash, urllib3, tzdata, tqdm, sniffio, six, safetensors, regex, pyyaml, pydantic-core, pyarrow, propcache, packaging, numpy, multidict, idna, h11, frozenlist, dill, click, charset-normalizer, certifi, attrs, annotated-types, aiohappyeyeballs, yarl, uvicorn, requests, python-dateutil, pydantic, multiprocess, iree-base-runtime, iree-base-compiler, gguf, anyio, aiosignal, starlette, pandas, iree-turbine, huggingface-hub, aiohttp, tokenizers, fastapi, transformers, datasets
Wed, 27 Nov 2024 23:04:07 GMT
Successfully installed aiohappyeyeballs-2.4.3 aiohttp-3.11.8 aiosignal-1.3.1 annotated-types-0.7.0 anyio-4.6.2.post1 attrs-24.2.0 certifi-2024.8.30 charset-normalizer-3.4.0 click-8.1.7 datasets-3.0.1 dill-0.3.8 fastapi-0.112.2 frozenlist-1.5.0 gguf-0.10.0 h11-0.14.0 huggingface-hub-0.22.2 idna-3.10 iree-base-compiler-3.0.0 iree-base-runtime-3.0.0 iree-turbine-3.0.0 multidict-6.1.0 multiprocess-0.70.16 numpy-1.26.4 packaging-24.2 pandas-2.2.3 propcache-0.2.0 pyarrow-18.1.0 pydantic-2.10.2 pydantic-core-2.27.1 python-dateutil-2.9.0.post0 pytz-2024.2 pyyaml-6.0.2 regex-2024.11.6 requests-2.32.3 safetensors-0.4.5 six-1.16.0 sniffio-1.3.1 starlette-0.38.6 tokenizers-0.19.1 tqdm-4.67.1 transformers-4.40.0 tzdata-2024.2 urllib3-2.2.3 uvicorn-0.30.6 xxhash-3.5.0 yarl-1.18.0
```

The build steps can be optimized too, but 1m30s on a standard runner with a (very low) 40% cache hit rate is pretty respectable already.

@renxida (Contributor) commented Dec 3, 2024

Very hyped about how this improves our CI dependencies, especially the part where we can pin IREE versions so that most CI tasks don't suffer from IREE regressions.

@ScottTodd (Member, Author)

Switching from pip to uv saved about 1 minute of job time on #625. Probably worth it given the relative time scales here.

@ScottTodd (Member, Author)

For uv, we can also use https://github.com/astral-sh/setup-uv
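A minimal sketch of what that could look like in a job — the action version tag and the cache option are assumptions to check against the action's documentation:

```yaml
# Install uv via the setup-uv action, then use it in place of pip.
- name: Set up uv
  uses: astral-sh/setup-uv@v4
  with:
    enable-cache: true
- name: Create venv and install requirements
  run: |
    uv venv .venv
    source .venv/bin/activate
    uv pip install -r requirements.txt
```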

ScottTodd added a commit that referenced this issue Dec 4, 2024
Progress on #584.

This is expected to save around 10-20 seconds when building packages on
standard GitHub-hosted runners:

```
Tue, 03 Dec 2024 11:07:18 GMT   [372/380] Linking CXX shared library src/libshortfin.so.3.1.0
Tue, 03 Dec 2024 11:07:18 GMT   [373/380] Creating library symlink src/libshortfin.so.1 src/libshortfin.so
Tue, 03 Dec 2024 11:07:23 GMT   [374/380] Linking CXX executable src/shortfin/support/shortfin_support_test
Tue, 03 Dec 2024 11:07:23 GMT   [375/380] Linking CXX executable src/shortfin/array/shortfin_array_test
Tue, 03 Dec 2024 11:07:36 GMT   [376/380] Building CXX object python/CMakeFiles/shortfin_python_extension.dir/array_host_ops.cc.o
Tue, 03 Dec 2024 11:07:45 GMT   [377/380] Linking CXX shared module python/_shortfin_default/lib.cpython-311-x86_64-linux-gnu.so
```
(from these logs: https://github.com/nod-ai/shark-ai/actions/runs/12138320160/job/33843543941#step:6:738)

IREE also disables its tests when building packages:
* https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/compiler/setup.py#L260
* https://github.com/iree-org/iree/blob/cbb11f220c69e0106dbfd1533a00237c3a74e7e3/runtime/setup.py#L278
ScottTodd added a commit that referenced this issue Dec 4, 2024
The `mi300-sdxl-kernel` runner has been offline for a few weeks, so runs
of this workflow have been queued:
https://github.com/nod-ai/shark-ai/actions/workflows/ci-sdxl.yaml. This
`mi300x-4` runner is probably fit to run this workflow.

Also refactored the workflow to not use explicit build steps, which
loosens the requirements on installed software and helps make progress
on #584.
ScottTodd added a commit that referenced this issue Dec 5, 2024
Many of these workflows are using persistent self-hosted runners, so it looks like they have been reusing the same system-wide Python environment between workflow runs (plus a layer of caching on top). This switches to using venvs at `${{ github.workspace }}/.venv` that should be ephemeral, giving us more explicit control over which packages are installed.

More work is planned as part of #584 to refactor these workflows further - replacing package install code like `pip install --no-compile -r requirements.txt -r sharktank/requirements-tests.txt -e sharktank/` with a `setup_venv.py` script that uses dev/nightly/stable packages (from an appropriate source).

This also disables pip caching, since that is not directly compatible
with using venvs. As a result, some workflows are slower now, but they
are more predictable in what they install. Good reading for adding
caching back:
* https://adamj.eu/tech/2023/11/02/github-actions-faster-python-virtual-environments/
* https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages
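Along the lines of those articles, a minimal sketch of restoring a download cache for venv-based jobs, keyed on the requirements files — the cache path and key scheme are illustrative, not the repo's actual workflow:

```yaml
# Restore pip's download cache; the venv itself stays ephemeral.
- uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: pip-${{ runner.os }}-${{ hashFiles('requirements*.txt') }}
- name: Create venv and install requirements
  run: |
    python -m venv ${{ github.workspace }}/.venv
    source ${{ github.workspace }}/.venv/bin/activate
    pip install --no-compile -r requirements.txt
```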
ScottTodd added a commit that referenced this issue Dec 5, 2024
#646)

Splitting this off from #589 to
make progress on #584.

Tested with
```
CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
Cacheable calls:   626 / 636 (98.43%)
  Hits:              2 / 626 ( 0.32%)
    Direct:          2 /   2 (100.0%)
    Preprocessed:    0 /   2 ( 0.00%)
  Misses:          624 / 626 (99.68%)
Uncacheable calls:  10 / 636 ( 1.57%)
Local storage:
  Cache size (GB): 0.1 / 2.0 ( 3.10%)
  Hits:              2 / 626 ( 0.32%)
  Misses:          624 / 626 (99.68%)

+ ccache --show-stats
ccache stats:
Cacheable calls:   1252 / 1272 (98.43%)
  Hits:             550 / 1252 (43.93%)
    Direct:         550 /  550 (100.0%)
    Preprocessed:     0 /  550 ( 0.00%)
  Misses:           702 / 1252 (56.07%)
Uncacheable calls:   20 / 1272 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 4.11%)
  Hits:             550 / 1252 (43.93%)
  Misses:           702 / 1252 (56.07%)

+ ccache --show-stats
Cacheable calls:   1878 / 1908 (98.43%)
  Hits:            1098 / 1878 (58.47%)
    Direct:        1098 / 1098 (100.0%)
    Preprocessed:     0 / 1098 ( 0.00%)
  Misses:           780 / 1878 (41.53%)
Uncacheable calls:   30 / 1908 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.12%)
  Hits:            1098 / 1878 (58.47%)
  Misses:           780 / 1878 (41.53%)

CACHE_DIR=/tmp/shortfin/ sudo -E ./shortfin/build_tools/build_linux_package.sh

+ ccache --show-stats
ccache stats:
Cacheable calls:   3756 / 3816 (98.43%)
  Hits:            2820 / 3756 (75.08%)
    Direct:        2820 / 2820 (100.0%)
    Preprocessed:     0 / 2820 ( 0.00%)
  Misses:           936 / 3756 (24.92%)
Uncacheable calls:   60 / 3816 ( 1.57%)
Local storage:
  Cache size (GB):  0.1 /  2.0 ( 5.19%)
  Hits:            2820 / 3756 (75.08%)
  Misses:           936 / 3756 (24.92%)
```

So we have multiple configurations getting built (Python versions, tracing enabled/disabled), but we still get a reasonable number of cache hits. Definitely room to improve there, but better than nothing.
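To carry that ccache across CI runs, a minimal sketch using actions/cache — assuming, as in the test above, that `build_linux_package.sh` places its ccache directory under `CACHE_DIR`; the path and key scheme are illustrative:

```yaml
# Restore the ccache directory before the package build, falling back
# to the most recent cache for this OS when there is no exact match.
- uses: actions/cache@v4
  with:
    path: ${{ github.workspace }}/.cache
    key: ccache-${{ runner.os }}-${{ github.sha }}
    restore-keys: ccache-${{ runner.os }}-
- name: Build packages with a warm ccache
  env:
    CACHE_DIR: ${{ github.workspace }}/.cache
  run: ./shortfin/build_tools/build_linux_package.sh
```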
@ScottTodd (Member, Author)

I've landed some incremental changes that prepare us for package-based workflows, but I've been a bit skeptical of the complexity that they will introduce.

Here's another data point as motivation: installing into a fresh venv, with package downloads already cached on the system, the "Install pip deps" step at https://github.com/nod-ai/shark-ai/actions/runs/12206496774/job/34056060889#step:5:32 took 9m20s:

  • 2 minutes for pip install --no-compile -r pytorch-cpu-requirements.txt
  • 6 minutes for pip install --no-compile -r requirements.txt -e sharktank/ shortfin/
  • 20 seconds for pip install --no-compile -f https://iree.dev/pip-release-links.html --src deps -e "git+https://github.com/iree-org/iree-turbine.git#egg=iree-turbine"
  • 15 seconds for pip install -f https://iree.dev/pip-release-links.html --upgrade --pre iree-base-compiler iree-base-runtime

ScottTodd added a commit that referenced this issue Dec 9, 2024
This simplification will help with
#584.

Nightly releases of iree-turbine are now being built thanks to
iree-org/iree-turbine#314 and published at the
same index as the other IREE packages thanks to
iree-org/iree#19391.
@ScottTodd (Member, Author)

The pip install step isn't consistently that slow. Recent runs took ~4m30s instead of that 9m+.
Sample workflow run history: https://github.com/nod-ai/shark-ai/actions/workflows/ci-shark-ai.yml?query=branch%3Amain

For workflows that only use sharktank and not shortfin, the setup is already fast enough: 27s at https://github.com/nod-ai/shark-ai/actions/runs/12243785411/job/34154097452, for example.

The install steps are simpler now, so I'm skeptical about going "full pkgci" across all workflows from a complexity point of view. For workflows that run integration tests using shortfin, a dedicated package build job will make more sense as the build grows more complex... like adding a Rust dependency for tokenizers.

ScottTodd added a commit that referenced this issue Dec 10, 2024
Progress on #584. ~~Depends on
#666 (the first commit).~~

This refactors the `build_packages.yml` workflow so it can be used
via `workflow_call` as part of a "pkgci" setup, as an alternative to
creating a new `pkgci_build_packages.yml` workflow as originally
proposed in #589. This lets us
reuse the same workflow for building stable, nightly, and dev packages,
all across the same matrix of Python versions and operating systems.
Package builds take about 2 minutes (wall time) across the full matrix,
so we might as well build them all, instead of artificially constraining
ourselves to a subset like only Linux on Python 3.11.

Triggers for the workflow are now this:

Trigger | Scenario | Build type(s)
-- | -- | --
`schedule` | Nightly pre-release build | `rc`
`workflow_dispatch` | Workflow testing, manual releasing | `rc` default, `stable` and `dev` possible
`workflow_call` | Pull request or push "pkgci" dev builds | `dev` default, `stable` and `rc` possible

With this workflow behavior:

Build type | Version suffix | Cache enabled? | Tracing enabled? | Pushes to release?
-- | -- | -- | -- | --
`stable` | None | No | Yes | No
`rc` | `rcYYYYMMDD` | No | Yes | Yes
`dev` | `.dev0+${{ github.sha }}` | Yes | No | No
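A minimal sketch of the trigger wiring described above — the input name, defaults, and cron schedule are illustrative, not necessarily the actual workflow:

```yaml
# One reusable workflow serving nightly, manual, and pkgci dev builds.
on:
  schedule:
    - cron: "0 11 * * *"   # nightly rc builds
  workflow_dispatch:
    inputs:
      build_type:
        type: string
        default: rc        # stable and dev also possible
  workflow_call:
    inputs:
      build_type:
        type: string
        default: dev       # stable and rc also possible
```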

Tested over at
https://github.com/ScottTodd/shark-ai/actions/workflows/build_packages.yml.
Example run:
https://github.com/ScottTodd/shark-ai/actions/runs/12245900071 (warm
cache)