We've reached the point where GitHub Actions workflow performance is a bottleneck and it is time for a more principled approach to runner configuration and workflow design.
⏱️ Current performance
Build Linux Packages
Workflows take 50-60 minutes, with 5 minutes of overhead. Cache hit rates can drop to 75%.
Based on prior experience with LLVM-based projects and ccache, cache hit rate is not linearly related to build time: build times tend to fall off a performance cliff once the hit rate drops below 90-95%.
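The cliff effect can be illustrated with back-of-envelope arithmetic. All numbers below are assumptions for illustration (translation unit count, per-compile cost, parallelism), not measurements from these workflows:

```shell
# Rough illustration of why hit rate matters nonlinearly in practice:
# with assumed numbers (5000 translation units, ~20 s per recompile,
# 32-way parallelism, near-zero cost on a cache hit), a modest drop in
# hit rate multiplies the remaining compile work.
tus=5000; secs_per_miss=20; jobs=32
for hit_pct in 99 95 90 75; do
  misses=$(( tus * (100 - hit_pct) / 100 ))
  wall_sec=$(( misses * secs_per_miss / jobs ))
  echo "${hit_pct}% hits -> ${misses} recompiles, ~${wall_sec}s of wall-clock compile time"
done
```

Going from 99% to 75% hits is not a 24% slowdown; it is roughly 25x more recompilation.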
Build Windows Packages
Workflows take 30-40 minutes, with 12 minutes of overhead. Cache hit rates are near 0%.
📈 Future usage to plan for
Build/test pipelines (passing artifacts between jobs, storing logs/artifacts in the cloud)
Release builds
📖 High level areas to design for
Do less work
Define fine-grained build/test slices and only run the builds/tests relevant to a given change. Also allow on-demand "try jobs" for any selection
Re-do less work
Tune cache settings such that build actions get sufficient cache hit rates for good performance
Prefetch Dockerfiles, system deps, Python packages, git sources, and other data needed on build and test machines
Do work faster
Use larger runner machines / VMs
Use faster storage disks (e.g. cloud providers have several tiers of SSDs to choose from)
Parallelize build actions on a single machine or across multiple machines, remove bottlenecks
✏️ Details for optimization areas
1. Doing less work
With a well-tuned cache, the build system should perform only minimal compilation when unrelated files are changed (except for changes to base deps that truly affect the full graph). As we move beyond a single-machine build, though, we'll need a way to determine which follow-up builds, packages, tests, etc. to run.
If we build packages for multiple GPU families, which should a given workflow run include?
How does a developer request that their changes run a different subset of builds/tests?
If a change to one component (e.g. rocRAND) is made, what builds/tests should run?
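One lightweight way to answer these questions is a path-based selection step that runs early in the workflow and emits which slices to run. The component paths and slice names below are hypothetical placeholders, not this project's actual layout:

```shell
# Hypothetical sketch: map changed file paths to build/test slices.
# Paths (docs/, rocRAND/, cmake/) and slice names are assumptions
# chosen for illustration only.
set -eu

# Reads newline-separated changed paths on stdin (e.g. from
# `git diff --name-only $BASE...HEAD`); prints one slice name per line.
select_slices() {
  local run_linux=false run_windows=false run_rocrand=false f
  while IFS= read -r f; do
    case "$f" in
      docs/*) ;;                                  # docs-only: no builds
      rocRAND/*) run_linux=true; run_rocrand=true ;;
      cmake/*|CMakeLists.txt) run_linux=true; run_windows=true ;;
      *) run_linux=true ;;
    esac
  done
  $run_linux   && echo "build-linux-packages"
  $run_windows && echo "build-windows-packages"
  $run_rocrand && echo "test-rocrand"
  return 0
}
```

A workflow job could pipe the selected slices into step outputs and gate downstream jobs on them; the same function could serve "try jobs" by accepting an explicit slice list instead of a diff.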
2. Re-doing less work
ccache (or sccache, or some other similar tool) has a number of options and ways to squeeze more performance out of it:
Set compiler flags and file paths to be ccache-friendly
Local vs remote [shared] cache
Cache key selection for save/restore. This mostly matters for local caches with GitHub Actions caches, where you restore a single cache entry and want it to match your source tree as closely as possible. With a large remote cache, we can generally expect that cache entries from different epochs (i.e. LLVM commits before/after which the build is substantially different) will coexist.
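As a starting point, a ccache environment along these lines aims at relocatable, path-independent cache hits. Option names are real ccache settings, but availability varies by version (remote storage is ccache 4.7+), and the specific values here are untested assumptions to tune, not a known-good configuration:

```shell
# Sketch of a ccache setup aimed at cross-checkout cache hits.
# Values are starting-point assumptions; verify against the ccache
# version actually installed on the runners.

# Keep absolute paths out of cache keys so checkouts in different
# directories (and different runners) can share entries.
export CCACHE_BASEDIR="$PWD"
export CCACHE_NOHASHDIR=1

# Tolerate timestamp/macro differences that would otherwise force misses.
export CCACHE_SLOPPINESS="time_macros,include_file_mtime,include_file_ctime"

# Identify the compiler by mtime+size instead of invoking it each time.
export CCACHE_COMPILERCHECK="mtime"

export CCACHE_MAXSIZE="20G"

# Optional shared remote cache (ccache >= 4.7); URL is a placeholder.
# export CCACHE_REMOTE_STORAGE="http://ccache.example.internal/cache"
```

The build would then use `-DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache` so cache-friendliness doesn't depend on compiler wrapper symlinks.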
Dependencies can be pre-downloaded:
Dockerfiles
System dependencies (if not in Docker)
Python packages (pip cache)
Source files (git repositories, submodules)
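These prefetch steps could be baked into a runner image or run as a scheduled warm-up job. A sketch, where every image name, package list, and repository URL is a placeholder rather than this project's real one:

```shell
# Placeholder prefetch helpers for a runner image / warm-up job.
# All names and URLs below are hypothetical examples.

prefetch_docker() {
  # Pull the base image ahead of time so CI builds only fetch new layers.
  docker pull ghcr.io/example/build-image:latest
}

prefetch_python() {
  # Warm the pip cache without installing anything into the environment.
  pip download -r requirements.txt -d "$HOME/.cache/pip-prefetch"
}

prefetch_git() {
  # Maintain a bare mirror that later checkouts can reuse (e.g. via
  # `git clone --reference`) so full history isn't refetched per run.
  if [ ! -d /opt/mirrors/repo.git ]; then
    git clone --mirror https://github.com/example/repo.git /opt/mirrors/repo.git
  fi
  git -C /opt/mirrors/repo.git fetch --all --prune
}
```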
3. Doing work faster
We have our choice of cloud VMs and on-prem machines. For cloud VMs, we can choose between Azure and AWS.
We generally want as many CPU cores as possible, though not all workflow actions can effectively utilize that parallelism
We generally want as fast a disk as possible. Some cloud providers only make high performance disks available separately from VMs
We generally want enough storage for all local build artifacts (200GB+?). Going over a bit is fine but could be expensive in aggregate
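Since not every action scales with core count, a runner-sizing step could bound parallelism by both CPU and memory rather than blindly using `-j$(nproc)`. The ~2 GiB-per-job figure below is an assumption to tune (LLVM-scale link steps can need far more), not a measurement:

```shell
# Rough sizing sketch (Linux): pick a parallel job count bounded by
# both core count and available memory. The 2 GiB-per-job budget is
# an illustrative assumption, not a measured requirement.
ncpu=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
mem_gib=$(awk '/MemTotal/ {printf "%d", $2 / 1048576}' /proc/meminfo)
mem_jobs=$(( mem_gib / 2 ))                      # ~2 GiB per job
jobs=$(( ncpu < mem_jobs ? ncpu : mem_jobs ))
jobs=$(( jobs > 0 ? jobs : 1 ))
echo "Using -j$jobs (cores=$ncpu, mem=${mem_gib}GiB)"
```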
📋 Task list
We can take incremental steps towards better performance while designing for the longer term where we'll have more workflows.
Get ccache (or sccache) working reliably on Linux (over 90% cache hit rate; remote storage instead of local GitHub Actions caches?)
Get ccache (or sccache) working reliably on Windows (it works locally already; turn on ccache debugging and check the logs)
Check existing cloud project configurations for runners and search for settings to tweak (e.g. SSD types, CPU machine instance types)
Identify targeted/granular builds that we'll want to select between based on modified files/projects
Add monitoring and track build performance over time (more important as we onboard more developers/projects)
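For the monitoring item, a small per-run check could surface the hit rate directly in CI logs. This sketch assumes ccache's machine-readable stats output (`ccache --print-stats`, ccache 4.4+, tab-separated key/value lines); counter names may differ across versions:

```shell
# Sketch of a CI-log hit-rate check. Assumes `ccache --print-stats`
# (ccache >= 4.4) counter names; adjust for the installed version.
set -eu

compute_hit_rate() {
  # stdin: output of `ccache --print-stats` (tab-separated key/value).
  awk -F'\t' '
    $1 == "direct_cache_hit" || $1 == "preprocessed_cache_hit" { hits += $2 }
    $1 == "cache_miss"                                         { miss += $2 }
    END {
      total = hits + miss
      if (total == 0) { print "no compilations recorded"; exit }
      printf "hit rate: %.1f%% (%d/%d)\n", 100 * hits / total, hits, total
    }'
}

# Usage in a workflow step:
#   ccache --print-stats | compute_hit_rate
```

A follow-up could fail (or annotate) the job when the rate dips below the 90% target from the task above.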
🔎 Build Linux Packages details:
Sample run: https://github.com/nod-ai/TheRock/actions/runs/13637190064/job/38118749344
Total time: 50m33s
🔎 Build Windows Packages details:
Sample run: https://github.com/nod-ai/TheRock/actions/runs/13637190064/job/38118748955
Total time: 34m38s (not building all packages yet)