Optimize CI/CD builders #132

ScottTodd opened this issue Mar 4, 2025 · 0 comments

We've reached the point where GitHub Actions workflow performance is a bottleneck and it is time for a more principled approach to runner configuration and workflow design.

⏱️ Current performance

Build Linux Packages

Workflows take 50-60 minutes, with 5 minutes of overhead. Cache hit rates can drop to 75%.


Sample run: https://github.com/nod-ai/TheRock/actions/runs/13637190064/job/38118749344

  • Total time: 50m33s

    • Initialize containers: 46s
    • Restore cache: 8s
    • Checkout repo, fetch sources: 3m40s
    • Build packages: 45m19s
    • Save cache: 22s
  • ccache report:

    Cacheable calls:   324961 / 327035 (99.37%)
      Hits:            253175 / 324961 (77.91%)    <----
        Direct:        229324 / 253175 (90.58%)
        Preprocessed:   23851 / 253175 ( 9.42%)
      Misses:           71786 / 324961 (22.09%)
    Uncacheable calls:   1709 / 327035 ( 0.52%)
    Errors:               365 / 327035 ( 0.11%)
    Local storage:
      Cache size (GB):    0.7 /    0.7 (99.98%)    <----
      Cleanups:          4387
      Hits:            253175 / 324961 (77.91%)
      Misses:           71786 / 324961 (22.09%)
    

Notes:

  • 5 minutes of overhead, 10% of total time.
  • Based on prior experience with LLVM-based projects and ccache, cache hit rate is not linearly related to build time: a hit rate below 90-95% usually falls off a performance cliff.

Build Windows Packages

Workflows take 30-40 minutes, with 12 minutes of overhead. Cache hit rates are near 0%.


Sample run: https://github.com/nod-ai/TheRock/actions/runs/13637190064/job/38118748955

  • Total time: 34m38s (not building all packages yet)

    • Setup python: 1m4s
    • Install requirements: 20s
    • Restore cache: 12s
    • Checkout repo, fetch sources: 8m30s
    • Build: 22m47s
    • Save cache: 34s
    • Clean up build dir: 23s
  • ccache report:

    Cacheable calls:   8344 / 8618 (96.82%)
      Hits:              14 / 8344 ( 0.17%)    <----
        Direct:          14 /   14 (100.0%)
        Preprocessed:     0 /   14 ( 0.00%)
      Misses:          8330 / 8344 (99.83%)
    Uncacheable calls:  274 / 8618 ( 3.18%)
    Local storage:
      Cache size (GB):  0.8 /  0.7 (108.5%)    <----
      Cleanups:        4176
      Hits:              14 / 8344 ( 0.17%)
      Misses:          8330 / 8344 (99.83%)
    

Notes:

  • 12 minutes of overhead, 35% of total time.
  • The cache is not working on these runners, possibly related to how we share a persistent volume between multiple runners. A first debugging step is sketched below.
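
One way to investigate, assuming we keep ccache: turn on its debug logging in the Windows job and upload the logs for inspection. The environment variables are documented ccache options; the build command and paths here are illustrative.

    # Illustrative steps for diagnosing the ~0% hit rate on Windows.
    - name: Build with ccache debugging
      env:
        CCACHE_DEBUG: "1"                        # write per-object debug input/log files
        CCACHE_DEBUGDIR: ${{ github.workspace }}/ccache-debug
        CCACHE_LOGFILE: ${{ github.workspace }}/ccache.log
      run: cmake --build build                   # placeholder for the real build step
    - name: Upload ccache logs
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: ccache-debug-logs
        path: |
          ccache.log
          ccache-debug/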

📈 Future usage to plan for

  • Wider build matrix: many/all GPU families, multiple Python versions
  • Build/test pipelines (passing artifacts between jobs, storing logs/artifacts in the cloud)
  • Release builds

📖 High level areas to design for

  1. Do less work
    • Define fine-grained build/test slices and only run the relevant builds/tests for a given change. Also allow on-demand "try jobs" for any selection
  2. Re-do less work
    • Tune cache settings such that build actions get sufficient cache hit rates for good performance
    • Prefetch Dockerfiles, system deps, Python packages, git sources, and other data needed on build and test machines
  3. Do work faster
    • Use larger runner machines / VMs
    • Use faster storage disks (e.g. cloud providers have several tiers of SSDs to choose from)
    • Parallelize build actions on a single machine or across multiple machines, remove bottlenecks

✏️ Details for optimization areas

1. Doing less work

With a well-tuned cache, the build system should perform only minimal compilation when unrelated files are changed (except for changes to base dependencies that truly affect the full graph). Moving beyond a single-machine build, though, we'll want a solution for determining which follow-up builds, packages, and tests to run; one mechanism is sketched after the questions below.

  • If we build packages for multiple GPU families, which should a given workflow run include?
  • How does a developer request that their changes run a different subset of builds/tests?
  • If a change to one component (e.g. rocRAND) is made, what builds/tests should run?
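
As one concrete mechanism, GitHub Actions `paths` filters can do coarse per-workflow selection today, and `workflow_dispatch` inputs can serve as on-demand "try jobs". A minimal sketch, with hypothetical paths and a hypothetical gpu_families input (not our real layout):

    # Hypothetical trigger for the Linux package build: run it only when files
    # that feed into it change.
    name: Build Linux Packages
    on:
      pull_request:
        paths:
          - "cmake/**"
          - "external-builds/**"
          - ".github/workflows/build_linux_packages.yml"
      workflow_dispatch:
        inputs:
          gpu_families:
            description: "Comma-separated GPU families to build (on-demand try job)"
            required: false

Finer-grained selection (e.g. mapping changed files to individual components like rocRAND) would need logic inside the workflow rather than trigger-level filters.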

2. Re-doing less work

ccache (or sccache, or a similar tool) offers a number of options for squeezing out more performance (several of these knobs are sketched after the list):

  • https://ccache.dev/performance.html
  • https://ccache.dev/manual/4.10.2.html
  • Cache maximum size
  • Cache compression settings
  • Set compiler flags and file paths to be ccache-friendly
  • Local vs remote [shared] cache
  • Cache key selection for save/restore. This mostly matters for local caches used with GitHub Actions, where you restore a single cache entry and want it to be as close to your source tree as possible. With a large remote cache, we can generally expect cache entries from different epochs (i.e. LLVM commits before/after which the build is substantially different) to coexist.
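
For concreteness, here is a sketch of several of those knobs as job-level environment variables. The variable names are documented ccache options; the values are assumptions to iterate on, not validated settings.

    # Illustrative ccache tuning for a build job; values are starting points.
    env:
      CCACHE_DIR: ${{ github.workspace }}/.ccache
      CCACHE_MAXSIZE: 20G                        # reports above show a ~0.7G cache that is constantly evicting
      CCACHE_COMPRESS: "1"
      CCACHE_COMPRESSLEVEL: "5"                  # trades CPU for a smaller cache to save/restore
      CCACHE_BASEDIR: ${{ github.workspace }}    # rewrite absolute paths so entries are relocatable
      CCACHE_NOHASHDIR: "1"                      # keep differing working directories from poisoning keys
      # For a remote shared cache instead of per-runner local storage (hypothetical server):
      # CCACHE_REMOTE_STORAGE: "http://ccache-server.internal:8080|read-only=false"

Both ccache reports above show the cache at or over its 0.7GB limit with thousands of cleanups, so raising the maximum size is probably the first knob to try.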

Dependencies can be pre-downloaded (see the sketch after this list):

  • Dockerfiles
  • System dependencies (if not in Docker)
  • Python packages (pip cache)
  • Source files (git repositories, submodules)
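
With stock actions, much of this prefetching is built in; a minimal sketch, assuming hypothetical keys and paths:

    # Illustrative dependency caching steps; keys and paths are not our real ones.
    - uses: actions/setup-python@v5
      with:
        python-version: "3.12"
        cache: pip                               # caches pip downloads, keyed on requirements files
    - uses: actions/cache@v4
      with:
        path: .ccache
        key: ccache-${{ runner.os }}-${{ github.sha }}
        restore-keys: |                          # fall back to the newest older entry for this OS
          ccache-${{ runner.os }}-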

3. Doing work faster

  • We have our choice of cloud VMs and on-prem machines. For cloud VMs, we can choose between Azure and AWS.
  • We generally want as many CPU cores as possible, though not all workflow actions can effectively utilize that parallelism.
  • We generally want as fast a disk as possible. Some cloud providers only make high-performance disks available separately from VMs.
  • We generally want enough storage for all local build artifacts (200GB+?). Going over a bit is fine but could be expensive in aggregate; a runner-selection sketch follows.
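
Steering a job onto bigger hardware is then mostly a matter of runner labels; a sketch with hypothetical labels that would map to whatever VM sizes we provision:

    # Hypothetical self-hosted runner labels for a larger machine with fast local disk.
    build-linux-packages:
      runs-on: [self-hosted, linux, 64-core, nvme]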

📋 Task list

We can take incremental steps towards better performance while designing for the longer term where we'll have more workflows.

  • Get ccache (or sccache) working reliably on Linux (over 90% cache hit rate; remote storage instead of local GitHub Actions caches?)
  • Get ccache (or sccache) working reliably on Windows (works locally already; turn on ccache debugging and check the logs)
  • Check existing cloud project configurations for runners and search for settings to tweak (e.g. SSD types, CPU machine instance types)
  • Identify targeted/granular builds that we'll want to select between based on modified files/projects
  • Add monitoring and track build performance over time (more important as we onboard more developers/projects); a minimal reporting sketch follows
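
For the monitoring item, one low-cost starting point is publishing ccache stats to the job summary on every run (GITHUB_STEP_SUMMARY is a stock Actions feature); trend tracking over time would need external storage:

    # Illustrative step: surface ccache stats on the workflow summary page.
    - name: Report ccache stats
      if: always()
      run: |
        echo '```' >> "$GITHUB_STEP_SUMMARY"
        ccache --show-stats >> "$GITHUB_STEP_SUMMARY"
        echo '```' >> "$GITHUB_STEP_SUMMARY"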