Optimize CI/CD builders #132

ScottTodd opened this issue Mar 4, 2025 · 0 comments

We've reached the point where GitHub Actions workflow performance is a bottleneck and it is time for a more principled approach to runner configuration and workflow design.

⏱️ Current performance

Build Linux Packages

Workflows take 50-60 minutes, with 5 minutes of overhead. Cache hit rates can drop to 75%.


Sample run: https://github.com/nod-ai/TheRock/actions/runs/13637190064/job/38118749344

  • Total time: 50m33s

    • Initialize containers: 46s
    • Restore cache: 8s
    • Checkout repo, fetch sources: 3m40s
    • Build packages: 45m19s
    • Save cache: 22s
  • ccache report:

    Cacheable calls:   324961 / 327035 (99.37%)
      Hits:            253175 / 324961 (77.91%)    <----
        Direct:        229324 / 253175 (90.58%)
        Preprocessed:   23851 / 253175 ( 9.42%)
      Misses:           71786 / 324961 (22.09%)
    Uncacheable calls:   1709 / 327035 ( 0.52%)
    Errors:               365 / 327035 ( 0.11%)
    Local storage:
      Cache size (GB):    0.7 /    0.7 (99.98%)    <----
      Cleanups:          4387
      Hits:            253175 / 324961 (77.91%)
      Misses:           71786 / 324961 (22.09%)
    

Notes:

  • 5 minutes of overhead, 10% of total time.
  • Based on prior experience with LLVM-based projects and ccache, cache hit rate is not linearly related to build time: a hit rate below 90-95% usually falls off a performance cliff.

Build Windows Packages

Workflows take 30-40 minutes, with 12 minutes of overhead. Cache hit rates are near 0%.


Sample run: https://github.com/nod-ai/TheRock/actions/runs/13637190064/job/38118748955

  • Total time: 34m38s (not building all packages yet)

    • Setup python: 1m4s
    • Install requirements: 20s
    • Restore cache: 12s
    • Checkout repo, fetch sources: 8m30s
    • Build: 22m47s
    • Save cache: 34s
    • Clean up build dir: 23s
  • ccache report:

    Cacheable calls:   8344 / 8618 (96.82%)
      Hits:              14 / 8344 ( 0.17%)    <----
        Direct:          14 /   14 (100.0%)
        Preprocessed:     0 /   14 ( 0.00%)
      Misses:          8330 / 8344 (99.83%)
    Uncacheable calls:  274 / 8618 ( 3.18%)
    Local storage:
      Cache size (GB):  0.8 /  0.7 (108.5%)    <----
      Cleanups:        4176
      Hits:              14 / 8344 ( 0.17%)
      Misses:          8330 / 8344 (99.83%)
    

Notes:

  • 12 minutes of overhead, 35% of total time.
  • The cache is not working on these runners, possibly related to how we share a persistent volume between multiple runners. A first debugging step is sketched below.
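
One way to investigate, assuming we keep ccache: turn on its debug logging in the Windows job and upload the logs for inspection. The environment variables are documented ccache options; the build command and paths here are illustrative.

    # Illustrative steps for diagnosing the ~0% hit rate on Windows.
    - name: Build with ccache debugging
      env:
        CCACHE_DEBUG: "1"                        # write per-object debug input/log files
        CCACHE_DEBUGDIR: ${{ github.workspace }}/ccache-debug
        CCACHE_LOGFILE: ${{ github.workspace }}/ccache.log
      run: cmake --build build                   # placeholder for the real build step
    - name: Upload ccache logs
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: ccache-debug-logs
        path: |
          ccache.log
          ccache-debug/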

📈 Future usage to plan for

  • Wider build matrix: many/all GPU families, multiple Python versions
  • Build/test pipelines (passing artifacts between jobs, storing logs/artifacts in the cloud)
  • Release builds

📖 High level areas to design for

  1. Do less work
    • Define fine-grained build/test slices and only run the relevant builds/tests for a given change. Also allow on-demand "try jobs" for any selection
  2. Re-do less work
    • Tune cache settings such that build actions get sufficient cache hit rates for good performance
    • Prefetch Dockerfiles, system deps, Python packages, git sources, and other data needed on build and test machines
  3. Do work faster
    • Use larger runner machines / VMs
    • Use faster storage disks (e.g. cloud providers have several tiers of SSDs to choose from)
    • Parallelize build actions on a single machine or across multiple machines, remove bottlenecks

✏️ Details for optimization areas

1. Doing less work

With a well-tuned cache, the build system should perform only minimal compilation when unrelated files are changed (except for changes to base dependencies that truly affect the full graph). Moving beyond a single-machine build, though, we'll want a solution for determining which follow-up builds, packages, and tests to run; one mechanism is sketched after the questions below.

  • If we build packages for multiple GPU families, which should a given workflow run include?
  • How does a developer request that their changes run a different subset of builds/tests?
  • If a change to one component (e.g. rocRAND) is made, what builds/tests should run?
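
As one concrete mechanism, GitHub Actions `paths` filters can do coarse per-workflow selection today, and `workflow_dispatch` inputs can serve as on-demand "try jobs". A minimal sketch, with hypothetical paths and a hypothetical gpu_families input (not our real layout):

    # Hypothetical trigger for the Linux package build: run it only when files
    # that feed into it change.
    name: Build Linux Packages
    on:
      pull_request:
        paths:
          - "cmake/**"
          - "external-builds/**"
          - ".github/workflows/build_linux_packages.yml"
      workflow_dispatch:
        inputs:
          gpu_families:
            description: "Comma-separated GPU families to build (on-demand try job)"
            required: false

Finer-grained selection (e.g. mapping changed files to individual components like rocRAND) would need logic inside the workflow rather than trigger-level filters.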

2. Re-doing less work

ccache (or sccache, or a similar tool) offers a number of options for squeezing out more performance (several of these knobs are sketched after the list):

  • https://ccache.dev/performance.html
  • https://ccache.dev/manual/4.10.2.html
  • Cache maximum size
  • Cache compression settings
  • Set compiler flags and file paths to be ccache-friendly
  • Local vs remote [shared] cache
  • Cache key selection for save/restore. This mostly matters for local caches used with GitHub Actions, where you restore a single cache entry and want it to be as close to your source tree as possible. With a large remote cache, we can generally expect cache entries from different epochs (i.e. LLVM commits before/after which the build is substantially different) to coexist.
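
For concreteness, here is a sketch of several of those knobs as job-level environment variables. The variable names are documented ccache options; the values are assumptions to iterate on, not validated settings.

    # Illustrative ccache tuning for a build job; values are starting points.
    env:
      CCACHE_DIR: ${{ github.workspace }}/.ccache
      CCACHE_MAXSIZE: 20G                        # reports above show a ~0.7G cache that is constantly evicting
      CCACHE_COMPRESS: "1"
      CCACHE_COMPRESSLEVEL: "5"                  # trades CPU for a smaller cache to save/restore
      CCACHE_BASEDIR: ${{ github.workspace }}    # rewrite absolute paths so entries are relocatable
      CCACHE_NOHASHDIR: "1"                      # keep differing working directories from poisoning keys
      # For a remote shared cache instead of per-runner local storage (hypothetical server):
      # CCACHE_REMOTE_STORAGE: "http://ccache-server.internal:8080|read-only=false"

Both ccache reports above show the cache at or over its 0.7GB limit with thousands of cleanups, so raising the maximum size is probably the first knob to try.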

Dependencies can be pre-downloaded (see the sketch after this list):

  • Dockerfiles
  • System dependencies (if not in Docker)
  • Python packages (pip cache)
  • Source files (git repositories, submodules)
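
With stock actions, much of this prefetching is built in; a minimal sketch, assuming hypothetical keys and paths:

    # Illustrative dependency caching steps; keys and paths are not our real ones.
    - uses: actions/setup-python@v5
      with:
        python-version: "3.12"
        cache: pip                               # caches pip downloads, keyed on requirements files
    - uses: actions/cache@v4
      with:
        path: .ccache
        key: ccache-${{ runner.os }}-${{ github.sha }}
        restore-keys: |                          # fall back to the newest older entry for this OS
          ccache-${{ runner.os }}-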

3. Doing work faster

  • We have our choice of cloud VMs and on-prem machines. For cloud VMs, we can choose between Azure and AWS.
  • We generally want as many CPU cores as possible, though not all workflow actions can effectively utilize that parallelism.
  • We generally want as fast a disk as possible. Some cloud providers only make high-performance disks available separately from VMs.
  • We generally want enough storage for all local build artifacts (200GB+?). Going over a bit is fine but could be expensive in aggregate; a runner-selection sketch follows.
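
Steering a job onto bigger hardware is then mostly a matter of runner labels; a sketch with hypothetical labels that would map to whatever VM sizes we provision:

    # Hypothetical self-hosted runner labels for a larger machine with fast local disk.
    build-linux-packages:
      runs-on: [self-hosted, linux, 64-core, nvme]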

📋 Task list

We can take incremental steps towards better performance while designing for the longer term where we'll have more workflows.

  • Get ccache (or sccache) working reliably on Linux (over 90% cache hit rate; remote storage instead of local GitHub Actions caches?)
  • Get ccache (or sccache) working reliably on Windows (works locally already; turn on ccache debugging and check the logs)
  • Check existing cloud project configurations for runners and search for settings to tweak (e.g. SSD types, CPU machine instance types)
  • Identify targeted/granular builds that we'll want to select between based on modified files/projects
  • Add monitoring and track build performance over time (more important as we onboard more developers/projects); a minimal reporting sketch follows
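
For the monitoring item, one low-cost starting point is publishing ccache stats to the job summary on every run (GITHUB_STEP_SUMMARY is a stock Actions feature); trend tracking over time would need external storage:

    # Illustrative step: surface ccache stats on the workflow summary page.
    - name: Report ccache stats
      if: always()
      run: |
        echo '```' >> "$GITHUB_STEP_SUMMARY"
        ccache --show-stats >> "$GITHUB_STEP_SUMMARY"
        echo '```' >> "$GITHUB_STEP_SUMMARY"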