
bundle #586

Open · rhubert wants to merge 7 commits into master from rh-bundle
Conversation

@rhubert (Contributor) commented Sep 23, 2024

Bundle the results of all checkoutSteps needed to build a package and provide a bundle-config file so that the package can be rebuilt from the bundle alone.

Such a bundle can be used to build on an air-gapped system or to archive all sources of a build. When building from the bundle, only the bundle is extracted and no other checkoutScripts are run.

There is one issue with the current URL-SCM extraction. The source workspace contains both the original downloaded file and the extracted sources. This unnecessarily doubles the size of the bundle and, since the bundle extraction uses the UrlScm as well, produces a different workspace hash when the bundle is extracted. That's why I changed the URL SCM to download the original file into workspace/../_download, where the .extracted file is also placed. This change currently makes the unit tests fail, because the ../_download folder is always the same when a tempDir is used.

I'm not sure how to proceed here:

  • fix the unit tests
  • use a workspace folder (e.g. workspace/.bob-download/) for the downloaded files and exclude this directory when hashing / bundling?
  • leave the download location as is and ignore the downloaded file and the .extracted file?
  • ...?
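For the second option, excluding a dedicated download directory from the workspace hash could look roughly like this. The helper name, the traversal, and the `.bob-download` directory name are only illustrative assumptions here; Bob's real workspace hashing works differently.

```python
import hashlib
import os

def hash_workspace(root, exclude=(".bob-download",)):
    """Hash all files under root, pruning excluded directories.

    Illustration only: ".bob-download" is the hypothetical location of
    the raw downloads, so their presence does not change the hash.
    """
    h = hashlib.sha1()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in-place so os.walk skips them
        dirnames[:] = sorted(d for d in dirnames if d not in exclude)
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            h.update(rel.encode())          # hash the relative path
            with open(path, "rb") as f:
                h.update(f.read())          # and the file content
    return h.hexdigest()
```

With such pruning, two workspaces with identical extracted sources hash the same whether or not the download directory is present.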

@jkloetzke (Member)

I would argue that the tarball download optimization is some welcome but unrelated optimization. I would move it to some separate PR that should probably be merged first.

Reusing the URL SCM for the bundles is IMHO not the right approach. It should instead work like the archive stuff. Right now binary artifacts are used for saving or restoring package steps. What we need here is to save and restore checkout steps. In the best case we can build on the archive module and reuse most code from there.

From a more general angle: should this work for indeterministic checkouts too? I would argue against this and only bundle deterministic checkouts. But if there is a good reason to decide otherwise I'm open to it. It's just that my gut feeling is that it will get nasty to get all corner cases correct...

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 87.03704% with 14 lines in your changes missing coverage. Please review.

Project coverage is 88.93%. Comparing base (d021b99) to head (b03033f).

Files with missing lines      Patch %   Lines
pym/bob/archive.py            84.93%    11 Missing ⚠️
pym/bob/builder.py            92.00%     2 Missing ⚠️
pym/bob/cmds/build/build.py   90.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #586      +/-   ##
==========================================
+ Coverage   88.87%   88.93%   +0.06%     
==========================================
  Files          48       48              
  Lines       15550    15644      +94     
==========================================
+ Hits        13820    13913      +93     
- Misses       1730     1731       +1     


@rhubert (Contributor, Author) commented Jan 4, 2025

I added a new implementation using the archive stuff. Since adding files to the final bundle with tarfile cannot be done in parallel, this finalization step is necessary. Maybe this could be optimized somehow, but it seems to be impossible to synchronize this asyncio stuff. It's either unable to pickle asyncio.Future objects, or the lock is created in the wrong loop. 🤷
With this finalization method it's fairly simple and works. I don't think time matters that much when a bundle is packed.
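As a rough sketch of what such a sequential finalization step amounts to (the function name and the plain list of artifact paths are hypothetical, not Bob's actual API):

```python
import os
import tarfile

def finalize_bundle(bundle_path, artifact_paths):
    """Append per-checkout artifacts into one bundle tar.

    tarfile only supports strictly sequential writes, so the bundle is
    assembled in a single pass after all parallel checkouts finished.
    """
    with tarfile.open(bundle_path, "w:gz") as bundle:
        for path in sorted(artifact_paths):   # deterministic member order
            bundle.add(path, arcname=os.path.basename(path))
```

Sorting the inputs keeps the member order, and thus the bundle, reproducible across runs.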

To get the blackbox test working, #606 is required. As of today I haven't tested bundling / unbundling a larger, real-world project; I'll do that in the next few days.

@rhubert force-pushed the rh-bundle branch 3 times, most recently from 9d269b4 to 059f566 (January 20, 2025 08:40)
@rhubert changed the title from "WIP: bundle" to "bundle" on Jan 22, 2025
@jkloetzke (Member)

I thought long about this and I have no final idea yet. So far, just some random thoughts:

  • We should probably use a zip-file. This has at least the benefit that we don't have to extract the whole bundle up-front when reading from it.
  • Your problems with asyncio come from the fact that all up- and downloads are executed in separate processes. They truly run in parallel and are forks of the original Bob Python process.
  • Do we really have to fail on indeterministic checkouts? What if we just ignore them?
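The up-front extraction point is the key property of zip: its central directory allows reading a single member directly, whereas a tar stream has to be scanned sequentially. A generic standard-library sketch (not Bob code):

```python
import zipfile

def read_member(bundle_path, member):
    """Read one member from a zip bundle without unpacking the rest.

    zipfile seeks straight to the member via the central directory,
    so bundle size barely matters for a single-file read.
    """
    with zipfile.ZipFile(bundle_path) as bundle:
        with bundle.open(member) as f:
            return f.read()
```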

IMHO, bundling the sources is nothing special compared to up- or downloading dist workspaces. In fact, I could imagine it would make sense to have this feature for all other transports too. Maybe not by default, but as an option. As such, what about defining new src-upload and src-download archive flags? If you set them explicitly for a backend, the sources are up- and downloaded.

In the end, the bundle option is just another way of defining a local source upload backend. It's just special in the way that it will create a big zip file in the end. That zip file could also hold "normal" binary artifacts. Not sure if this orthogonal design yields something useful but in principle it looks feasible.

I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

@rhubert (Contributor, Author) commented Jan 30, 2025

> Do we really have to fail on indeterministic checkouts? What if we just ignore them?

🤷 I don't use indeterministic checkouts so I don't have a strong opinion here. I see valid arguments for all 3 options, so maybe we should add a bundle-indeterministic=[fail,ignore,add] option to move this decision to the user?

> IMHO, bundling the sources is nothing special compared to up- or downloading dist workspaces. In fact, I could imagine it would make sense to have this feature for all other transports too. Maybe not as default but as option. As such, what about defining new src-upload and src-download archive flags? If you set them explicitly for a backend, the sources are up- and downloaded.
>
> In the end, the bundle option is just another way of defining a local source upload backend. It's just special in the way that it will create a big zip file in the end. That zip file could also hold "normal" binary artifacts. Not sure if this orthogonal design yields something useful but in principle it looks feasible.

Handling it like this, we don't even need the --bundle option, right? The user only has to define an archive backend with src-upload and build with --always-checkout to get the backend populated with all the sources. The bundle-indeterministic option would then become a src-upload-indeterministic flag?
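Spelled out, that could look like the following archive entry in a user configuration. Note that the src-upload/src-download flags are only a proposal from this thread and do not exist in Bob; the backend and path keys follow Bob's existing archive syntax.

```yaml
# Hypothetical: "src-upload"/"src-download" are proposed flags from
# this discussion, not an implemented Bob feature.
archive:
  - backend: file
    path: /mnt/source-mirror
    flags: [src-upload, src-download]
```

A build run with --always-checkout would then populate the backend with every checkout workspace, and the resulting directory could be carried over to the air-gapped machine.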

> I'm not yet sure what we should do about possible git or svn directories in source workspaces. As they are ignored when hashing the workspace we could skip them when packing the sources too...

Same as for the indeterministic packages. I think there are use cases where it could be useful to have these directories on the air-gapped machine, e.g. when development takes place there. If the bundle is just for review we can skip them, so I'd add another option here to make this configurable.
