Add functions to export and import packets to/from a zip file #132

Open · plietar wants to merge 12 commits into main from mrc-5164
Conversation

@plietar (Member) commented Mar 18, 2024

While the ability to push and pull from remote locations that orderly
already supports works well for back-and-forth collaboration, it is not
as well suited to producing a one-time artefact that can be released
and shared publicly without depending on a server or shared location.
One use case for this is for publishing a reproducible set of analyses
to accompany a paper.

This commit adds a pair of functions, orderly_export_zip and
orderly_import_zip. These allow a set of packets (and their transitive
dependencies) to be exported as a standalone zip file, containing both
the metadata and files. The zip file can then be imported into a
different repository.

The zip file is formatted as a metadata directory, with a file per
packet, and a content-addressed file store.
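
As a rough illustration, the intended workflow might look like the sketch below. The function names follow the description above, but the package name and argument names are assumptions and may not match the actual implementation (the functions are also renamed later in this review).

```r
# Illustrative sketch only: the package name and argument names are
# assumptions, not the actual signatures from this PR.
library(orderly2)

# In the source repository: export a packet and its transitive
# dependencies into a standalone zip file.
orderly_export_zip("analyses.zip", packets = "20240318-143025-eb1c2a43")

# In a different repository: import the metadata and files from that zip.
orderly_import_zip("analyses.zip", root = "path/to/other/repo")
```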

codecov bot commented Mar 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (5e11123) to head (751c93c).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #132   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           40        41    +1     
  Lines         3671      3750   +79     
=========================================
+ Hits          3671      3750   +79     


@plietar plietar force-pushed the mrc-5164 branch 2 times, most recently from 4dfdccb to df38941 Compare March 19, 2024 17:30
@plietar plietar changed the title Add a way to export packets in a redistributable way Add functions to export and import packets to/from a zip file Mar 19, 2024
@plietar plietar marked this pull request as ready for review March 19, 2024 17:35
@plietar plietar requested a review from richfitz March 19, 2024 17:35
@plietar (Member, Author) commented Mar 19, 2024

There's an open question left in the implementation, which I've marked as a TODO: when importing the zip file, we need to hash the metadata files to add the packet to the local location. At present, I've hashed the metadata file using the algorithm configured on the root we are importing into. However, this is wrong if the root that originally created the packet used a different algorithm. We might end up with two different hashes, using two different algorithms, for the same packet ID. This will cause problems later if the two roots end up sharing a location.

To solve this we would need to record the hash (or hash algorithm) somewhere in the zip file. I'm not sure what the best place would be. Possible options:

  • Include an index file at the root of the zip file, with the list of packets and their hashes. This is more or less the equivalent of the /metadata/list endpoint of the HTTP API.
  • Wrap each metadata file in a JSON object, along the lines of { "hash": "sha256:xxx", "contents": "{ ... }" }, where contents is a string containing the packet metadata. We need it to be a string to preserve its encoding and hash.
  • Name the metadata files based on their hash. Basically this would make the metadata directory content-addressed as well. We wouldn't lose track of the packet ID since it is contained in the metadata.

@plietar (Member, Author) commented Mar 19, 2024

I ended up going with the first option, of including an outpack.json file at the root of the zip file which contains the list of packets and their hash. This also lets us detect zip files that are obviously not supposed to be imported.
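
For illustration, such an index might look roughly like the sketch below. The field names and layout are guesses based on this comment, not the actual on-disk format used by the PR.

```r
# Sketch of a possible outpack.json index, serialised with jsonlite.
# Field names and values are illustrative assumptions only.
index <- list(
  packets = list(
    list(packet = "20240318-143025-eb1c2a43", hash = "sha256:aaaa..."),
    list(packet = "20240319-091407-5d1fa8c0", hash = "sha256:bbbb...")
  )
)

# Written to the root of the zip; on import, its presence (and contents)
# can be used to recognise a valid export and recover each packet's hash.
jsonlite::write_json(index, "outpack.json", auto_unbox = TRUE, pretty = TRUE)
```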

@richfitz (Member) left a comment

Thanks Paul - this is great, and I think some of the issues that I raised here will sort themselves out once it collides with reality.

The biggest bit to think about is whether we want the whole tree or not, and if we want to think about this now or later. We've actually done more extreme versions of this in the COVID work where we needed to create a new packet that "starts again", cutting itself off from its parents with all artefacts appearing fully formed (probably this will be easier to explain with waving hands).

From the ticket, I do think that the static site version might be worth thinking about - it's not that bad to set up a site, though we may find that users hit issues with what git (and before that github) likes to store. Using the GitHub API for releases might help there, though, as that's a great place to throw bigger files.

R/export.R — three outdated review threads (resolved)
R/export.R Outdated
call = environment())

index <- root$index$data()
packets <- find_all_dependencies(packets, index$metadata)
@richfitz (Member) commented:
I wonder if we should offer a non-recursive option here at some (future) point? So that you can dump out a particular packet but not all the bits that lead to it? This would often be much smaller, and more relevant to most users.

The issue here is this requirement:

She has a sequence of Orderly reports. The first report pulls in from a private data source (which cannot be shared), does a bit of analysis and produces an artefact. The subsequent reports build from the first one's output and apply further analyses.

So by not including all dependencies we can be sure that no private data is exposed.

At present we need to include all metadata, I believe, but importing a packet into a location without its dependent files should be fine so long as the user has require_complete_tree set to FALSE.
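
To make the suggestion concrete, a non-recursive export might look like the sketch below. This is purely hypothetical: the recursive argument does not exist in this PR and is shown only to illustrate the idea.

```r
# Hypothetical sketch only: 'recursive' is not an argument of the function
# added in this PR; it illustrates exporting a single packet without the
# packets it depends on.
orderly_export_zip("report-only.zip",
                   packets = "20240319-091407-5d1fa8c0",
                   recursive = FALSE)

# Importing such a zip would then rely on the destination root having
# require_complete_tree set to FALSE, as noted above.
```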

@plietar (Member, Author) commented:

I hadn't considered the fact that orderly might be storing the inputs to the first task. I think she is using orderly_shared_resource to ingest the data, which is probably going to leak into the zip file.

On the other hand, not including the first packet means you can't reproduce the second one, at least not without some shenanigans to extract the inputs from the second packet.

I think we might need a way of including the first packet, but only its artefacts, not its inputs.

I'll check with Sangeeta exactly how she's doing it to make sure nothing is leaking accidentally.

@plietar (Member, Author) commented Mar 20, 2024:

Nevermind, I was wrong about where the data comes from. It actually comes from a Microsoft Access database, which is being read via orderly.sharedfile. There shouldn't be any particular concern about data leakage, since anything that is extracted from the database can be made public. Unlike orderly_shared_resource, orderly.sharedfile doesn't copy the file into the packet (it only records a hash of it).

We might need to do something different in the future, but this seems fine for now.

R/export.R — outdated review thread (resolved)
_pkgdown.yml Outdated
- title: Exporting packets
  contents:
    - orderly_export_zip
    - orderly_import_zip
@richfitz (Member) commented:

Torn here on the names; I wonder about orderly_zip_import / orderly_zip_export as most of the names work like this (to help with autocomplete). OTOH, if we had different import/export options in future, your way is better.

@plietar (Member, Author) commented:

Yeah now that you mention it I think I like those names better. I renamed them, and also renamed the file to zip.R.

I think I would have preferred a more generic name instead of zip, and then in the future accept a format = "zip" argument that could take other values. Sadly the most obvious candidate, archive, is already used to mean something completely different. tarball kind of works, but it implies the tar format, which isn't much better than zip.

plietar and others added 12 commits April 3, 2024 15:46
Co-authored-by: Rich FitzJohn <[email protected]>