Add functions to export and import packets to/from a zip file #132
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff against main (#132):

                 main     #132    +/-
    Coverage  100.00%  100.00%
    Files          40       41     +1
    Lines        3671     3750    +79
  + Hits         3671     3750    +79

View full report in Codecov by Sentry.
(Force-pushed from 4dfdccb to df38941.)
There's an open question left in the implementation, which I've marked as a TODO: when importing the zip file, we need to hash the metadata files in order to add the packet to the index. To solve this we would need to record the hash (or the hash algorithm) somewhere in the zip file. I'm not sure what the best place would be. Possible options:
I ended up going with the first option.
Thanks Paul - this is great, and I think some of the issues that I raised here will sort themselves out once it collides with reality.
The biggest bit to think about is whether we want the whole tree or not, and if we want to think about this now or later. We've actually done more extreme versions of this in the COVID work where we needed to create a new packet that "starts again", cutting itself off from its parents with all artefacts appearing fully formed (probably this will be easier to explain with waving hands).
From the ticket, I do think that the static site version might be worth thinking about - it's not that bad to set up a site, though we may find that users hit issues with what git (and before that github) likes to store. Use of the github api to use releases might help there though, as that's a great place to throw bigger files.
In R/export.R (outdated):

    call = environment())

    index <- root$index$data()
    packets <- find_all_dependencies(packets, index$metadata)
I wonder if we should offer a non-recursive option here at some (future) point? So that you can dump out a particular packet but not all the bits that lead to it? This would often be much smaller, and more relevant to most users.
The issue here is this requirement:
She has a sequence of Orderly reports. The first report pulls in from a private data source (which cannot be shared), does a bit of analysis and produces an artefact. The subsequent reports build from the first one's output and apply further analyses.
So by not including all dependencies we can be sure that no private data has been exposed.
At present we need to include all metadata I believe, but importing a packet into a location without its dependent files should be fine so long as the user has require_complete_tree set to FALSE.
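To make the discussion concrete, here is a hypothetical sketch of what that could look like. The recursive argument is the future option floated above and does not exist in this PR, and the argument names are assumptions for illustration, not the package's documented API:

```r
# Hypothetical: export one packet without its dependency tree.
# `recursive` is the future option discussed above -- NOT part of this PR;
# other argument names are assumed for illustration only.
orderly_export_zip("analysis.zip", packets = packet_id, recursive = FALSE)

# Importing into a root configured with require_complete_tree = FALSE
# should then work even though the dependencies' files are absent
# (their metadata would still need to be present).
orderly_import_zip("analysis.zip", root = "path/to/other/root")
```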
I hadn't considered the fact that orderly might be storing the inputs to the first task. I think she is using orderly_shared_resource to ingest the data, which is probably going to leak into the zip file.
On the other hand, not including the first packet means you can't reproduce the second one, at least not without some shenanigans to extract the inputs from the second packet.
I think we might need a way of including the first packet, but only its artefacts, not its inputs.
I'll check with Sangeeta exactly how she's doing it to make sure nothing is leaking accidentally.
Never mind, I was wrong about where the data comes from. It actually comes from a Microsoft Access database that is being read via orderly.sharedfile. There shouldn't be any particular concern about data leakage, since anything extracted from the database can be made public. Unlike orderly_shared_resource, orderly.sharedfile doesn't copy the file into the packet (it only records a hash of it).
We might need to do something different in the future, but this seems fine for now.
In _pkgdown.yml (outdated):

    - title: Exporting packets
      contents:
        - orderly_export_zip
        - orderly_import_zip
Torn here on the names; I wonder about orderly_zip_import / orderly_zip_export, as most of the names work like this (to help with autocomplete). OTOH, if we had different import/export options in future, your way is better.
Yeah, now that you mention it I think I like those names better. I renamed them, and also renamed the file to zip.R.
I think I would have preferred a more generic name instead of zip, and then in the future accepted a format = "zip" argument that could take other values. Sadly the most obvious candidate, archive, is already used to mean something completely different. tarball kind of works, but it implies the tar format, which is not much better than zip is.
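For illustration, the more generic interface being wished for here might have looked something like this. The function name and signature are entirely hypothetical, sketching the naming trade-off rather than anything implemented in this PR:

```r
# Hypothetical generic export, where the container format is an argument
# rather than part of the function name. None of this exists in the PR.
orderly_bundle_export("analysis.zip", packets = packet_id, format = "zip")
orderly_bundle_export("analysis.tar.gz", packets = packet_id, format = "tarball")
```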
While the ability to push and pull from remote locations that orderly already supports works well for back-and-forth collaboration, it is not as well suited to produce a one-time artefact that may be released and shared publicly, without depending on a server or shared location. One use case for this is for publishing a reproducible set of analyses to accompany a paper. This commit adds a pair of functions, orderly_export_zip and orderly_import_zip. These allow a set of packets (and their transitive dependencies) to be exported as a standalone zip file, containing both the metadata and files. The zip file can then be imported into a different repository. The zip file is formatted as a metadata directory, with a file per packet, and a content-addressed file store.
Co-authored-by: Rich FitzJohn <[email protected]>
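As a sketch of the layout described in the commit message (the packet ids and hashes below are made up for illustration, and the exact directory names are assumptions), the archive might look like:

```
export.zip
├── metadata/
│   ├── 20231011-163921-a1b2c3d4    # one metadata file per packet
│   └── 20231012-091502-e5f6a7b8
└── files/
    ├── sha256:0f3a...              # content-addressed file store,
    └── sha256:9c1d...              # keyed by hash of file contents
```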