Archive vcerare data #438
vceregen data and actually pass contributor type to Zenodo metadata
Looks good! I had one note for a potential future improvement that I don't think we really need to worry about at this point, and one minor non-blocking style suggestion.
bucket = storage.Client().get_bucket(self.bucket_name)
blobs = bucket.list_blobs(prefix=f"{self.name}/")  # Get all blobs in folder

for blob in blobs:
Not blocking: At some point it might be cool to use universal-pathlib, which lets you work with GCS objects like normal Python Paths. I definitely don't think it's worth the effort right now though.
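For reference, a rough sketch of what that could look like. The helper below is hypothetical (it is not code from this PR) and is written only against the pathlib interface. It runs here on a local temporary directory. The assumption, not verified in this sketch, is that a universal-pathlib UPath("gs://<bucket>/<name>/") backed by gcsfs could be passed in place of the local Path.

```python
import tempfile
from pathlib import Path


def list_archive_files(root):
    """Return the sorted relative names of every file under ``root``.

    Hypothetical helper: ``root`` only needs to behave like ``pathlib.Path``
    (``rglob``, ``is_file``, ``relative_to``). Assumption: a universal-pathlib
    ``UPath`` pointing at a GCS folder offers the same methods.
    """
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


# Local stand-in for the bucket folder, so the sketch runs without GCS access.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "vcerare"
    (folder / "docs").mkdir(parents=True)
    (folder / "README.md").write_text("readme")
    (folder / "docs" / "notes.pdf").write_bytes(b"%PDF")
    local_listing = list_archive_files(folder)
```

The point of the design is that the listing logic no longer mentions buckets or blobs, so it can be exercised locally in tests.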
# If data source was manually archived by us, specify that the
# data_source.path is a documentation link, rather than where we archived
# the data from.
if data_source_id in ["gridpathratoolkit", "vceregen"]:
Might be slightly cleaner to just set the title/description in the if statement and have a single return at the end since the rest of the fields are all the same.
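A minimal sketch of that shape, for illustration only. SourceLink, build_source_link, MANUALLY_ARCHIVED, and the title/description strings are all hypothetical stand-ins, since the actual fields in the PR are not shown here. Only the branch-then-single-return structure is the point.

```python
from dataclasses import dataclass

# Dataset IDs taken from the snippet under review.
MANUALLY_ARCHIVED = {"gridpathratoolkit", "vceregen"}


@dataclass
class SourceLink:
    """Hypothetical stand-in for the object the real function returns."""

    title: str
    description: str
    path: str


def build_source_link(data_source_id: str, path: str) -> SourceLink:
    # Only the fields that differ are set inside the branch.
    if data_source_id in MANUALLY_ARCHIVED:
        title = "Documentation"  # placeholder wording
        description = "Documentation for data that was archived manually."
    else:
        title = "Original source"  # placeholder wording
        description = "Location the data was archived from."
    # Shared fields are written once, in a single return.
    return SourceLink(title=title, description=description, path=path)
```

With one return, adding a shared field later means touching one place instead of two.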
Overview
Closes #434.
What problem does this address?
Adds vcerare data to Zenodo. Because this data was provided to us and isn't available online, we upload our copy to a GCS bucket so we can archive from a stable source and capture any file changes over time as new data is added.
What did you change in this PR?
- Archive vcerare files from the sources.catalyst.coop GCS bucket
- Add .pdf and .md filetypes to frictionless
- Add vcerare to run-archiver.yml and add GCS configuration
Note: Zenodo does not allow you to set which files are previewed from the API. It automatically previews the "first file", which it determines by alphabetical sorting - this will always be datapackage.json for this repository. We'll need to manually check "preview" next to the README file before publishing when we update this dataset.
Note 2: This PR contemplated adding contributor types to the Zenodo payload. We were setting them in zenodo_role but not passing them to the Zenodo API. I configured this to specify type for creator, as is possible in the GUI, but in the API this is only possible for contributors. So, unless we reconfigure the way we handle creators, this isn't possible - for now, we'll just make any adjustments manually in the GUI since we rarely refresh metadata.
Testing
How did you make sure this worked? How can a reviewer verify this?
Run pudl_archiver --datasets vcerare --sandbox --initialize. See also: https://zenodo.org/records/13919960
To-do list
Tasks