Archive `vcerare` data #438

e-belfer · 2024-10-04T16:23:00Z

Overview

Closes #434.

What problem does this address?
Adds vcerare data to Zenodo. Because this data was provided to us and isn't available online, we upload our copy to a GCS bucket so we can archive from a stable source, as well as capture any file changes over time as new data is added.

What did you change in this PR?

Archive vcerare files from the `sources.catalyst.coop` GCS bucket
Add .pdf and .md filetypes to frictionless
Add vcerare to run-archiver.yml and add GCS configuration

Note: Zenodo does not allow you to set which files are previewed from the API, and it automatically previews the "first file", which it determines by alphabetical sorting - this will always be datapackage.json for this repository. We'll need to manually check "preview" next to the README file before publishing when we update this dataset.

Note 2: This PR contemplated adding contributor types to the Zenodo payload. We were setting them in zenodo_role but not passing them to the Zenodo API. I configured this to specify a type for each creator, as is possible in the GUI, but the API only allows this for collaborators. So, unless we reconfigure the way we handle creators, this isn't possible. For now, we'll just make any adjustments manually in the GUI, since we rarely refresh metadata.
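
To illustrate the limitation in Note 2, here is a minimal sketch of how a deposition metadata payload might be assembled. It assumes the "collaborators" mentioned above correspond to the `contributors` field of Zenodo's deposit metadata, which accepts a `type`, while entries under `creators` do not. The helper name `build_metadata` and the sample people are hypothetical and are not taken from pudl_archiver.

```python
def build_metadata(title, people):
    """Split people into Zenodo-style creators and contributors.

    Hypothetical helper: a person with a "type" is sent as a contributor,
    because (by assumption) only contributors may carry a type in the API.
    """
    creators, contributors = [], []
    for person in people:
        entry = {"name": person["name"], "affiliation": person["affiliation"]}
        if person.get("type"):
            contributors.append({**entry, "type": person["type"]})
        else:
            creators.append(entry)
    return {"title": title, "creators": creators, "contributors": contributors}


# Illustrative values only.
metadata = build_metadata(
    "Example deposition",
    [
        {"name": "Example Creator", "affiliation": "Example Org"},
        {"name": "Example Curator", "affiliation": "Example Org", "type": "DataCurator"},
    ],
)
```

Under this shape, attaching a type to a creator would mean moving that person into `contributors`, which is the reconfiguration the note refers to.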

Testing

How did you make sure this worked? How can a reviewer verify this?
Run pudl_archiver --datasets vcerare --sandbox --initialize. See also: https://zenodo.org/records/13919960

To-do list

Tasks


Add GCS credentials to GitHub, mirroring pudl-usage-metrics
Review the PR yourself and call out any questions or issues you have
Debug gdal problem - see gdal v3.9.3 conda-forge/gdal-feedstock#991 or pin in PUDL?

zschira

Looks good! I had one note for a potential future improvement that I don't think we really need to worry about at this point, and one minor non-blocking style suggestion.

zschira · 2024-10-15T21:47:00Z

src/pudl_archiver/archivers/vcerare.py

+        bucket = storage.Client().get_bucket(self.bucket_name)
+        blobs = bucket.list_blobs(prefix=f"{self.name}/")  # Get all blobs in folder
+
+        for blob in blobs:


Not blocking: At some point it might be cool to use universal-pathlib, which allows you to work with GCS objects like normal Python Path objects. I definitely don't think it's worth the effort right now though.
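
As a rough sketch of that idea: code written against the `pathlib` interface can stay the same whether it receives a local `Path` or, assuming universal-pathlib and a GCS filesystem backend such as gcsfs are installed, a `UPath("gs://...")`. The function name `list_archive_files` and the example file names are hypothetical. A local temporary directory stands in for the bucket so the snippet runs without credentials.

```python
import tempfile
from pathlib import Path


def list_archive_files(root):
    """Return sorted names of the files directly under a pathlib-style directory."""
    return sorted(p.name for p in root.iterdir() if p.is_file())


# Local stand-in for the bucket folder, so no GCS credentials are needed.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "vcerare"
    folder.mkdir()
    (folder / "README.md").write_text("docs")
    (folder / "data.zip").write_bytes(b"")
    local_names = list_archive_files(folder)

# Assumed GCS equivalent (not run here; bucket name taken from the PR description):
#   from upath import UPath
#   list_archive_files(UPath("gs://sources.catalyst.coop/vcerare/"))
```

The trade-off is an extra dependency in exchange for not handling `storage.Client` and blob objects directly.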

zschira · 2024-10-15T21:50:16Z

src/pudl_archiver/depositors/zenodo/entities.py

+        # If data source was manually archived by us, specify that the
+        # data_source.path is a documentation link, rather than where we archived
+        # the data from.
+        if data_source_id in ["gridpathratoolkit", "vceregen"]:


Might be slightly cleaner to just set the title/description in the if statement and have a single return at the end since the rest of the fields are all the same.
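
A sketch of the suggested shape, using a simplified stand-in for the real entity: only the title and description vary by branch, and everything else is built once in a single return. The `SourceLink` dataclass, the `describe_source` function, the field names, and the strings are hypothetical and do not mirror `entities.py`.

```python
from dataclasses import dataclass

# IDs of sources archived manually from our own copy (from the diff above,
# with vceregen renamed to vcerare as in the later commits).
MANUALLY_ARCHIVED = {"gridpathratoolkit", "vcerare"}


@dataclass
class SourceLink:
    """Hypothetical stand-in for the metadata entity built in entities.py."""

    title: str
    description: str
    path: str


def describe_source(data_source_id: str, path: str) -> SourceLink:
    """Build the link entity with one return; only the text differs by branch."""
    if data_source_id in MANUALLY_ARCHIVED:
        title = f"{data_source_id} documentation"
        description = "Documentation for data provided to us and archived manually."
    else:
        title = f"{data_source_id} source"
        description = "Location the data was archived from."
    # Shared fields are set in one place, so adding a field touches one line.
    return SourceLink(title=title, description=description, path=path)
```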

e-belfer added 2 commits October 4, 2024 08:24

Add VCE rengen archiver

7e14840

Add contributor type and add README.md file and md file type

fda8146

e-belfer added zenodo vcerare VCE Resource Adequacy Renewable Energy (RARE) data labels Oct 4, 2024

e-belfer self-assigned this Oct 4, 2024

e-belfer mentioned this pull request Oct 4, 2024

Archive gridpathratoolkit data from GCS #439

Merged

e-belfer and others added 7 commits October 4, 2024 14:46

Add vceregen to GHA and authenticate GCS

bf65d1f

Add sandbox and initialize flags to test

e57f920

Update permissions

cef3fe4

Merge branch 'main' into vceregen

c823fa6

Fix indentation on workflow file

fdf986c

Remove zenodo user role, remove sandbox

29054bb

Restore archiver command

9a8f58c

e-belfer requested a review from aesharpe October 11, 2024 13:50

Add prod and sandbox dois to yaml

47e5dbf

e-belfer changed the title ~~Archive vceregen data and actually pass contributor type to Zenodo metadata~~ Archive vceregen data Oct 11, 2024

update vceregen to vcerare

71c5b1f

e-belfer changed the title ~~Archive vceregen data~~ Archive vcerare data Oct 15, 2024

e-belfer and others added 5 commits October 15, 2024 17:04

Merge branch 'main' into vceregen

cfb3e56

Fix docstring

6b68bb1

Update gdal version

17cc181

update gdal version

7e735a7

Update vcerare in entities.py

09da494

zschira approved these changes Oct 15, 2024


e-belfer and others added 5 commits October 16, 2024 08:28

Update doi and clean up return function

262a3ff

Update vceregen to vcerare

8d5b21f

Merge branch 'main' into vceregen

cfc5dee

Merge branch 'main' into vceregen

076ef3d

Merge branch 'main' into vceregen

e30bd6b

e-belfer merged commit 809c37c into main Oct 22, 2024
3 checks passed

e-belfer deleted the vceregen branch October 22, 2024 14:12

e-belfer commented Oct 4, 2024 • edited by zaneselvans
