Add kaggle update to release script #3182

Open
3 tasks
bendnorman opened this issue Dec 21, 2023 · 1 comment
Labels
kaggle: Sharing our data and analysis with the Kaggle community
nightly-builds: Anything having to do with nightly builds or continuous deployment.

Comments

@bendnorman
Member

Right now we are manually updating our Kaggle dataset. Ideally, we would use the Kaggle API to automatically publish a new version of the Kaggle dataset whenever there is a new version of the data. I took a stab at using the Kaggle API but ran into an issue.

Kaggle uses the datapackage schema to track metadata about datasets. I pulled the existing metadata for the PUDL dataset with this command:

kaggle datasets metadata -p . catalystcooperative/pudl-project

where the current directory contained all of the .parquet, .sqlite.gz and .json files of the nightly outputs.

Then I tried to create a new version with this command:

kaggle datasets version -p . -m "Update PUDL dataset to use nightly build outputs from 2023.12.20"

This uploaded all of the data but then failed with this error: Dataset version creation error: Incompatible Dataset Type
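For reference, here is a rough sketch of how those two commands might be wrapped as a step in the release script. The function names, the `DRY_RUN` switch, and the `./nightly` directory are hypothetical and only there for illustration; the `kaggle` invocations are the same ones shown above. Run against the current dataset, the second command would presumably hit the same Incompatible Dataset Type error, so this only helps once the dataset can be updated from the CLI.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: build the version message for a given build date.
version_message() {
    echo "Update PUDL dataset to use nightly build outputs from $1"
}

# Hypothetical helper: echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = directory holding the nightly outputs, $2 = build date (e.g. 2023.12.20)
update_kaggle_dataset() {
    # Pull the existing dataset metadata into the output directory.
    run kaggle datasets metadata -p "$1" catalystcooperative/pudl-project
    # Upload the directory contents as a new dataset version.
    run kaggle datasets version -p "$1" -m "$(version_message "$2")"
}
```

Calling `update_kaggle_dataset ./nightly 2023.12.20` with `DRY_RUN=1` set only prints the two commands, which makes it easy to check the step without uploading anything.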

There might be a bug that prevents folks from updating manually created datasets using the Kaggle API. I was able to initialize and update a private Kaggle dataset with the same pudl output files using the CLI.

I propose we point our notebooks at a new kaggle dataset that can be updated using the CLI.
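A minimal sketch of what setting up such a CLI-managed dataset could look like, assuming the `kaggle datasets init` and `kaggle datasets create` subcommands. The function names, the `DRY_RUN` switch, and the `./nightly` directory are hypothetical. Note that `init` only writes a template `dataset-metadata.json`, whose title and id need to be filled in by hand before `create` is run.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: echo the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Write a template dataset-metadata.json into the output directory ($1).
# The template's title and id must be edited before creating the dataset.
init_kaggle_dataset() {
    run kaggle datasets init -p "$1"
}

# Create the new dataset from the files in the output directory ($1).
# Later releases would then use `kaggle datasets version` against it.
create_kaggle_dataset() {
    run kaggle datasets create -p "$1"
}
```

These are one-time setup steps, so they probably belong outside the release script itself; only the `kaggle datasets version` call would need to run on every release.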

Tasks

@zaneselvans
Member

Datapackage annotation

When I first set up the Kaggle dataset, I ran some tests trying to use a datapackage.json to annotate the dataset and found the infrastructure to be non-functional. I posted several messages in their support forums:

Note that only frictionless>=5 can annotate an SQLite DB.

Updating the dataset

Kaggle will automatically create a new version of the dataset on whatever schedule we want (daily, weekly, etc.; I had it set to weekly updates), and it will pull new data from the URLs that are specified as the data sources. We only need to intervene when those URLs change, which will hopefully be pretty uncommon. We can decide to point the dataset at /nightly or maybe /stable. Obviously it would be better if we could have it automatically pick up changes in the URLs too! But this is already pretty good.

@zaneselvans added the nightly-builds and kaggle labels on Jan 8, 2024
@jdangerx moved this from New to Icebox in Catalyst Megaproject on Jan 29, 2024
Projects
Status: Icebox