Data has gone stale #127

Open
sigmondzavion opened this issue Apr 24, 2022 · 10 comments
Labels
help wanted Extra attention is needed

Comments

@sigmondzavion

The last update appears to be 4/16.

Q: Do you have an ETA for making the data current?

@aminoplis

Hi, any answer to this?

@alan-isaac

Still stale. :-(

@anuveyatsu added the "help wanted" (Extra attention is needed) label on Aug 9, 2022
@seun-beta

seun-beta commented Oct 11, 2022

Hello @anuveyatsu

I hope you are good.

I would love to take up this task to update the data on this repo.

I am currently working on it.

@seun-beta

seun-beta commented Oct 12, 2022

Hello @anuveyatsu

I was able to find the cause. The GitHub Actions workflow fails because the CSV files are larger than 100 MB, which is GitHub's maximum file size.

I think the results should either be written to CSV and then compressed (zipped) to reduce the size, or be stored in Parquet format instead.
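For illustration, the compression option could look roughly like the sketch below. It assumes the workflow has already written plain CSV files to disk; the helper names `gzip_csv` and `fits_on_github` are hypothetical and not part of this repository.

```python
import gzip
import shutil
from pathlib import Path

# Size limit mentioned in this thread (GitHub rejects files above 100 MB).
GITHUB_MAX_BYTES = 100 * 1024 * 1024


def gzip_csv(src: Path) -> Path:
    """Write a gzip-compressed copy of src next to it and return the new path."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        # Stream in chunks so a large CSV is never loaded fully into memory.
        shutil.copyfileobj(f_in, f_out)
    return dst


def fits_on_github(path: Path) -> bool:
    """Return True if the file is within the assumed per-file size limit."""
    return path.stat().st_size <= GITHUB_MAX_BYTES
```

The workflow could then commit only the `.csv.gz` files whose `fits_on_github` check passes. Whether the compressed output stays under the limit depends on the data, so this is not guaranteed.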

Please let me know what you think about it.

@anuveyatsu
Member

Thank you @seun-beta for spending time to investigate this issue 👍🏼

I think the best option would be to use Git LFS (https://docs.github.com/en/repositories/working-with-files/managing-large-files/configuring-git-large-file-storage) so that we can keep the data in a consistent format. I'm not sure you'd be able to complete it, though, because I think we need to wire up external blob storage here (e.g., S3, Google Cloud Storage, etc.).

@seun-beta

Hello @anuveyatsu

Thank you for your response. I also researched Git LFS initially but the overall setup was a little too much.

An idea about using S3 and Boto3 just popped into my mind. When the workflow run is triggered based on the cron configuration, the code could push results into S3 directly.
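A rough sketch of that S3 idea, assuming the workflow leaves its CSVs in a local directory and that AWS credentials reach the job through environment variables (for example, repository secrets). The bucket name, prefix, and `data` path are placeholders, and `upload_csvs` is a hypothetical helper; only `boto3.client("s3")` and its `upload_file(Filename, Bucket, Key)` method are real Boto3 calls.

```python
from pathlib import Path


def upload_csvs(s3_client, bucket: str, data_dir: Path, prefix: str = "") -> list[str]:
    """Upload every CSV in data_dir to the bucket and return the object keys."""
    keys = []
    for path in sorted(data_dir.glob("*.csv")):
        key = f"{prefix}{path.name}"
        # Boto3 S3 client call: upload_file(Filename, Bucket, Key)
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


def main() -> None:
    # Imported lazily so the helper above can be exercised without boto3 or AWS.
    import boto3

    client = boto3.client("s3")  # reads credentials from the environment
    # Placeholder bucket, directory, and prefix; replace with real values.
    upload_csvs(client, "example-covid-19-data", Path("data"), prefix="covid-19/")
```

Calling `main()` at the end of the scheduled job would push the results to S3 instead of committing them, which sidesteps the per-file size limit in the repository.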

What do you think about that?

@mforsetti

Hello,

I've tried deploying Git LFS, and I'm getting this error:

> [main eb55196] Auto-update of the data packages
>  8 files changed, 24 insertions(+), 8580484 deletions(-)
> batch response: @github-actions[bot] can not upload new objects to public fork mforsetti/covid-19
> error: failed to push some refs to 'https://github.com/mforsetti/covid-19'
> Error: Process completed with exit code 1.

Apparently Git LFS refuses to push to forks of a parent repo that does not use Git LFS. See git-lfs/git-lfs#1906.

What about gzipping the generated CSVs? We can add the decompression (gunzip) code to the scripts/update_datapackage.py script.
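As a sketch of the reading side, a script could open `.csv.gz` and plain `.csv` files through one helper. This assumes UTF-8 CSVs with a header row; `open_csv` and `count_rows` are hypothetical names, and nothing here reflects what scripts/update_datapackage.py currently does.

```python
import csv
import gzip
from pathlib import Path


def open_csv(path: Path):
    """Open a CSV as text, decompressing on the fly if the name ends in .gz."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def count_rows(path: Path) -> int:
    """Count data rows (excluding the header) in a plain or gzipped CSV."""
    with open_csv(path) as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if any
        return sum(1 for _ in reader)
```

Because the helper returns an ordinary text file object in both cases, the rest of the script would not need to know whether a file was compressed.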

@gradedSystem
Member

@anuveyatsu I think we have two approaches here. We can try Git LFS, although, as @mforsetti mentioned, it seems Git LFS fails to push to public forks. Alternatively, we can gzip the CSVs or choose another file format.

@anuveyatsu
Member

@gradedSystem I don't believe it makes sense, as the upstream repo has been archived: https://github.com/CSSEGISandData/COVID-19

@gradedSystem
Member

@anuveyatsu noted


7 participants