Allow entire dataset to be downloaded en-masse #669

Closed
1 task
AetherUnbound opened this issue Sep 20, 2022 · 4 comments
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🌟 goal: addition Addition of new feature 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: api Related to the Django API 💬 talk: discussion Open for discussions and feedback

Comments

@AetherUnbound
Collaborator

Description

Presently, users who want our entire dataset must crawl through every possible search and hope the results cover everything we have. We've discussed this in the past, but it would be ideal to offer a bulk download option for those who would like to use the entire dataset (e.g. iNaturalist's dataset: https://github.com/inaturalist/inaturalist-open-data).

This could be publicly accessible Parquet or TSV files on S3, or some other means of pulling the entire dataset.
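To make the TSV option concrete, here is a minimal sketch of how records might be split into gzipped TSV parts before being published. Everything named here is an assumption for illustration: the `EXPORT_FIELDS` allowlist, the `export_tsv_chunks` helper, the part naming scheme, and the chunk size are all hypothetical, and the real field set and generation process are still undecided.

```python
import csv
import gzip
import io
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical allowlist of columns to publish; the real field set is undecided.
EXPORT_FIELDS = ["identifier", "title", "creator", "license", "url"]


def chunked(rows: Iterable[Dict[str, str]], size: int) -> Iterator[List[Dict[str, str]]]:
    """Group an iterable of rows into lists of at most `size` rows."""
    batch: List[Dict[str, str]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def export_tsv_chunks(
    rows: Iterable[Dict[str, str]],
    fields: List[str] = EXPORT_FIELDS,
    chunk_size: int = 100_000,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (file_name, gzipped_tsv_bytes) pairs, one per chunk of rows.

    Keys that are not in `fields` are dropped, so internal-only columns
    never reach the published files.
    """
    for index, batch in enumerate(chunked(rows, chunk_size)):
        text = io.StringIO(newline="")
        writer = csv.DictWriter(
            text,
            fieldnames=fields,
            delimiter="\t",
            extrasaction="ignore",  # silently skip keys outside the allowlist
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(batch)
        yield f"part-{index:05d}.tsv.gz", gzip.compress(text.getvalue().encode("utf-8"))
```

Each yielded pair could then be uploaded to a public bucket as a separate step. Splitting the export into fixed-size parts lets consumers download and process the dataset incrementally instead of fetching one very large file.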

Implementation

  • 🙋 I would be interested in implementing this feature.
@AetherUnbound AetherUnbound added 🌟 goal: addition Addition of new feature 🕹 aspect: interface Concerns end-users' experience with the software 🟩 priority: low Low priority and doesn't need to be rushed labels Sep 20, 2022
@MallikharjunaTeja

MallikharjunaTeja commented Oct 11, 2022

I want to work on this feature. @AetherUnbound @dhruvkb, could you assign this to me?

@krysal krysal added the 💬 talk: discussion Open for discussions and feedback label Oct 12, 2022
@AetherUnbound
Collaborator Author

AetherUnbound commented Oct 25, 2022

Hi @MallikharjunaTeja! Thanks for offering your assistance 🙂 Before work proceeds on this, we need a plan fleshed out for what these bulk downloads would look like:

  • How will the files be generated from our system?
  • Would there need to be coordination with the Openverse Catalog, since we would likely need a scheduled DAG in order to run this?
  • What fields and/or models would we include and exclude?

I think this project will ultimately need an RFC written for it; you can find instructions and examples here: https://github.com/WordPress/openverse/tree/main/rfcs.

The maintainers group currently doesn't have this slated for our near-term priorities, but if you would like to go ahead and give this a shot, please feel free! We're happy to assist you and answer any questions you might have, particularly over in the Make WP Slack #openverse channel. Please let me know if you'd like to take on this work and I'll assign the issue to you.

Alternatively, we have a large number of issues across our repos which are marked as "good first issues". These are issues we felt would be easy to jump into as a contributor. If you're looking to contribute to the project in general, I encourage you to take a look at the list here. We'd be happy to assign any one of those issues to you as well 😄

@obulat obulat transferred this issue from WordPress/openverse-api Feb 22, 2023
@obulat obulat added 🧱 stack: api Related to the Django API and removed 🧱 stack: backend labels Mar 20, 2023
@Skylion007

Interested in discussing this, even for a one-time export.

@zackkrida
Member

Closing this in favor of tracking via this project: #2545
