Project Proposal: Openverse Datasets #2637

Merged: 12 commits, Aug 17, 2023

@zackkrida zackkrida commented Jul 12, 2023

Fixes

Related to #2545

Description

This PR adds a project proposal for the Dataset project. I've tried to get this out quickly so it might be a bit rough. Suggestions are very welcome. I've asked @sarayourfriend (for providing past feedback on this initiative) and @AetherUnbound (for general data expertise) to review from the Openverse side.

I would also appreciate insights from @apolinario on the HuggingFace platform: how it relates to this project but also some of its general functionality which I touch on in the proposal.

Decision-making

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site.

Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

Will be resolved by 2023-07-20.

Testing Instructions

Read the document in GitHub's code view or the generated docs preview.

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@zackkrida zackkrida requested a review from a team as a code owner July 12, 2023 11:40
@zackkrida zackkrida requested review from AetherUnbound, stacimc and sarayourfriend and removed request for stacimc July 12, 2023 11:40
@github-actions github-actions bot added the 🧱 stack: documentation Related to Sphinx documentation label Jul 12, 2023
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Jul 12, 2023
@zackkrida zackkrida added 🟧 priority: high Stalls work on the project or its dependents 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🧭 project: proposal A proposal for a project and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Jul 12, 2023
@github-actions

github-actions bot commented Jul 12, 2023

Full-stack documentation: https://docs.openverse.org/_preview/2637

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@apolinario

apolinario commented Jul 12, 2023

Heya, I think the project proposal looks great @zackkrida! Thanks for putting it together! I made very minor comments across the doc

> how it relates to this project but also some of its general functionality which I touch on in the proposal.

I think the document gives a fair summary of the platform's capabilities relative to this project. I would just add that the streaming feature of datasets makes the data accessible to people who may not have enough storage to hold the entire data dump. That may be helpful as part of the democratisation of this data.

@Skylion007

> Heya, I think the project proposal looks great @zackkrida! Thanks for putting it together! I made very minor comments across the doc
>
> > how it relates to this project but also some of its general functionality which I touch on in the proposal.
>
> I think the document has a fair summary of the capabilities of the platform relative to this project. I would just add that the streaming feature of datasets allow for making it accessible to people that may not have access to storage that allows storing the entire dataset. May be helpful as part of the democratisation of this data

+1 on streaming the dataset. It also can allow people to easily and quickly generate various subsets of the data.
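
As a rough illustration of the streaming point, the sketch below reads rows lazily and keeps only a small license-based subset, so nothing close to the full dump has to be stored locally. The repository id `openverse/images`, the `train` split, and the `license` column are assumptions made for illustration; no such dataset has been published.

```python
from itertools import islice


def take_license_subset(rows, allowed_licenses, limit):
    """Lazily yield up to `limit` rows whose license is in `allowed_licenses`.

    `rows` can be any iterable of dicts, including a streamed dataset.
    """
    matching = (row for row in rows if row["license"] in allowed_licenses)
    return islice(matching, limit)


def stream_openverse(repo_id="openverse/images"):
    """Open a dataset in streaming mode (the repo id is hypothetical)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches rows on demand
    # instead of downloading every file up front.
    return load_dataset(repo_id, split="train", streaming=True)
```

Usage could then look like `take_license_subset(stream_openverse(), {"cc0"}, 1000)`, which stops reading once 1000 matching rows have been seen.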

Collaborator

@AetherUnbound AetherUnbound left a comment

This project plan looks excellent, thank you for drafting it Zack! My notes are merely surface level/wording, I'm aligned with everything else here 🙂

sarayourfriend
sarayourfriend previously approved these changes Jul 13, 2023
Collaborator

@sarayourfriend sarayourfriend left a comment

This looks great. I'd like clarification and/or additional information about the following points:

  • Licensors, from my perspective, are also stakeholders. Respecting their intentions and properly communicating the usage conditions is especially important for a project where every single work has an explicit license. Noting specific license elements like NC, ND, and SA that have known nuances in the distribution of the dataset feels important enough to list as a requirement.
  • On the other hand, "licensors" are not the only stakeholders from that perspective, and PDM works present a significant complication in this regard. This holds both from a regional legal perspective and from the perspective of how cultural artefacts are communicated: institutions distributing PDM-marked works based on those artefacts have obtained and distributed them without consulting or otherwise involving the cultures the artefacts were taken from. Openverse can't do much to fix the underlying problems, but we do need to take care to limit our liability in this regard, the same way we do in our general terms of service. Should the dataset have the same terms of service applied? A specific terms of service/disclaimer worded directly for the dataset would better protect the project, I imagine.
  • Does the first implementation plan also include the documentation updates you mentioned? Can that be listed as an explicit requirement? It is easy to miss something like that when writing an implementation plan that is sure to already be significant in other respects.

Anyway, everything sounds good to me. The rationale to use HuggingFace makes sense. My only concern moving forward is to ensure that we've covered our bases as far as liability to ourselves and have communicated as effectively as possible to users of the dataset their responsibility in using the dataset.

Nothing else to add on top of what others have shared.

@sarayourfriend sarayourfriend dismissed their stale review July 13, 2023 02:27

I didn't mean to approve. I think we can expedite this project proposal fairly easily but I do want clarification on my three points before approving.

@zackkrida
Member Author

zackkrida commented Jul 17, 2023

Converting this proposal to a draft while I move it into the Revision round. Feedback is still welcome!

@zackkrida zackkrida marked this pull request as draft July 17, 2023 14:47
@zackkrida
Member Author

I've addressed reviewer comments and this proposal is now ready for a decision.

Collaborator

@sarayourfriend sarayourfriend left a comment

LGTM! Shall we add a line item to the priorities meeting to discuss the implementation plans? How does this fall in line with the rest of our work? (we can discuss this on the project thread, if it's more involved)

@zackkrida
Member Author

I'm going to leave this open for a few more days with the goal of soliciting further community feedback.

@zackkrida
Member Author

I'll leave some scheduling thoughts on the project thread, @sarayourfriend

@zackkrida
Member Author

@apolinario yesterday I saw that the https://huggingface.co/meta-llama/Llama-2-7b model has an access flow which requires accepting terms and signing up through a Meta-controlled page: https://ai.meta.com/resources/models-and-libraries/llama-downloads/

I am curious: is this type of flow available for datasets as well? We wouldn't necessarily want folks to wait 1-2 days for access, but the idea of more explicitly enforcing our terms for dataset usage (respecting proper usage and whatnot) is appealing.

@sarayourfriend
Collaborator

sarayourfriend commented Jul 21, 2023

@zackkrida it is possible for datasets: https://huggingface.co/docs/hub/datasets-gated

Using the manual approvals, we could add a small interaction with our Django service (a Django-rendered page with the form, which gives more flexibility in presentation, etc.).

@apolinario

apolinario commented Jul 24, 2023

As @sarayourfriend noted, gating repos is totally available for datasets!

And it comes in two flavours. With manual approval, access is restricted to people who filled in the form and whose requests are approved manually (or that approval is automated via a Django service).

BUT if you are going to automate the approvals, the other mode of gated access (automatic approval) would make more sense imo. Basically, this requires people to read and accept the conditions before they can use the dataset, but everyone who does is accepted automatically and can use it.

It is what Mozilla uses for Common Voice (example here). I think that could make sense for this project as well.

@zackkrida
Member Author

Thanks, @apolinario! I am going to move this proposal back to draft while we make more efforts to consider our specific use conditions and ethical standards for the dataset, along with how they would relate to the "Gating" functionality.

@zackkrida zackkrida marked this pull request as draft July 24, 2023 17:53
@apolinario

apolinario commented Jul 24, 2023

Sounds good! Here's an idea that I had that I hope could help tackling a few challenges that may arise for use conditions & ethical standards:

Multiple-subsets

Instead of a single Openverse dataset (or two, one for visual media and one for audio), we could create multiple subsets based on license or license grouping, e.g. (these are not really name suggestions, just examples):

  • openverse/images-cc-by
  • openverse/images-cc-by-nc
  • openverse/images-cc0
    etc.

All in the same data format, but each could have its own dataset card and its own set of disclaimers and descriptions (and all under your terms of service disclaimers ofc). On the one hand, this could make using these datasets for downstream tasks a bit more convoluted/complex (as one now has to engage with, accept terms for, and process multiple datasets). On the other hand, it would make it very obvious what each dataset could be used for, and it could inform downstream users very specifically about what they are doing, as they would need to write code that looks like:

from datasets import load_dataset
dataset_cc_by = load_dataset("openverse/images-cc-by", token=True)
dataset_cc_by_nc = load_dataset("openverse/images-cc-by-nc", token=True)
# do downstream tasks processing both

That could make it pretty clear that they should not do that if they are looking into doing something commercial. Filtering a column by value (e.g. the license column) is easy enough with the HF datasets library, but maybe this is a way to make the distinction even more explicit, so that it is understood even from reading the code and not only from reading the model card.

This could also co-exist with gating, so each dataset repo could be gated. Btw the load_dataset function gives a 403 error if the dataset is gated and your HF user hasn't been granted access.
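
For comparison, the single-dataset alternative mentioned above (filtering on a license column) could look roughly like the sketch below. The repository id, the `train` split, the `license` column, and the license slugs are all assumptions for illustration, not a published schema.

```python
# Illustrative set of non-commercial license slugs (assumed values).
NON_COMMERCIAL = {"by-nc", "by-nc-sa", "by-nc-nd"}


def allows_commercial_use(license_slug):
    """Return True if the slug is not in the assumed non-commercial set."""
    return license_slug.lower() not in NON_COMMERCIAL


def load_commercial_subset(repo_id="openverse/images"):
    """Load one combined dataset and drop rows with NC licenses.

    The repo id is hypothetical; token=True reuses the locally stored
    Hugging Face token, which a gated repo would require.
    """
    # Imported lazily so the predicate above works without the library installed.
    from datasets import load_dataset

    dataset = load_dataset(repo_id, split="train", token=True)
    # Dataset.filter keeps only the rows for which the function returns True.
    return dataset.filter(lambda row: allows_commercial_use(row["license"]))
```

Here the license restriction lives in one predicate rather than in the repository names, so it is less visible when skimming the code; that is the trade-off the multiple-subsets idea tries to address.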

Member

@krysal krysal left a comment

I love the several good points captured here. Just restating what has been said, but it's an excellent proposal!

or made easier by the publication of the datasets. This could work in a few
ways. A community member, training a model using the Openverse dataset,
generates metadata that we want and planned to generate ourselves. Then, the
HuggingFace platform presents an alternative to other SaSS products we intended
Member

Is "SaSS" the short version of something? I can't find a different meaning other than the CSS extension language, SASS. Can we add the full form or the meaning in a note/footnote, maybe?

Member Author

@zackkrida zackkrida Jul 26, 2023

typo for SaaS (software as a service), I'll update to explain.

@AetherUnbound AetherUnbound marked this pull request as ready for review August 17, 2023 17:16
@AetherUnbound
Collaborator

Merging this for now so the document is in our documentation site, even though we are not planning on pursuing it at this time.

@AetherUnbound AetherUnbound merged commit d8574af into main Aug 17, 2023
48 checks passed
@AetherUnbound AetherUnbound deleted the dataset-project-proposal branch August 17, 2023 17:48
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟧 priority: high Stalls work on the project or its dependents 🧭 project: proposal A proposal for a project 🧱 stack: documentation Related to Sphinx documentation
Projects
Status: Accepted
Archived in project
7 participants