Public dataset and model distribution based on DVC #5992
Replies: 10 comments 6 replies
-
This sounds like something that is just on the user to determine rather than DVC? Git doesn't prompt you to comply with a license on |
-
I think about a use case like Kaggle datasets, but for the command line. We can have something like When we What I offer is a simple solution. If there is no |
-
DVC doesn't host or distribute anything though, it's just tooling. I guess the line is blurred a bit when it comes to Studio, but it still seems to me like anything on the licensing/attribution side of things would be a Studio issue, and not a core DVC issue (similar to the difference between github/gitlab and git). |
-
Maintaining the legal requirements is certainly not DVC's responsibility, but providing a means for this may increase the adoption IMO. We can have a |
-
Agree that it could be a need, hypothetically. But I'd wait to see if any users or potential users express this first. |
-
This still sounds to me like a product that could potentially be built on top of (but separately from) DVC. Core DVC isn't a package manager and I don't think it should try to be one. |
-
DVC is not a package manager, but a dataset manager. DVC already has this functionality in the form of a shared remote. The one piece missing is associating metadata like description, license, etc. with
I doubt anybody will ask it first 😄 "We have a datasets collection in If another CLI tool will be used for this, it needs DVC's features. |
-
@iesahin I think @pmrowla is suggesting that this could be built on top of DVC but be considered a separate product. It seems like this issue is more of a feature request for a public dataset registry, with license confirmation being one of the requirements of that feature request. Would you agree @iesahin? Am I missing anything? |
-
My 2 cents on this. We had a few discussions around this with @dmpetrov and @efiop a few iterations back, and it is one of the topics where it's not that obvious how to draw the line. If we generalize it beyond dataset licenses, we can think about dataset descriptions, semantic versions, tags (to do provenance), etc. It usually all comes down to some attempt to include more meta-information about DVC-tracked artifacts in DVC files (plus, if needed, some modifications to the CLI to support it afterwards, but this can be considered optional, since the information can be consumed by other clients like Studio).
To summarize: I would split the idea of saving some additional info (like we did with descriptions) from the idea of acting on it (like we do with |
-
@iesahin it is a great idea, but it feels like license information belongs to a next-level abstraction: a packaging system (not Git, as @pmrowla mentioned). We had a discussion a long time ago about making a packaging system on top of DVC. This looks like an essential feature for such a system, but not for |
-
There are various licenses for downloadable datasets. dvc get and dvc import can check if there is a LICENSE file within a tracked directory and print this license, and ask the user for confirmation before download. This allows us to conform with attribution and copyright requirements in licenses like MIT or Apache.
For a Git repository directory in the form
we use
dvc get https://github.com/iterative/dataset-registry/fashion-mnist/raw.dvc
to get the dataset.
At this point, instead of directly downloading, DVC can check whether there is a LICENSE file in the directory fashion-mnist/ and present it to the user for confirmation. The same is applicable to dvc import.
I think this should be the default behavior and an option like --skip-license-confirmation is also needed for scripts.
This provides a basis to provide all public datasets with different license restrictions in a single dataset registry.
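To make the proposed flow concrete, here is a rough Python sketch of the check-and-confirm step. It is only an illustration, not DVC code: the function names find_license and confirm_license, the skip_confirmation parameter (standing in for the proposed --skip-license-confirmation option), and the assumption that the dataset directory is already available as a local path are all hypothetical. A real implementation would need to look the LICENSE file up in the remote Git repository before downloading anything.

```python
from pathlib import Path
from typing import Callable, Optional


def find_license(dataset_dir: Path) -> Optional[Path]:
    """Return the LICENSE file directly inside dataset_dir, or None if absent."""
    candidate = dataset_dir / "LICENSE"
    return candidate if candidate.is_file() else None


def confirm_license(
    dataset_dir: Path,
    skip_confirmation: bool = False,
    ask: Callable[[str], str] = input,
) -> bool:
    """Decide whether a download may proceed (hypothetical helper).

    Returns True when there is no LICENSE file, when confirmation is
    skipped (the scripted case), or when the user answers yes.
    """
    license_file = find_license(dataset_dir)
    if license_file is None or skip_confirmation:
        return True

    # Show the license text, then ask for explicit consent.
    print(license_file.read_text(encoding="utf-8"))
    answer = ask("Accept this license and continue? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

The ask parameter defaults to input, so interactive use prompts on the terminal, while scripts and tests can either pass skip_confirmation=True or inject their own answer function. Treating a missing LICENSE file as "proceed" mirrors the idea that the check should only apply when the dataset directory actually ships a license.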