Detect Junk Datasets #29

Open
PGijsbers opened this issue Jul 25, 2024 · 0 comments

PGijsbers commented Jul 25, 2024

Heavily related to #16. For this item, we want a bot that can detect which datasets should be inspected for removal. A dataset should be slated for removal if it is clear that it was never intended to be shared publicly for use in ML experiments.

There are quite a few user-uploaded datasets on OpenML that should not be on the production server. This includes datasets uploaded to test the upload functionality, initial uploads that contained mistakes and were superseded by newer versions, and so on.


Besides a bad title and description, other indications may be: having no tasks, having only tasks without runs, or having a good title and description that duplicate those of an existing dataset. It may not always be obvious, and it's ok if the bot misses some of the poor-quality data. It is important that the bot has relatively high precision, as each flagged dataset will require a human to assess whether deactivation or deletion is warranted.
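
To make the indications above concrete, here is a minimal sketch of how they could be combined. Everything in it is an assumption for illustration: the `DatasetInfo` record and its field names are hypothetical (not the OpenML schema), the junk word list and the 20-character description threshold are arbitrary, and `known` is assumed to hold the lower-cased (name, description) pairs of the *other* datasets. Requiring at least two signals is just one possible way to favour precision over recall.

```python
from dataclasses import dataclass

# Hypothetical record; field names are illustrative, not the OpenML schema.
@dataclass
class DatasetInfo:
    name: str
    description: str
    n_tasks: int  # number of tasks defined on the dataset
    n_runs: int   # total number of runs across those tasks

JUNK_NAMES = {"test", "tmp", "asdf", "untitled"}  # assumed word list
MIN_DESCRIPTION_LENGTH = 20                       # assumed threshold

def junk_signals(ds, known):
    """Return the heuristic signals that fire for `ds`.

    `known` is a set of lower-cased (name, description) pairs taken
    from the other datasets on the server (excluding `ds` itself).
    """
    signals = []
    name = ds.name.strip().lower()
    description = ds.description.strip().lower()
    if name in JUNK_NAMES or len(description) < MIN_DESCRIPTION_LENGTH:
        signals.append("bad title or description")
    if ds.n_tasks == 0:
        signals.append("no tasks")
    elif ds.n_runs == 0:
        signals.append("only tasks without runs")
    if (name, description) in known:
        signals.append("duplicate title and description")
    return signals

def should_flag(ds, known, min_signals=2):
    # Requiring several independent signals trades recall for precision,
    # so fewer flagged datasets turn out to be false alarms.
    return len(junk_signals(ds, known)) >= min_signals
```

For example, `should_flag(DatasetInfo("test", "", 0, 0), set())` is true because two signals fire (bad title or description, no tasks), while a well-described dataset with tasks and runs is left alone.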

This is also true for studies.

Besides flagging the dataset, the bot should be able to generate a small report explaining why the dataset may be considered for removal.
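
As a rough idea of what such a report could look like, the sketch below formats a list of reasons into a short plain-text message. The function name `build_report`, its parameters, and the wording of the output are all hypothetical; the actual report format is still open.

```python
def build_report(dataset_id, name, reasons):
    """Format a short plain-text report for a human reviewer.

    `reasons` is a list of human-readable strings, one per heuristic
    that fired for the dataset (hypothetical input format).
    """
    lines = [f"Dataset {dataset_id} ({name!r}) flagged for review."]
    lines.append("Reasons:")
    lines.extend(f"  - {reason}" for reason in reasons)
    lines.append("A maintainer should confirm before deactivation or deletion.")
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_report(123, "test", ["no tasks", "bad title or description"]))
```

Keeping the reasons as a plain list makes it easy for the reviewer to see which indications fired and to disagree with any one of them.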

@PGijsbers PGijsbers added this to the Metadata Quality milestone Jul 25, 2024
@LiinXemmon LiinXemmon self-assigned this Aug 2, 2024

2 participants