YouTube-Mix dataset possibly leaks training data to validation and testing #47

Open
markschoene opened this issue Jan 18, 2024 · 0 comments

Comments

@markschoene

Hi, I became aware of the YouTube-Mix dataset, which is proposed in this work, via the following papers

In these works, 88% of the files are used for training and 6% each for validation and testing. At least in the current version of the referenced YouTube video, https://www.youtube.com/watch?v=EhO_MrRfftU, the audio consists of about 45 min of pieces that are repeated 6 times and cut off at 4 hours. The 88/6/6% split therefore yields validation and test sets whose content is completely contained in the training set.
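
To illustrate the overlap, here is a minimal sketch of the arithmetic. It assumes that the repeating block is exactly 45 min long, that the recording is exactly 4 hours, and that the 88/6/6% split is taken sequentially in time order. None of this is confirmed by the papers' code, and the names `PERIOD`, `TOTAL`, and `leaked_fraction` are hypothetical.

```python
# Assumed durations in seconds; the issue only gives approximate figures.
PERIOD = 45 * 60        # length of the block that repeats (about 45 min)
TOTAL = 4 * 60 * 60     # length of the full recording (cut at 4 hours)

# Assumed sequential 88/6/6 split boundaries, in seconds.
train_end = TOTAL * 88 // 100
val_end = TOTAL * 94 // 100


def leaked_fraction(start, end, train_end, period):
    """Fraction of seconds in [start, end) whose position inside the
    repeating block also occurs in the training range [0, train_end)."""
    train_phases = {t % period for t in range(train_end)}
    leaked = sum(1 for t in range(start, end) if t % period in train_phases)
    return leaked / (end - start)


val_leak = leaked_fraction(train_end, val_end, train_end, PERIOD)
test_leak = leaked_fraction(val_end, TOTAL, train_end, PERIOD)

print(f"validation seconds also present in training: {val_leak:.0%}")
print(f"test seconds also present in training: {test_leak:.0%}")
```

Under these assumptions both fractions are 100%, because the training range is longer than one 45 min block and so covers every position in it. With real file boundaries or a shuffled split, the exact numbers could differ.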

Since those works reference this repo, it would be valuable to make future researchers aware of this issue with the YouTube-Mix dataset proposed here.
