Add 'small' subset #13

Open
andimarafioti opened this issue Jun 27, 2019 · 4 comments

@andimarafioti

Hi! Thanks for the dataset. It would be useful for me if you provided a 'small' subset like FMA does (theirs is 8,000 tracks of 30 s, 8 balanced genres, GTZAN-like, 7.2 GiB). I know I could make a subset myself with the script cited in the README, but I would need to download 100x the amount of data I want and then process it. If you think it's worth it and are willing to host it, I can also make the subset myself and upload it somewhere. Thanks!

@dbogdanov
Member

Hi @andimarafioti, yes we are working on that ;-) Will update soon.

@philtgun added the enhancement (New feature or request) label on Jul 29, 2019
@abugler

abugler commented Feb 22, 2021

Hi! Is there an update on the small subset? Thank you so much.

@dbogdanov
Member

Note that we have included lower-bitrate mono audio downloads that significantly reduce the download size (full dataset: from 508 GB down to 156 GB). I assume this is still not small enough for a "small" dataset...

We lack a specific proposal for what the small subset should include. Should it cover all tags in MTG-Jamendo or a subset of tags?

Another alternative is to create a version of the full dataset with audio fragments instead of full tracks. Using 2-minute or 30-second fragments for each track would reduce the total duration from ~3778 hours to 1856.7 or 464 hours, respectively. The low-bitrate mono 30-second fragment version would take ~19 GB, which is very reasonable.
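
For reference, the fragment arithmetic above can be reproduced with a short sketch. It rests on two assumptions that are not stated in the thread: every track is at least as long as the fragment, and the download size scales linearly with duration (roughly constant bitrate). The track count is inferred from the quoted 464 hours and is not an official figure; the function and variable names are illustrative only.

```python
# Figures quoted in the comment above.
FULL_HOURS = 3778.0        # approximate total duration of the full dataset
LOW_BITRATE_GB = 156.0     # size of the low-bitrate mono download


def fragment_hours(n_tracks, fragment_seconds):
    """Total duration in hours if each track contributes one fixed-length fragment."""
    return n_tracks * fragment_seconds / 3600.0


def scaled_size_gb(hours, full_hours=FULL_HOURS, full_gb=LOW_BITRATE_GB):
    """Estimated size, assuming size is proportional to duration (assumption)."""
    return full_gb * hours / full_hours


# Track count implied by "464 hours of 30-second fragments" (an inference).
implied_tracks = round(464.0 * 3600 / 30)

hours_30s = fragment_hours(implied_tracks, 30)
hours_2min = fragment_hours(implied_tracks, 120)
size_30s_gb = scaled_size_gb(hours_30s)

print(f"implied tracks: {implied_tracks}")
print(f"30 s fragments: {hours_30s:.1f} h, about {size_30s_gb:.1f} GB at low bitrate")
print(f"2 min fragments: {hours_2min:.1f} h")
```

Under these assumptions the 30-second version comes out at roughly 19 GB, matching the estimate above. The 2-minute total lands slightly below the quoted 1856.7 hours, which is consistent with the 464-hour figure having been rounded.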

@dbogdanov
Member

Related to this, @philtgun has previously created subsets of MTG-Jamendo with one random track per artist (5 random trials) and one random track per album to look at the statistics (autotagging_toy_0..4 and autotagging_toy_album_0). Leaving this here for reference.
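
A minimal sketch of the "one random track per artist" idea is shown below. It is not the script used to produce the autotagging_toy files; the `(track_id, artist_id)` input format, the function name, and the sample IDs are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def one_track_per_artist(tracks, seed):
    """Pick one random track per artist.

    tracks: iterable of (track_id, artist_id) pairs (hypothetical format).
    seed:   integer seed, so that each trial is reproducible.
    Returns a dict mapping artist_id to the chosen track_id.
    """
    by_artist = defaultdict(list)
    for track_id, artist_id in tracks:
        by_artist[artist_id].append(track_id)

    rng = random.Random(seed)
    # Sort artists and their tracks so the result depends only on the seed,
    # not on the order of the input.
    return {
        artist: rng.choice(sorted(ids))
        for artist, ids in sorted(by_artist.items())
    }


# Made-up example data: three artists with 3, 2, and 1 tracks.
tracks = [
    ("track_01", "artist_a"), ("track_02", "artist_a"), ("track_03", "artist_a"),
    ("track_04", "artist_b"), ("track_05", "artist_b"),
    ("track_06", "artist_c"),
]

# Five independent trials, mirroring the "5 random trials" mentioned above.
trials = [one_track_per_artist(tracks, seed) for seed in range(5)]
for i, subset in enumerate(trials):
    print(i, subset)
```

The same function would work for a per-album subset by passing `(track_id, album_id)` pairs instead.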
