Home
A machine learning approach to the million song dataset by Aumit Leon and Mariana Echeverria
The data directory has nested subdirectories that act like volumes. If you go deep enough, you'll find the H5 files that correspond to each song.
The Million Song Dataset has data on 1,000,000 songs by 44,745 unique artists, along with user-supplied tags from the MusicBrainz website.
The data is provided in HDF5 format (https://support.hdfgroup.org/HDF5/whatishdf5.html).
HDF5 files are binary, so they can't be read directly as text. To extract data from the h5 files, use get_data.py.
The Million Song Dataset provides Python wrappers in hdf5_getters.py that can be used, while recursively looping through each subdirectory and h5 file, to extract specific features of the data.
get_data.py visits every subdirectory (starting from the path you give as indir) and creates a CSV of the data extracted from each h5 file. You don't need to put this script anywhere special; just be sure to give it a valid path for indir. The output.csv file is created in the same directory as the script, so be sure not to commit that CSV file to Git :)
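The directory walk described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of get_data.py: the names `find_h5_files`, `write_csv`, and `msd_extract` are hypothetical, and the four features pulled out in `msd_extract` are only an assumed example. `msd_extract` also assumes that hdf5_getters.py (and the PyTables library it depends on) is importable.

```python
import csv
import os


def find_h5_files(indir):
    """Yield the path of every .h5 file under indir, in a stable sorted order."""
    for root, dirs, files in os.walk(indir):
        dirs.sort()  # sorting in place makes the traversal order deterministic
        for name in sorted(files):
            if name.endswith(".h5"):
                yield os.path.join(root, name)


def write_csv(indir, out_path, extract, header):
    """Write one CSV row per .h5 file found under indir.

    `extract` is any callable that maps a file path to a list of values.
    Returns the number of data rows written.
    """
    count = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for path in find_h5_files(indir):
            writer.writerow(extract(path))
            count += 1
    return count


def msd_extract(path):
    """Example extractor (assumed feature choice) built on hdf5_getters.py."""
    import hdf5_getters  # assumption: the MSD helper module is on the Python path

    h5 = hdf5_getters.open_h5_file_read(path)
    try:
        return [
            hdf5_getters.get_artist_name(h5),
            hdf5_getters.get_title(h5),
            hdf5_getters.get_year(h5),
            hdf5_getters.get_tempo(h5),
        ]
    finally:
        h5.close()  # always release the file handle, even on error
```

With those pieces, something like `write_csv("path/to/data", "output.csv", msd_extract, ["artist", "title", "year", "tempo"])` would produce one row per song. Passing the extractor in as an argument keeps the directory walk separate from the choice of features, which is still an open item.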
The year prediction dataset is a simplified subset of the Million Song Dataset. It has 90 attributes (features): 12 timbre averages and 78 timbre covariances. The dataset is available at: http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
When you download the dataset, you get a large, comma-separated text file. Because the data is already comma-separated, to get it into a CSV you can open with Excel, cd into the directory where you downloaded the dataset and run the following: cat YearPredictionMSD.txt > yp.csv
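The downloaded file has no header row, so the columns show up unlabeled in a spreadsheet. As an optional alternative to the plain copy, here is a minimal sketch that writes the CSV with a header. It assumes the first value on each line is the release year (as described on the UCI page), followed by the 12 timbre averages and then the 78 timbre covariances. The column names and the function names `column_names` and `add_header` are made up for illustration.

```python
import csv


def column_names():
    """Return 91 names: the target year, 12 timbre averages, 78 timbre covariances."""
    avg = [f"timbre_avg_{i}" for i in range(1, 13)]
    cov = [f"timbre_cov_{i}" for i in range(1, 79)]
    return ["year"] + avg + cov


def add_header(src_path, dst_path):
    """Copy src_path to dst_path with a header row; return the number of data rows."""
    names = column_names()
    rows = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(names)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            if len(row) != len(names):
                raise ValueError(f"expected {len(names)} values, got {len(row)}")
            writer.writerow(row)
            rows += 1
    return rows
```

For example, `add_header("YearPredictionMSD.txt", "yp.csv")` would write the labeled file next to the download. The length check is there to fail loudly if the file layout differs from the assumption above.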
- Finish extracting the data, pick out what features we want to use
- Pick what aspect of the data we want to run experiments on
- Prototype some crude ML models!
The dataset uses the Echo Nest API to collect quantitative information about songs (danceability, tempo, loudness, segment analysis, etc.). The Echo Nest was acquired by Spotify and integrated into Spotify's APIs, and is still available for use by developers. To learn more, visit: https://developer.spotify.com/spotify-echo-nest-api/