This project depends on the Million Song Dataset, available at: https://labrosa.ee.columbia.edu/millionsong/
Initial experiments are done with the smaller, experimental subset provided at: https://labrosa.ee.columbia.edu/millionsong/pages/getting-dataset
Contributors: Aumit Leon, Mariana Echeverria
To view our experiments and results, check out our wiki: https://github.com/AumitLeon/million-songs/wiki
Once you download the dataset, you'll notice that the file structure is as follows:
```
MillionSongSubset/
    AdditionalFiles/
    ...
    data/
        A/
            A/
            ...
            Z/
        B/
            A/
            ...
            I/
```
The data directory has subdirectories that act like volumes; if you go deep enough, you'll find the h5 files that correspond to each song.
The data is given to us in HDF5 format (https://support.hdfgroup.org/HDF5/whatishdf5.html).
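If you want to peek inside one of these files before writing any extraction code, here is a quick sketch (it assumes the `h5py` package is installed; the path is hypothetical, so substitute any h5 file from the `data/` tree):

```python
import h5py

# Open one song file read-only and print every group/dataset path inside it.
# The path below is hypothetical -- use any .h5 file from the data/ tree.
with h5py.File("MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5", "r") as f:
    f.visit(print)
```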
HDF5 files are binary files, so they are not very useful to us as they are given. In order to extract data from the h5 files, use `get_data.py`.
The Million Song Dataset provides Python wrappers within `hdf5_getters.py` that can be used to extract certain features from each h5 file; `get_data.py` uses them while recursively looping through each subdirectory.
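For a sense of what those wrappers look like, here is a minimal sketch that pulls a few fields from a single file (the path is hypothetical, and it assumes `hdf5_getters.py` is importable, e.g. sitting next to your script):

```python
import hdf5_getters

# Open one song file read-only (the path below is hypothetical).
h5 = hdf5_getters.open_h5_file_read("MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5")
try:
    title = hdf5_getters.get_title(h5)         # song title
    artist = hdf5_getters.get_artist_name(h5)  # artist name
    tempo = hdf5_getters.get_tempo(h5)         # estimated tempo in BPM
    duration = hdf5_getters.get_duration(h5)   # track length in seconds
    print(title, artist, tempo, duration)
finally:
    h5.close()
```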
`get_data.py` will visit every subdirectory (starting from the path you give as `indir`) and create a CSV of the data extracted from each h5 file. You don't need to put this script anywhere special; just be sure to provide a proper path for `indir`. The `output.csv` file will be created in the same directory as this Python script, so be sure not to commit that CSV file to Git :)
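For reference, here is a minimal sketch of the kind of traversal and CSV dump `get_data.py` performs (the exact fields and CSV layout in `get_data.py` may differ; the ones below are illustrative):

```python
import csv
import os

import hdf5_getters


def extract_all(indir, outfile="output.csv"):
    """Walk every subdirectory under indir and write one CSV row per h5 file."""
    with open(outfile, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "artist_name", "tempo", "duration"])
        # os.walk descends through the volume-like A/B/... subdirectories for us.
        for root, _dirs, files in os.walk(indir):
            for name in files:
                if not name.endswith(".h5"):
                    continue
                h5 = hdf5_getters.open_h5_file_read(os.path.join(root, name))
                try:
                    writer.writerow([
                        hdf5_getters.get_title(h5),
                        hdf5_getters.get_artist_name(h5),
                        hdf5_getters.get_tempo(h5),
                        hdf5_getters.get_duration(h5),
                    ])
                finally:
                    h5.close()


# Example: extract_all("MillionSongSubset/data")
```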
As far as I can tell, each h5 file corresponds to one song. That might not be true of every h5 file; maybe there is a way we can verify this?
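One way to check: `hdf5_getters.get_num_songs` reports how many songs a given file holds, so a quick scan can flag any file where that count isn't 1 (a sketch, assuming the subset lives at `MillionSongSubset/data`):

```python
import os

import hdf5_getters

# Flag any h5 file whose song count is not exactly 1.
for root, _dirs, files in os.walk("MillionSongSubset/data"):
    for name in files:
        if not name.endswith(".h5"):
            continue
        path = os.path.join(root, name)
        h5 = hdf5_getters.open_h5_file_read(path)
        try:
            n = hdf5_getters.get_num_songs(h5)
        finally:
            h5.close()
        if n != 1:
            print(f"{path} contains {n} songs")
```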