MatthewMallory/auditory_deep_learning

This repository aims to extend novel deep learning approaches to music genre classification. The end goal is to use the trained model to organize a user's saved music library into discrete clusters. From there, the method can serve as a content-based music recommendation system, in contrast to collaborative filtering approaches.

Network Architecture

The neural network architecture developed here was strongly influenced by the work of Liu et al., who presented a convolutional architecture combining dense connectivity and inception blocks as a bottom-up broadcast neural network. This architecture takes multi-scale time-frequency information into consideration, creating semantic features that the decision layer uses to discriminate the genre of an unknown music clip. The network architecture is shown below (Liu et al.).

Figure: Bottom-up broadcast network architecture (Liu et al.)
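
To make the idea concrete, here is a minimal, hypothetical sketch of one densely connected inception-style block in Keras. The filter counts, kernel sizes, and the use of Keras itself are assumptions for illustration, not the exact configuration from Liu et al. or this repository.

```python
from tensorflow import keras
from tensorflow.keras import layers

def broadcast_block(x, filters=32):
    """Hypothetical inception-style block with dense connectivity."""
    # Inception-style branches capture multi-scale time-frequency patterns.
    b1 = layers.Conv2D(filters, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    b5 = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(filters, 1, padding="same", activation="relu")(bp)
    out = layers.Concatenate()([b1, b3, b5, bp])
    # Dense connectivity: concatenate the block input onto its output so
    # later layers see features from every preceding scale.
    return layers.Concatenate()([x, out])
```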

Training Dataset

The GTZAN dataset was used for model training and evaluation. It consists of 1000 audio clips of 30 seconds each, evenly distributed across 10 genres (classical, jazz, blues, metal, pop, rock, country, disco, hip-hop, and reggae). The clips were transformed into mel-spectrograms (generate_spectrograms.py) by applying a logarithmic scale to the frequency axis of the short-time Fourier transform. Librosa was used to extract mel-spectrograms with 128 mel filters (bands) covering the frequency range 0-22050 Hz, with a frame length of 2048 and a hop size of 1024. The result is a 647 x 128 image.
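
A minimal sketch of this extraction step with librosa is shown below; the function name and exact arguments are illustrative (the repository's implementation lives in generate_spectrograms.py).

```python
import librosa
import numpy as np

def audio_to_mel(path, sr=22050, n_mels=128, n_fft=2048, hop_length=1024):
    """Load a clip and compute its log-scaled mel-spectrogram."""
    y, _ = librosa.load(path, sr=sr, duration=30.0)
    # 128 mel bands with a 2048-sample frame length and 1024-sample hop size.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, n_fft=n_fft, hop_length=hop_length)
    # Convert power to decibels (logarithmic amplitude scaling).
    return librosa.power_to_db(mel, ref=np.max)  # shape: (128, ~647)
```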

Given the considerably small dataset, data augmentation was used (generate_data_augmentations.py), as described by Le et al. These augmentations modify the mel-spectrogram directly, rather than modifying the audio itself (e.g. pitch shifting). The figure below shows an example of these augmentation results. The AudioAugmentation notebook in this repository hosts the code for these augmentations and contains audio playback widgets for hearing the effects of mel-spectrogram augmentation.

Figure: Example mel-spectrogram augmentations
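
As one illustrative example of spectrogram-domain augmentation, the sketch below applies random frequency and time masking directly to a mel-spectrogram. This is only one possible augmentation; the actual set used in generate_data_augmentations.py may differ, and all parameter values here are assumptions.

```python
import numpy as np

def mask_spectrogram(mel, max_freq_bands=16, max_time_frames=64, rng=None):
    """Silence random frequency bands and time frames in a mel-spectrogram."""
    rng = rng or np.random.default_rng()
    out = mel.copy()
    fill = out.min()  # quietest value in the (dB-scaled) spectrogram
    # Frequency masking: silence a random band of mel filters.
    f = rng.integers(1, max_freq_bands + 1)
    f0 = rng.integers(0, out.shape[0] - f)
    out[f0:f0 + f, :] = fill
    # Time masking: silence a random run of frames.
    t = rng.integers(1, max_time_frames + 1)
    t0 = rng.integers(0, out.shape[1] - t)
    out[:, t0:t0 + t] = fill
    return out
```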

Performance Evaluation

The model was trained to minimize the categorical cross-entropy between the predictions and the ground-truth genre labels using the Adam optimizer. A batch size of 8 was used for 100 epochs of training. The initial learning rate was set to 0.01 and was automatically decreased by a factor of 0.5 whenever the loss stopped improving for 3 epochs. The training, testing, and validation sets were randomly partitioned in 80/10/10 proportions. Data was hosted on AWS and training was carried out on an EC2 cluster (using train_model.py). The figures below show the training/testing curves and the validation confusion matrix.

Figure: Training and loss curves

Figure: Validation confusion matrix
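
For reference, here is a minimal sketch of the training configuration described above, assuming a Keras model; the actual setup lives in train_model.py, and the function signature here is illustrative.

```python
from tensorflow import keras

def train(model, x_train, y_train, x_val, y_val):
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=0.01),
        loss="categorical_crossentropy",
        metrics=["accuracy"])
    # Halve the learning rate once the loss stops improving for 3 epochs.
    reduce_lr = keras.callbacks.ReduceLROnPlateau(
        monitor="loss", factor=0.5, patience=3)
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        batch_size=8, epochs=100, callbacks=[reduce_lr])
```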
