
Year Prediction Experiments


In order to develop a machine learning model that could accurately predict the year a song was released, we focused our efforts on a simplified dataset created from the Million Song Dataset: http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD#

The dataset comes with 91 columns: the year a song was released (the prediction target), 12 timbre averages, and 78 timbre covariances. There is no track id linking a training example to a particular song, but we could deconstruct the timbre and segment attributes provided in the Million Song Dataset to match these training examples with their songs-- future work and further analysis can focus on this area.

As described by the UCI website: "Features extracted from the 'timbre' features from The Echo Nest API. We take the average and covariance over all 'segments', each segment being described by a 12-dimensional timbre vector."

Experiments and Results

The following is a discussion and log of the year prediction experiments that we ran. Most of these jobs were either launched on the Condor distributed computing system or run directly from our own computers. The majority of these experiments used the Keras deep learning library. We decided to use Keras primarily because of the flexibility provided by the library, as well as the detailed metadata provided for every job-- this data was especially useful in our efforts to tune key hyperparameters.

K-Nearest Neighbors Classification: 100 Neighbors

The neural network models were not giving us great results, so we looked at the specific experiments run by researchers in the field. K-NN seemed to be a common approach, so we implemented K-NN with k = 100 as a way to observe how the decision boundaries might be drawn.

Initially, we noticed only a 5.7% accuracy on our test set, which is lower than the performance of our neural network models. This opened a new line of inquiry: just how far off are our predictions?

For every prediction and corresponding label, we calculated the absolute value of their difference (e.g., if the test label is 1988 and the prediction is 1995, the absolute difference is |1988 - 1995| = 7 years). This gave us a measure of how far off a given prediction was. Doing this for every prediction and averaging the resulting vector showed that the average prediction was off by only about 8 years. This was encouraging news: even if the data is perhaps not rich enough to fuel a model that predicts the exact year, the predictions land close to the true year. If we redefine our measure of success as predicting the right generation (a prediction within 10 years of the actual release year), our model has an accuracy of 76.42%. Future work can focus on more targeted parameter tuning and possible transformations of the data.
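The evaluation described above can be reproduced with a short scikit-learn script. The sketch below is hedged: the file name and the train/test split follow the UCI description of the dataset, while our actual experiments used smaller subsets (as noted in the individual experiments below).

```python
# Hedged sketch of the K-NN (k = 100) evaluation: exact-year accuracy,
# mean absolute error in years, and "within a decade" accuracy.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# YearPredictionMSD: column 0 is the year, columns 1-90 are timbre features.
data = np.loadtxt("YearPredictionMSD.txt", delimiter=",")
X, y = data[:, 1:], data[:, 0].astype(int)

# UCI's suggested split: first 463,715 rows for training, the rest for testing.
X_train, y_train = X[:463715], y[:463715]
X_test, y_test = X[463715:], y[463715:]

knn = KNeighborsClassifier(n_neighbors=100)
knn.fit(X_train, y_train)
preds = knn.predict(X_test)

exact_accuracy = np.mean(preds == y_test)              # exact-year accuracy
mean_years_off = np.mean(np.abs(preds - y_test))       # average years off
within_decade = np.mean(np.abs(preds - y_test) <= 10)  # "generation" accuracy
print(exact_accuracy, mean_years_off, within_decade)
```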

Parameter Tuning Using GridSearchCV

In order to tune our parameters, we used the GridSearchCV class provided by scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

This algorithm takes quite a bit of time to complete its parameter sweep, so instead of doing 10-fold cross-validation (as the literature suggests) we ran 3-fold cross-validation as a job on Condor. This experiment produced 10.3% accuracy on the test set, which was the highest exact accuracy that any of our neural net models achieved. Future work can look toward a more streamlined, more efficient parameter sweep given the context of a particular problem.
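A minimal illustration of the 3-fold GridSearchCV pattern is below, reusing the X_train/y_train split from the K-NN sketch above. The 50,000-example subset matches the grid-search job mentioned in the project notes later on this page; the estimator shown (scikit-learn's MLPClassifier) and the grid values are assumptions for illustration, not the exact sweep we ran.

```python
# Hedged sketch of a 3-fold grid search; estimator and grid are illustrative.
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(100,) * 4, (100,) * 10],
    "alpha": [1e-5, 1e-4, 1e-3],
    "learning_rate_init": [0.01, 0.001],
}

# Scaling the features first (see the scaling sketch later on this page)
# would likely help; it is omitted here to keep the sketch short.
search = GridSearchCV(MLPClassifier(max_iter=200), param_grid, cv=3, n_jobs=-1)
search.fit(X_train[:50000], y_train[:50000])
print(search.best_params_, search.best_score_)
print(search.score(X_test, y_test))  # exact-year accuracy on the test set
```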

Semi-Deep Model: 50 Hidden Layers

  • Hidden Layers: 50

  • Hidden units: 100 in each of the 50 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.00001 (same for all hidden layers)

  • Training: 150000 examples

  • Validation split: 0.1

  • Epochs: 400

  • batch_size: 15

  • Feature scaling: Yes

  • Optimizer: SGD

  • Learning rate: 0.01

  • Decay rate: 1e-6

  • Nesterov: True

Performance:

This model achieved 7.8% accuracy on the test set, which is emblematic of the accuracy levels we had achieved up to this point. From the model accuracy plot above, we can see that the test set accuracy follows the training set accuracy, although there is major oscillation, which might be an issue with our learning rate. The model loss plot shows that the training loss decreases as we might expect in a low-bias setting, but the test loss is much lower than the training loss; future work should investigate the cause of this.
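For reference, here is a minimal Keras sketch of this configuration (and, with different layer counts, learning rates, and batch sizes, of the other architectures on this page). This is a hedged reconstruction rather than the exact training script: the input dimensionality of 90 comes from the dataset, while the number of year classes and the one-hot encoding of the labels are assumptions.

```python
# Hedged sketch of the 50-hidden-layer model described above (Keras 2 API).
# Assumes 90 input features and year labels one-hot encoded into n_classes.
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.regularizers import l2

def build_model(n_hidden_layers=50, n_units=100, n_classes=90,
                learning_rate=0.01, reg=1e-5):
    model = Sequential()
    model.add(Dense(n_units, activation='relu', input_dim=90,
                    kernel_regularizer=l2(reg)))
    for _ in range(n_hidden_layers - 1):
        model.add(Dense(n_units, activation='relu', kernel_regularizer=l2(reg)))
    model.add(Dense(n_classes, activation='softmax'))
    sgd = SGD(lr=learning_rate, decay=1e-6, nesterov=True)
    model.compile(optimizer=sgd, loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
# history = model.fit(X_train_scaled, Y_train_onehot, validation_split=0.1,
#                     epochs=400, batch_size=15)   # one-hot labels assumed
```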

Extra-Deep Net: 250 Hidden Layers

  • Hidden Layers: 250

  • Hidden units: 100 in each of the 250 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.00001 (same for all hidden layers)

  • Training: 150000 examples

  • Validation split: 0.1

  • Epochs: 400

  • batch_size: 32

  • Feature scaling: Yes

  • Optimizer: SGD

  • Learning rate: 0.001

  • Decay rate: 1e-6

  • Nesterov: True

Performance:

This model achieved 6.5% accuracy on the test set, which is a bit of a decrease from the accuracies we had been getting up to this point. The accuracy plot for this experiment is rather odd, showing major oscillations between the 1st and 5th epochs, as well as between the 100th and 150th epochs. The model loss plot shows that the training and test losses follow an almost identical trend, which suggests that we are moving in the right direction in using a deeper neural net architecture.

Semi-Deep Net: 25 Hidden Layers

  • Hidden Layers: 25

  • Hidden units: 100 in each of the 25 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.00001 (same for all hidden layers)

  • Training: 150000 examples

  • Validation split: 0.1

  • Epochs: 200

  • batch_size: 32

  • Feature scaling: Yes

  • Optimizer: SGD

  • Learning rate: 0.001

  • Decay rate: 1e-6

  • Nesterov: True

Performance:

This model achieved 8.1% accuracy on the test set, which is a bit higher than the accuracies we had been observing up to this point. The model accuracy plot shows that the training and test sets again match each other in terms of overall trend. The test accuracy oscillates wildly between epochs, which may be a result of our learning rate being too large and causing the model to jump around local minima. The model loss plot is emblematic of the deep net models we have observed up to this point-- the training and test loss mirror each other.

Simple Model: 5 Hidden Layers

  • Hidden Layers: 5

  • Hidden units: 100 in each of the 5 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.00001 (same for all hidden layers)

  • Training: 10000 examples

  • Validation split: 0.1

  • Epochs: 200

  • batch_size: 32

  • Feature scaling: Yes

  • Optimizer: SGD

This model achieved 7% accuracy on the test set. This was one of the simpler models that we developed and tested, but it is still interesting to observe. In particular, the model accuracy plot told us that we had an overfitting problem. Since the model is clearly overfitting, the usual remedies apply: to improve performance, it might make sense to add more data, increase regularization, and tune the neural network architecture. The model loss plot highlights a similar trend: the loss on the test set steadily increases while the loss on the training set decreases. The large gap between the two can likely be closed by adding more data-- a trend observed in the experiments above.

Mini Deep Net: 10 Hidden Layers

  • Hidden Layers: 10

  • Hidden units: 100 in each of the 10 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.00001 (same for all hidden layers)

  • Training: 10000 examples

  • Validation split: 0.1

  • Epochs: 200

  • batch_size: 32

  • Feature scaling: Yes

  • Optimizer: SGD

Performance:

This model produced the highest accuracy thus far: 9.2% on the held-out set. The model accuracy plot shows that the training and test accuracies follow the same general trend, but the test set accuracy oscillates more wildly-- possibly a function of our learning rate. The model loss plot is consistent with the other, deeper networks we've developed thus far: the training and test losses decrease together.

Simple Net: 5 Hidden Layers, 30,000 examples, No scaling

  • Hidden Layers: 5

  • Hidden units: 100 in each of the 5 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.00001 (same for all hidden layers)

  • Training: 30000 examples

  • Validation split: 0.1

  • Epochs: 200

  • batch_size: 32

  • Feature scaling: No

  • Optimizer: SGD

Performance:

This experiment was meant to measure the effect of feature scaling on our models. It achieved 6.2% test set accuracy, which is a bit lower than the accuracies we had been observing up to this point. The model accuracy plot is similar to the one we observed for our simple model: the training set accuracy climbs steadily, while the test set accuracy remains relatively stagnant. The model loss plot shows that the gap between the test and training loss has decreased-- this may be a function of training on more data.
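For comparison, the feature scaling used in the other experiments can be sketched as follows. The page does not record which scaler we used; z-score standardization with scikit-learn's StandardScaler, fit on the training set only, is shown as a plausible stand-in. It reuses the split from the K-NN sketch above.

```python
# Hedged sketch of feature scaling (the specific scaler is an assumption).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```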

Deep Net: 200 Hidden Layers

  • Hidden Layers: 200

  • Hidden units: 10 in each of the 200 hidden layers

  • Activation function: ReLU for hidden layers, softmax for the output layer

  • Regularization: 0.0001 (same for all hidden layers)

  • Training: 100000 examples

  • Validation split: 0.1

  • Epochs: 100

  • batch_size: 15

  • Feature scaling: Yes

  • Optimizer: SGD

Performance:

This model achieved a 6.5% accuracy on the test set-- which is on the lower end of our other observations. The model accuracy plot shows the test set accuracy oscillating wildly, while the training accuracy remains relatively stagnant. The model loss plot shows that the test and training losses both decrease steadily, at an almost identical rate.

Support Vector Machines for Year Prediction

We implemented a support vector machine using scikit-learn's SVM library in order to produce year predictions. We also applied the GridSearchCV algorithm with 3-fold cross-validation to run a parameter sweep over C and gamma. These were part of our initial observations, so we did not read too deeply into our results here, but we did manage a maximum of 31% accuracy on our test set.
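A rough sketch of this setup is below. The RBF kernel, the grid values, and the training subset size are all assumptions for illustration; the exact sweep we ran is not recorded on this page. It reuses the scaled split from the sketches above.

```python
# Hedged sketch of the SVM experiment: RBF-kernel SVC with a 3-fold grid
# search over C and gamma. Grid values and subset size are illustrative only.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}

svm_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
svm_search.fit(X_train_scaled[:20000], y_train[:20000])  # subset keeps the sweep tractable
print(svm_search.best_params_)
print(svm_search.score(X_test_scaled, y_test))  # exact-year accuracy
```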

Project Notes and Updates

12/2/2017

In order to develop a more iterative pipeline for our work, we've started submitting jobs to the Condor computing system available within Middlebury's CS lab. We've begun testing various models on our data, covering the lyrics data as well as the year prediction data. We are continuing to iterate on the neural net model that we've developed for year prediction-- in particular, we are exploring libraries (such as Keras) that will allow us to iterate over our experiments more quickly and efficiently. We are also looking to iterate over our k-means clustering model more efficiently as a means of developing a solid understanding of how various aspects of our data are related. For these experiments, we have been testing code that we've written alongside existing libraries such as those in scikit-learn.

We currently have 2 jobs running on the Condor system, both of which are focused on year prediction. In particular, we are testing a large neural network architecture (20 layers with 1000 activation units each, plus a few smaller layers) with 100,000 training examples, as well as a grid search for a neural net with 50,000 training examples.

From running several experiments with the Keras framework, we've been able to conclude with a high degree of confidence that we have a high variance problem-- our models do not generalize well. Future experiments will build off of this understanding.

11/27/2017

We've begun running experiments in parallel-- in addition to building a neural net and an SVM to tackle year prediction, we are also working with lyrics available for a subset of the songs in the Million Song Dataset to create a clustering algorithm that can infer things like tempo, loudness, and other features from lyrics. We hope to extrapolate mood from these clusterings by analyzing the features of songs that end up in the same cluster.

11/25/2017

Ran a few more experiments testing the network architecture. I attempted to overfit deliberately (training and testing on the same data, with the regularization parameter alpha set to 0). I noticed that when trained and tested on a small amount of data (say, 10 examples), the model is able to overfit. Increasing the amount of data we train and test on decreases accuracy-- to counteract this, we add hidden nodes and layers to gain back some of that accuracy. In other words, more data tells us that we need more layers/hidden nodes. How can we tune this in an efficient manner, and why is this the behavior of the model so far?
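This kind of sanity check looks roughly like the sketch below. The use of scikit-learn's MLPClassifier is an assumption (it matches the alpha terminology in these notes), and the layer sizes and sample size are illustrative.

```python
# Hedged sketch of the overfit sanity check: train and evaluate on the same
# tiny sample with regularization disabled, expecting accuracy near 1.0.
from sklearn.neural_network import MLPClassifier

X_tiny, y_tiny = X_train[:10], y_train[:10]
sanity_net = MLPClassifier(hidden_layer_sizes=(100, 100, 100, 100),
                           alpha=0.0, max_iter=5000)
sanity_net.fit(X_tiny, y_tiny)
print(sanity_net.score(X_tiny, y_tiny))  # should be close to 1.0
```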

11/24/2017

We added a new subdirectory called experiment-notebooks/, which will serve as more detailed documentation of the experiments that we will be running. In addition to the code that we push directly, these notebooks will come with explanations of how we got a particular result. The hope is that this effort will make our work more transparent, easier to follow, and reproducible.

Also ran an SVM experiment that matches the neural net experiment. It uses the same data and the same training/test split, and it actually produces the same accuracy... which is interesting. We can look into this more; so far I've just been throwing some models at the data and seeing what happens. 7.5% accuracy gives us a lot of room to move up.

11/23/2017

Currently working on a crude neural net that can classify what year a song was released. This implementation is run on the simplified dataset focused on year classification (see the home page to download and use this dataset). Only getting 7.5% accuracy so far, with 10000 training examples and 1000 test examples. The network architecture has 4 hidden layers, each with 100 nodes, and alpha = 0.001.

We can likely tune these hyperparameters on a cross-validation set to see if we get an increase in performance. Neural nets perform better with more data, but using more data significantly slows down training, so there is a trade-off between how many examples we train on and how long the model takes to train. If we want to use a GPU-enabled system, we might need to look at other libraries for our implementation (see the resources & documentation page).

It would be helpful if we could explore the data a bit further and try other experiments with various algorithms-- if we could get started with clustering songs and identifying artists within clusters, that would be cool.