All of the steps outlined in the preprocessing plans section above have been completed. We did run into some issues while imputing: when we tried to compute the mean across all offerings of a particular class to fill a NaN value, we sometimes got another NaN because ALL other offerings of that class were also missing that feature. For classes with this problem, we instead used the mean of all classes in the same department. Even then, a few classes still had NaN values in some feature columns, so we fell back to assigning the mean of that column over the entire dataset. This has reduced the number of NaN values in our dataset to 0.
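A minimal sketch of this three-level fallback imputation is below. The DataFrame `df` and the grouping column names ("Course", "Department") are assumptions for illustration, not the exact names in our notebook.

```python
import pandas as pd

def impute_with_fallback(df: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    # Sketch of the fallback imputation described above; "Course" and
    # "Department" are assumed column names.
    df = df.copy()
    for col in feature_cols:
        # 1) Mean over all offerings of the same class
        df[col] = df[col].fillna(df.groupby("Course")[col].transform("mean"))
        # 2) If every offering of the class was NaN, use the department mean
        df[col] = df[col].fillna(df.groupby("Department")[col].transform("mean"))
        # 3) Last resort: the column-wide mean over the whole dataset
        df[col] = df[col].fillna(df[col].mean())
    return df
```

After this runs, `df[feature_cols].isna().sum()` should report zero missing values in every feature column.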
Our first model is a DNN with 5 hidden layers, all using the ReLU activation function and with a decreasing number of nodes in each layer. Since we are doing regression and need to predict a continuous value, the output layer has no activation function. We use the Adam optimizer and MSE as our loss function. Early stopping and checkpointing have also been set up for the model's fit function.
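A minimal sketch of this architecture in Keras is below. The exact layer widths, checkpoint file path, and callback settings are assumptions for illustration; only the overall structure (5 shrinking ReLU layers, linear output, Adam + MSE, early stopping and checkpointing) comes from the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def build_model(n_features: int) -> keras.Model:
    # 5 hidden ReLU layers with decreasing widths (widths are assumptions),
    # and a linear output node for the continuous GPA target.
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(32, activation="relu"),
        layers.Dense(16, activation="relu"),
        layers.Dense(8, activation="relu"),
        layers.Dense(4, activation="relu"),
        layers.Dense(1),  # no activation: raw continuous output
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

callbacks = [
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ModelCheckpoint("best_model.keras", monitor="val_loss", save_best_only=True),
]

# history = model.fit(X_train, y_train, validation_split=0.1,
#                     epochs=200, callbacks=callbacks)
```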
Our model's initial performance was quite good: it achieved an MSE loss of 0.006070451192523905 on the training data and 0.0059246622968575175 on the test data. Thus, it looks like our model performs very similarly on unseen test data as it does on the data it was trained on. In fact, the test MSE is slightly lower than the training MSE, which suggests the model does marginally better on the test set. We believe this is probably just due to luck (the seed that we used to split the data), and if we performed k-fold cross-validation we might see slightly different results. Ultimately, although the loss is not perfect (MSE is not 0) for either dataset, the values we do get are quite low, which suggests that our model would do a good job of predicting the average GPA received for a class given the values for "Total Enrolled in Course", "Percentage Recommended Professor", "Study Hours per Week", and "Average Grade Expected". These are quite promising results for our first model!
Plotting the training vs. validation loss reveals that our model mostly lies in the "good fit" region of the underfitting/overfitting graph. In our plot, the training loss keeps decreasing as the number of epochs increases, and the validation loss largely follows suit. While there is some divergence (evident in the peaks and valleys of the validation loss), it does not stray far from the training loss and shows the same general downward trend. Because the loss values are low and appear to stabilize at a low value for both the training and validation data, we are not underfitting; and because the validation loss never starts to consistently increase while the training loss keeps decreasing, overfitting does not appear to be occurring to a significant degree either. This places our model somewhere in the good fit zone.
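For reference, the plot can be produced directly from the training history; a minimal sketch is below, assuming `history` is the object returned by `model.fit(..., validation_split=...)` above.

```python
import matplotlib.pyplot as plt

# Training vs. validation loss curves from the Keras History object.
plt.plot(history.history["loss"], label="Training loss")
plt.plot(history.history["val_loss"], label="Validation loss")
plt.xlabel("Epoch")
plt.ylabel("MSE loss")
plt.title("Training vs. validation loss")
plt.legend()
plt.show()
```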
Per our milestone 1 submission, we aim to create 2 additional models from our dataset, both of which are also regression problems. The first alternative model attempts to predict how much students will enjoy a class (as measured by the percentage who recommended the class) given some of the other available features (i.e. "Total Enrolled in Course", "Percentage Recommended Professor", "Study Hours per Week", "Average Grade Expected", etc.), as this is one way to measure student enjoyment of a class. The other model we plan to develop will predict student engagement, specifically by estimating how many hours per week students would likely need to spend studying. Like the other model, it will use some of the other features available in our dataset as inputs. This time, we will also use the class department and upper/lower division classification, since we know that class difficulty and study hours can differ greatly across departments and classes.
Conclusion section: What is the conclusion of your 1st model? What can be done to possibly improve it?
Overall, our first model seems rather successful at predicting the average GPA received, as evidenced by the low loss of its outputs on the training, test, and validation data. However, we feel there is more work we can do. Here is a list of some of our plans to improve the model's performance:
- Hyperparameter tuning. Experiment with different numbers of nodes in each layer, activation functions, loss functions, optimizers, and learning rates. Since we have some time before the next milestone is due, a member of our group can tune the existing model to see whether any of these changes reduce our loss (a rough sweep is sketched after this list).
- Add extra features to the dataset. One of our teammates realized that the evaluation URLs in the original dataset take you to the CAPEs review for the associated observation, which contains additional data that might give our model more relevant information for making better predictions. They are currently working on a web-scraping script to extract that data so we can add it to our dataset.
- Run more epochs and increase the early stopping patience. The trend in the training loss as the number of epochs increases suggests that the model could be trained for more epochs to further reduce training loss. However, more work would be needed to verify that overfitting does not appear or worsen as a result.
- K-fold cross-validation. We can perform k-fold cross-validation to test our model's degree of overfitting. If the MSE varies greatly across folds, then our model may be overfitting, which could reduce its performance on unseen data (see the k-fold sketch after this list).
- Incorporating the discretized labels as input features. Currently our model is class agnostic, meaning it does not consider department or class number (upper/lower division classification) at all. We plan to incorporate these, since GPAs do differ from department to department (see the encoding sketch after this list).
- Changing the normalization type. Standardizing the input data may result in better performance. We will experiment with this for the next milestone (also covered in the encoding sketch after this list).
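As a starting point for the tuning item above, here is a rough grid-sweep sketch. The candidate widths, activations, and learning rates, as well as the `X_train`/`y_train` arrays, are assumptions for illustration only.

```python
from itertools import product
from tensorflow import keras
from tensorflow.keras import layers

def build_tunable_model(n_features, width, activation, learning_rate):
    # Five hidden layers with halving widths, mirroring the original design;
    # the specific hyperparameter grid below is an assumption.
    model = keras.Sequential([layers.Input(shape=(n_features,))])
    for i in range(5):
        model.add(layers.Dense(max(width // (2 ** i), 4), activation=activation))
    model.add(layers.Dense(1))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model

results = {}
for width, activation, lr in product([64, 128], ["relu", "tanh"], [1e-3, 1e-4]):
    model = build_tunable_model(X_train.shape[1], width, activation, lr)
    hist = model.fit(X_train, y_train, validation_split=0.1, epochs=50, verbose=0)
    results[(width, activation, lr)] = min(hist.history["val_loss"])

best_config = min(results, key=results.get)  # lowest validation MSE
```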
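For the k-fold item, a minimal sketch is below, assuming `X` and `y` are the preprocessed feature matrix and average-GPA target as NumPy arrays, and `build_model` is the constructor sketched earlier.

```python
import numpy as np
from sklearn.model_selection import KFold

# 5-fold cross-validation; a large spread in fold MSEs would hint at overfitting.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_mse = []
for train_idx, val_idx in kf.split(X):
    model = build_model(X.shape[1])
    model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
    fold_mse.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))

print("Fold MSEs:", np.round(fold_mse, 4), "std:", np.std(fold_mse))
```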
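Finally, the last two items (categorical class features and standardization) could be combined in one preprocessing step; a sketch is below. The categorical column names ("Department", "Course Level") are assumptions, while the numeric column names come from our dataset.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_cols = ["Total Enrolled in Course", "Percentage Recommended Professor",
                "Study Hours per Week", "Average Grade Expected"]
categorical_cols = ["Department", "Course Level"]  # e.g. upper vs. lower division

# Standardize numeric features and one-hot encode the class labels.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
```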