Author: Jozef Porubcin
Link to Google Drive: https://drive.google.com/drive/folders/1OdbK8DzVpr1IdABsBJGbgvoQci0r_zE7?usp=sharing
Link to Google Colab: https://colab.research.google.com/drive/1wFRrAmZO6DuUxexLTW4hSHNDk4qR7asq?usp=sharing
The .ipynb (linked above) can be run largely as is. The Google Drive link is provided in case it makes running the notebook easier, since the relative path to the fruits dataset directory and other file and directory paths may need to be edited first. Once it runs, the output at the very end of the notebook is a single fruit image selected from the dataset, shown with increasing levels of noise, together with the model's predictions. You can rerun the code to see a different fruit image and the corresponding predictions.
The training data subset contains 9195 images, the validation data subset 1149, and the testing data subset 1150. The original dataset had far too many classifications to train a model on in a reasonable amount of time, so a subset of 14 classifications was chosen. The number 14 was arbitrary, and the only criterion for ruling out a classification was its size: I selected 14 classifications with high sample counts relative to the rest of the dataset. To train the model fairly, an equal number of samples from each classification is required, so the size of the extracted subset equals the number of chosen classifications multiplied by the size of the smallest chosen classification. Each chosen classification had at least 821 samples, so the extracted subset contains 821 samples per classification, or 14 x 821 = 11494 images in total. I believed it would be better to train for depth rather than breadth, so I valued samples per classification over the number of classifications when selecting the data. That said, I still wanted enough classifications to lower the chance that the model could pick one at random and be correct.
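The selection rule described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `balanced_subset`, its parameters, and the choice of "the most frequent classifications" as the selection rule are assumptions made for the example.

```python
import random
from collections import Counter


def balanced_subset(labels, num_classes=14, seed=0):
    """Pick a class-balanced subset of sample indices.

    Keeps the `num_classes` most frequent labels and draws the same number
    of samples from each one: the size of the smallest kept class.
    Returns (sorted list of chosen indices, samples per class).
    """
    counts = Counter(labels)
    kept = [label for label, _ in counts.most_common(num_classes)]
    per_class = min(counts[label] for label in kept)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in kept:
        indices = [i for i, l in enumerate(labels) if l == label]
        chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen), per_class


# Toy example with four classes of different sizes; keep the largest three.
toy_labels = ["apple"] * 5 + ["pear"] * 3 + ["plum"] * 4 + ["kiwi"] * 1
subset, per_class = balanced_subset(toy_labels, num_classes=3)
print(per_class, len(subset))  # 3 9
```

With 14 classifications and a smallest kept classification of 821 samples, the same rule gives 14 x 821 = 11494 images.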
While 821 is a good number of samples per classification, I would have liked more so that the training data subset could be larger. Since I wanted a large training data subset, I dedicated 80% of the data to training. The validation and testing data subsets take 10% each to make up the remainder. These two subsets are large enough to reliably estimate the accuracy of the model on data it has never seen, but small enough to leave plenty of samples for training. Given that the testing data subset had 1150 images in total and the model correctly classified over 99% of them, I believe the 80/10/10 split was sufficient for testing the model's ability to generalize to new data.
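As a quick check of the arithmetic, here is a small sketch of an 80/10/10 split over 14 x 821 = 11494 samples. The helper `split_indices` and its rounding rule (round the training and validation sizes down, give the remainder to testing) are assumptions for illustration; the notebook may split differently, but this rule happens to reproduce the subset sizes reported above.

```python
import random


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle range(n) and cut it into train / validation / test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    n_train = int(n * train_frac)  # rounded down
    n_val = int(n * val_frac)      # rounded down
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test


train, val, test = split_indices(14 * 821)
print(len(train), len(val), len(test))  # 9195 1149 1150
```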
The exact classification accuracy varies with the amount of noise in the data. The model is trained and validated on images with no noise, but it can be tested on images with arbitrary amounts of noise. Noise is added to an image as follows. Each pixel of an input image has a chance of being modified. If a pixel is randomly chosen to be modified, its RGB values are replaced by random RGB values. The probability that a pixel is modified is chosen before the model receives the image and stays the same for every pixel in that image. With a 0% chance, the most recent evaluation of the model achieved an accuracy of 99.65%, meaning it correctly classified 1146 of the 1150 test images and misclassified 4.
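The noise procedure described above could look roughly like this in NumPy. It is a sketch under assumptions, not the notebook's implementation: the function name `add_noise`, the `(height, width, 3)` uint8 image layout, and the use of a uniform distribution over 0 to 255 for the replacement values are all choices made for this example.

```python
import numpy as np


def add_noise(image, probability, rng=None):
    """Replace each pixel with a random RGB value with the given probability.

    `image` is assumed to be a uint8 array of shape (height, width, 3).
    The input is left untouched; a noisy copy is returned.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = image.copy()

    # One independent draw per pixel; the probability is the same for all pixels.
    mask = rng.random(image.shape[:2]) < probability

    # Candidate replacement colors, uniform over 0..255 in each channel.
    random_pixels = rng.integers(0, 256, size=image.shape, dtype=np.uint8)
    noisy[mask] = random_pixels[mask]
    return noisy


# Example: the same (synthetic) image at noise levels 0%, 50% and 100%.
rng = np.random.default_rng(0)
image = np.full((100, 100, 3), 128, dtype=np.uint8)
for p in (0.0, 0.5, 1.0):
    noisy = add_noise(image, p, rng)
    changed = np.any(noisy != image, axis=-1).mean()
    print(f"p = {p:.2f}: fraction of pixels changed = {changed:.3f}")
```

The measured fraction can fall slightly below `p`, because a random replacement occasionally matches the original color.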
What's interesting is how the accuracy slopes downwards with respect to the amount of noise in the image. If the relationship were linear, one would expect a straight line from (0, 0.9965) to (1, 1/14), because with 0% noise (x = 0) we would expect the accuracy to be 99.65%, and with 100% noise (x = 1) we would expect the accuracy of a random guess among the 14 classifications (1/14). I hypothesized that the curve would not be linear, but would instead resemble a sigmoid function. That is, the curve should slope gently at first, decrease most rapidly at x = 0.5, and then level out again as x approaches 1. Since the domain of the sigmoid function is all real numbers rather than [0, 1], the curve I imagined is more accurately described as half a period of a cosine function: y = (0.5 * cos(PI * x) + 0.5) * (0.9965 - (1/14)) + (1/14) [https://www.desmos.com/calculator/ogpoy4aegf]. In the graph, the actual accuracy scores are plotted against the amount of noise in the image, from 0 to 1 in steps of 5%. As you can see, the accuracy does vaguely follow the cosine curve, although it seems to drop more rapidly when x > 0.3.
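For readers who want the two reference curves as numbers rather than as a graph, the sketch below tabulates the linear baseline and the hypothesized cosine curve at the same 5% steps. The constant and function names are made up for this example, and the measured accuracies are not reproduced here; only the formula and endpoints come from the text above.

```python
import math

CLEAN_ACCURACY = 0.9965  # accuracy at 0% noise
CHANCE = 1 / 14          # random guess among 14 classifications


def linear_baseline(x):
    """Straight line from (0, CLEAN_ACCURACY) to (1, CHANCE)."""
    return CLEAN_ACCURACY + (CHANCE - CLEAN_ACCURACY) * x


def cosine_curve(x):
    """y = (0.5 * cos(PI * x) + 0.5) * (0.9965 - 1/14) + 1/14."""
    return (0.5 * math.cos(math.pi * x) + 0.5) * (CLEAN_ACCURACY - CHANCE) + CHANCE


# Noise levels 0.00, 0.05, ..., 1.00
for step in range(21):
    x = step / 20
    print(f"{x:.2f}  linear = {linear_baseline(x):.4f}  cosine = {cosine_curve(x):.4f}")
```

Both curves share the same endpoints and cross at x = 0.5; the cosine curve lies above the line before that point and below it afterwards.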
The test accuracy scores are lower than the validation scores because the model is, in effect, also fit to the validation subset: its weights get updated based on whether the validation accuracy improved since the previous validation score. The training accuracy score is the highest because the weights are continually adjusted to increase training accuracy. Some ways to lower the error rate could be to increase training time (though this has already been explored, and the room left for improvement is very small) or to add layers.
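To make the validation effect concrete, here is one common way that validation accuracy ends up steering the weights: keeping whichever weights scored best on the validation subset. This is a generic sketch and an assumption about the kind of loop involved, not the notebook's training code; `fit_with_best_checkpoint`, `train_one_epoch`, and `evaluate` are hypothetical names.

```python
import copy


def fit_with_best_checkpoint(weights, train_one_epoch, evaluate, epochs):
    """Train for `epochs` rounds and keep the weights with the best validation score.

    `train_one_epoch(weights)` returns updated weights (hypothetical callable).
    `evaluate(weights)` returns validation accuracy (hypothetical callable).
    """
    best_weights = copy.deepcopy(weights)
    best_accuracy = evaluate(best_weights)

    for _ in range(epochs):
        weights = train_one_epoch(weights)
        accuracy = evaluate(weights)
        # The validation subset decides which weights survive, so the final
        # model is indirectly tuned to it.
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_weights = copy.deepcopy(weights)

    return best_weights, best_accuracy
```

Because the surviving weights are chosen for their validation score, that score tends to be slightly optimistic, which is why a separate testing subset is needed for an unbiased estimate.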