The aim of this project is to train a deep learning model that classifies plant seedlings by species, using a supervised learning technique.
The dataset is available on Kaggle: https://www.kaggle.com/c/plant-seedlings-classification/data
The dataset is split into two groups: training data and testing data.
According to the graph above, Loose-Silky-Bent has the highest number of images, while Maize and Common Wheat have the lowest.
The training and testing images were pre-processed with OpenCV to extract the plant seedling and remove the background noise. The filter works on HSV values: pixels that fall within a green HSV range are retained and the image is converted back to RGB format, so only the green regions remain and all other colours are removed. A pre-processed image is shown below:
The training and testing images were then normalized by dividing by 255.0, which limits the pixel values to the range 0 to 1, and the labels were one-hot encoded.
A CNN is a good choice for image data. I designed the CNN architecture based on personal experience and knowledge and, most importantly, help from the machine learning community and forums. A lot of time went into tuning the model's hyper-parameters to achieve higher accuracy and lower loss during training, so that the model has a better chance of predicting unseen data correctly. Of course, there are plenty of other powerful CNNs available, such as AlexNet and ResNet, and those networks may also be suitable for this dataset.
The validation dataset is taken from the training dataset. For instance, 10% of the full training dataset is held out as validation data, and the remaining 90% is used as training data.
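The 90/10 split described above might look like the sketch below. It uses only NumPy so it stays self-contained; the function name `split_train_validation`, the `seed` parameter and the toy arrays are assumptions for illustration. scikit-learn's `train_test_split` offers the same kind of split and may be what a real pipeline would use.

```python
import numpy as np


def split_train_validation(images, labels, validation_fraction=0.1, seed=0):
    """Shuffle the samples and hold out a fraction of them for validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))
    n_val = int(round(len(images) * validation_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return images[train_idx], labels[train_idx], images[val_idx], labels[val_idx]


# Toy data: 100 "images" of one value each, with 4 made-up classes.
images = np.arange(100).reshape(100, 1)
labels = np.arange(100) % 4
x_train, y_train, x_val, y_val = split_train_validation(images, labels)
```

Shuffling before the split matters when the files are loaded class by class; otherwise the validation set could end up containing only one or two species.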
According to the confusion matrix, Sugar Beet and Black-Grass are clearly misclassified after the CNN model is fitted on the training and validation datasets. There are 10 samples of Sugar Beet misclassified as Black-Grass, and 7 samples of Black-Grass misclassified as Sugar Beet. This suggests that images of the two plants share similar features that confuse the CNN model. Possible solutions are to collect more data, apply alternative image processing techniques, use more data augmentation, or modify or replace the current CNN.
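As a rough illustration of how such confused pairs can be read off a confusion matrix, the sketch below builds one with NumPy and lists the largest off-diagonal entries. The helper names and the toy labels are assumptions: only the counts of 10 and 7 come from the text above, and the remaining numbers (and the third class) are made up so that the example runs.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count every (true, predicted) pair
    return cm


def most_confused_pairs(cm, class_names, top=2):
    """Return the largest off-diagonal entries as (true, predicted, count)."""
    off_diagonal = cm.copy()
    np.fill_diagonal(off_diagonal, 0)  # ignore correct predictions
    flat_order = np.argsort(off_diagonal, axis=None)[::-1][:top]
    pairs = []
    for flat_index in flat_order:
        t, p = np.unravel_index(flat_index, off_diagonal.shape)
        pairs.append((class_names[t], class_names[p], int(off_diagonal[t, p])))
    return pairs


class_names = ["Sugar Beet", "Black-Grass", "Maize"]
# Toy labels: 10 Sugar Beet predicted as Black-Grass and 7 Black-Grass
# predicted as Sugar Beet (from the text); all other counts are invented.
y_true = np.array([0] * 30 + [1] * 22 + [2] * 5)
y_pred = np.array([0] * 20 + [1] * 10 + [1] * 15 + [0] * 7 + [2] * 5)

cm = confusion_matrix(y_true, y_pred, num_classes=3)
pairs = most_confused_pairs(cm, class_names)
```

With real predictions, `y_true` and `y_pred` would come from taking the argmax of the one-hot validation labels and of the model's output probabilities.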
The graph above shows the loss and accuracy on the training and validation data as the model is fitted. The x-axis is the epoch: the loss decreases and the accuracy increases as the number of epochs grows. At the final epoch, the validation accuracy is greater than the training accuracy, which suggests that the model is not overfitting.
The picture above is my Kaggle competition result. The trained model predicted the unseen data and scored 0.91939 (about 92%) accuracy. The remaining 8% (100% - 92%) is probably made up of misclassified Sugar Beet and Black-Grass images, plus a small number of other misclassified plants.
Because the trained model cannot reliably tell Sugar Beet and Black-Grass apart, the misclassification directly affects accuracy when predicting unseen data. Possible solutions are to collect more data, apply alternative image processing techniques, use more data augmentation, or modify or replace the current CNN. This is left as future work.
This Kaggle competition was challenging, at least for me, because I only started studying deep learning a few months ago. I googled numerous blogs, forums, documentation pages and more to build an intuitive understanding of how deep learning works. Still, there is so much knowledge and so many techniques that it almost feels overwhelming. Anyway, it was a great experience to learn how to build a simple CNN and to gain some significant skills. Deep learning is quite hard to master, but it can solve real-world problems, which makes me feel excited and motivated!
Google Colab
- Keras 2.1.6
- Python 3
- OpenCV 3.4.3
- sklearn 0.19.2