- Merge three dataframes into one by matching store id.
-
Deal with missing values
- Filling missing values of markdowns with 0 is better than deleting them.
0
is meaningful because it can presents there is no markdown on that date.
-
Deal with outliers
- Find outliers by Boxplot and IQR method.
- Deal with outliers of unemployment by replacing with nearest values.
- Map
Type
andIsHoiday
to numeric features. - Split Date into Year, Month and Day so that it will be numeric.
- Use Dept and Size create new meaningful features according to correlation anlysis:
Dept_num
: departments quantity of a store may affects sales.Size_per_dept
: average size of departments of a store may affects sales.
- Use pie charts to visualize the proportion of each Type, Year, Month, and IsHoliday.
- the distribution of Year and Month are balanced
- the distribution of Type and IsHoliday is not balanced, which may affect the model performance.
- Visualize sales varying with dates of a specific Dept of a specific Store by filtter of Power BI.
- sales data has a contrary relationship with temperature.
- sales data has not obvious relationship with fuel prices and unemployment.
- Size(0.24), Dept(0.15) and Type(-0.18) have stonger correation with Weekly_Sales.
- New feature Dept_num(0.16) and Size_per_dept (0.24) have stonger correation with Weekly_Sales.
- The data of
Type=0
andType=1
are similar, but they are different fromType=2
. We can try reclassifying them into two types.
- Use high correlation columns in part 1 correlation analysis
- Fix zero values of Weekly_Sales.
- Split dataset acoording to date. train:test = 4:1.
- Train LinearRegression, KNN, RandomForest, and get scores on test dataset.
- Compare MAE, MAPE, RMSE, R2 metrics.
- Performance of KNN and RandomForest are similar and they are vey higher than LinearRegresiion
- Use KNN and RandomForest to build a VotingRegressor which has a better performance.
- Using all columns to train have a better curve fitting
- RandomResizedCrop: and RandomHorizontalFlip are useful.
- RandomRotation, RandomVerticalFlip and Normalizea are unuseful.
- ResNet
- Classic CNN network.
- Very easy to achieve 90%+ score on this dataset.
- ViT
- Use transformer framework on vision tasks.
- Training more slowly than ResNet.
- Design unified pipelines to train model and test on test dataset.
- ResNet
- larger framework is not useful.
- Adam and StepLR is better.
- larger learning rate is batter (1e-3 > 1e-4).
- ViT
- Because of slow convergence speed, MultiStepLr can set a large learning rate in the beginning to accelarate training.
- Compare ResNet and ViT with following metrics:
- Precision
- Recall
- F1 score
- Overall Accuracy
- ROC curve
- AUC
- Confusion matrix
- Accuracy and AUC of Resnet are higher than ViT.
- Both of models are more likely to confuse airplane and dog.
- Using Ensemble Learning method, use Resnet and ViT vote for the pridiction result.
- Weights of them is Resnet:ViT = 3:1.