- Missing rate:
- 40% of features in the dataset have missing values;
- 8 features have almost 50% missing values. To deal with this I will use a `SimpleImputer` transformer later during data processing (a minimal sketch follows below).
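A minimal sketch of that imputation step, assuming median imputation via scikit-learn's `SimpleImputer` (the strategy and toy column names are assumptions, not values from the pipeline):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for dataset.csv; column names are hypothetical.
df = pd.DataFrame({"num_a": [1.0, None, 3.0], "num_b": [None, 2.0, 2.0]})

# Median imputation for numerical features (the strategy is an assumption).
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)  # NaNs replaced by the per-column medians
```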
- The provided `dataset.csv` was split into train/test by the presence of values in the `default` target column.
- The dataset is drastically imbalanced: the target distribution is heavily skewed, and the minority class makes up only ~1.5% of the training dataset (see the sketch below).
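A sketch of both steps, assuming the raw file is `dataset.csv` and the target column is `default` as above:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")

# Rows with a known `default` value form the training set; rows where the
# target is missing are the ones to be scored.
train = df[df["default"].notna()]
test = df[df["default"].isna()]

# Class balance check: the minority class is ~1.5% of the training set.
print(train["default"].value_counts(normalize=True))
```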
- There are no significant correlations between individual features and the target (the highest correlation is 0.2, for `avg_payment_span_0_12m`). There is multicollinearity, however: some features are positively correlated with each other, as can be observed in the heatmaps in `Model.ipynb`. To deal with multicollinearity I will exclude correlated features when passing the feature set to linear models (see the sketch below).
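A sketch of that exclusion, assuming a pairwise-correlation cutoff (the 0.8 threshold is an assumption, not a value from `Model.ipynb`):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.8) -> list:
    """Return the column subset with highly correlated features removed."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each feature pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [c for c in df.columns if c not in to_drop]

toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]})
print(drop_correlated(toy))  # ['a', 'c'] -- 'b' duplicates 'a' and is dropped
```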
- Check the feature types and data format.
- Extract categorical and numerical features, as they require different preprocessing transformations. I also split categorical features into low-cardinality and high-cardinality groups: with relatively few data points it is bad practice to explode the dataset via one-hot encoding of high-cardinality categorical features, so I will apply target encoding to those features during data processing (see the sketch below).
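A sketch of that split; the cutoff of 10 unique values for "high cardinality" is an assumption, not a figure from the project:

```python
import pandas as pd

df = pd.read_csv("dataset.csv")

categorical = df.select_dtypes(include="object").columns
numerical = [c for c in df.select_dtypes(include="number").columns
             if c != "default"]

low_card = [c for c in categorical if df[c].nunique() <= 10]  # one-hot / ordinal
high_card = [c for c in categorical if df[c].nunique() > 10]  # target encoding
```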
- Check for outliers. Almost all numerical features have on average 5%-10% of outliers, detected by the interquartile range approach (boxplot; see the sketch below). Linear models are especially sensitive to outliers. As a future improvement, outlier removal could be incorporated into the data processing pipeline.
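A standalone sketch of the IQR (boxplot) rule behind that 5%-10% estimate:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])
print(iqr_outlier_mask(s).mean())  # share of outliers; here 0.125 (only 50)
```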
- Define, train and evaluate various models. The model base is implemented in the `BaselineModel` class. As a baseline model I chose a simple logistic regression, and I compare it with 3 other model types:
  - LogisticRegression without multicollinear features
  - RandomForestClassifier
  - LGBMClassifier
- For model validation and hyperparameter tuning I use `GridSearchCV` with `StratifiedKFold`. For every set of hyperparameters, training runs on part of the training dataset and evaluation on a hold-out fold. This prevents overfitting, searches for the best hyperparameters, and ensures unbiased validation.
- The training pipeline is built in `BaselineModel` and contains two steps:
  - preprocessing the input data (`StandardScaler`, `OrdinalEncoder`, `TargetEncoder`, `ColumnSelector`, `SimpleImputer`)
  - training the classifier

  `GridSearchCV` is fitted on the whole training pipeline to prevent data leakage (see the sketch below).
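A condensed sketch of this setup, with a numerical branch only for brevity (column names, grid values, and scoring are assumptions; the full pipeline in `BaselineModel` also wires encoder branches for categorical features):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data standing in for the real feature set.
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"num_a": rng.normal(size=200),
                        "num_b": rng.normal(size=200)})
y_train = pd.Series(rng.integers(0, 2, size=200))

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["num_a", "num_b"]),
])
pipe = Pipeline([("prep", preprocess),
                 ("clf", LogisticRegression(max_iter=1000))])

# The whole pipeline goes into GridSearchCV, so imputation and scaling are
# re-fitted inside every fold -- this is what prevents data leakage.
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="average_precision",
)
search.fit(X_train, y_train)
print(search.best_params_)
```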
- Since we have an imbalanced target variable, the evaluation metrics should be selected accordingly. For evaluation and comparison I will use the following metrics (a sketch of computing them follows below):
  - F1 score
  - Precision-recall curve
  - Average precision score, a numeric summary of the precision-recall curve
  - ROC AUC score
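A sketch of computing these metrics with scikit-learn; the labels and probabilities below are hypothetical stand-ins for hold-out predictions:

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_recall_curve, roc_auc_score)

y_true = [0, 0, 0, 0, 1, 0, 1, 0]                   # hold-out labels (hypothetical)
y_prob = [0.1, 0.2, 0.1, 0.6, 0.8, 0.3, 0.4, 0.2]   # predicted P(class=1)
y_pred = [int(p >= 0.5) for p in y_prob]            # hard labels for the F1 score

print("F1:", f1_score(y_true, y_pred))
print("Avg precision:", average_precision_score(y_true, y_prob))
print("ROC AUC:", roc_auc_score(y_true, y_prob))
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
```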
You can find a detailed model comparison in `Model.ipynb`.
Look at the impurity-based feature importances produced by the RandomForest estimator in `Model.ipynb`:
- `age` and `avg_payment_span_0_12m` are significantly important features.
- The Top-5 features carry most of the importance, and the importance of the remaining features decreases only gradually.
- `merchant_category`, a high-cardinality categorical feature encoded with the Target encoder, also plays a great role (a sketch of extracting the ranking follows below).
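A self-contained sketch of extracting such a ranking; the synthetic data and feature names are placeholders, and the real ranking comes from the fitted estimator in `Model.ipynb`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
cols = [f"feat_{i}" for i in range(6)]

rf = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, sorted to show the Top-5 features.
importances = pd.Series(rf.feature_importances_, index=cols)
print(importances.sort_values(ascending=False).head(5))
```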
- Train and evaluate deep learning classifiers if more data points can be provided.
- Test and evaluate different transformers in the preprocessing stage, especially for categorical variables.
- Implement outlier removal in the preprocessing pipeline.