Credit Scoring Toolkit

In finance, it is common practice to build risk scorecards to assess the creditworthiness of a given customer. Unfortunately, off-the-shelf credit scoring tools are quite expensive and scattered; that's why we created this toolkit: to empower all credit scoring practitioners and to spread the use of Weight of Evidence based scoring techniques to alternative use cases (virtually any binary classification problem).


Explore the documentation»
Report Bug

Request Feature

Table of Contents
  1. About The Project
    1. DiscreteNormalizer
    2. Discretizer
    3. WoeEncoder
    4. WoeBaseFeatureSelector
    5. WoeContinuousFeatureSelector
    6. WoeDiscreteFeatureSelector
    7. CreditScoring
    8. Built With
  2. Installation
  3. Usage
  4. Contributing
  5. License
  6. Contact
  7. Citing
  8. Acknowledgments

About The Project

The general process for creating Weight of Evidence based scorecards is illustrated in the figure below:

(Figure: the Weight of Evidence scorecard development process)

To that end, we implemented the following classes to address the necessary steps of the credit scoring transformation:

DiscreteNormalizer

Class for normalizing discrete data given a relative frequency threshold.

Discretizer

Class for discretizing continuous data into bins using several methods.
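
The underlying binning strategies can be pictured with plain pandas and scikit-learn. The sketch below is an illustration only (hypothetical data and bin counts, not the Discretizer API), contrasting equal-width, equal-frequency, and K-Means based bins:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical continuous feature, for illustration only
x = pd.Series(np.random.default_rng(0).lognormal(mean=10, sigma=0.4, size=1000),
              name='C_AMT_CREDIT')

# Equal-width bins ('uniform' strategy)
uniform_bins = pd.cut(x, bins=4)

# Equal-frequency bins ('quantile' strategy)
quantile_bins = pd.qcut(x, q=4, duplicates='drop')

# K-Means based bins ('kmeans' strategy): cluster the values, then cut halfway between sorted centers
centers = np.sort(KMeans(n_clusters=4, n_init=10, random_state=0)
                  .fit(x.to_frame()).cluster_centers_.ravel())
edges = np.concatenate(([x.min()], (centers[:-1] + centers[1:]) / 2, [x.max()]))
kmeans_bins = pd.cut(x, bins=edges, include_lowest=True)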

WoeEncoder

Class for encoding discrete features with the Weight of Evidence (WoE) transformation.
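
Conceptually, the WoE of a category compares its share of non-events with its share of events. A minimal pandas sketch of that computation (illustrative only; the sign convention and edge-case handling may differ from WoeEncoder):

import numpy as np
import pandas as pd

def woe_by_category(x: pd.Series, y: pd.Series) -> pd.Series:
    """WoE per category: ln(% of non-events / % of events). Assumes y is coded 0/1."""
    counts = pd.crosstab(x, y)                    # columns 0 (non-event) and 1 (event)
    dist_non_event = counts[0] / counts[0].sum()
    dist_event = counts[1] / counts[1].sum()
    # Note: categories with zero counts in either class need smoothing in practice
    return np.log(dist_non_event / dist_event)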

WoeBaseFeatureSelector

Base class for selecting features based on their WoE transformation and Information Value statistic.
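
The Information Value behind the selection weighs the gap between the two class distributions by the WoE of each category (values above roughly 0.02 are commonly treated as at least weakly predictive). A self-contained sketch, again purely illustrative and not the selector's internal code:

import numpy as np
import pandas as pd

def information_value(x: pd.Series, y: pd.Series) -> float:
    """IV = sum over categories of (% non-events - % events) * WoE. Assumes y is coded 0/1."""
    counts = pd.crosstab(x, y)                    # columns 0 (non-event) and 1 (event)
    dist_non_event = counts[0] / counts[0].sum()
    dist_event = counts[1] / counts[1].sum()
    woe = np.log(dist_non_event / dist_event)     # zero cells need smoothing in practice
    return float(((dist_non_event - dist_event) * woe).sum())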

WoeContinuousFeatureSelector

Class for selecting continuous features based on their WoE transformation and Information Value statistic.

WoeDiscreteFeatureSelector

Class for selecting discrete features based on their WoE transformation and Information Value statistic.

CreditScoring

Implements credit risk scorecards following the methodology proposed in Siddiqi, N. (2012). Credit risk scorecards: developing and implementing intelligent credit scoring (Vol. 3). John Wiley & Sons.
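
At a high level, the Siddiqi point allocation rescales each attribute's contribution to the model's log-odds onto a points scale. The sketch below uses assumed scaling parameters (PDO, base score, base odds) and is not necessarily how CreditScoring parameterizes the scorecard; sign conventions may also differ:

import numpy as np

# Hypothetical scaling parameters (CreditScoring may use different defaults)
pdo, base_score, base_odds = 20, 600, 50   # 600 points at 50:1 good:bad odds, 20 points to double the odds

factor = pdo / np.log(2)
offset = base_score - factor * np.log(base_odds)

def attribute_points(woe: float, coef: float, intercept: float, n_features: int) -> float:
    """Points contributed by one attribute of one feature, Siddiqi-style allocation.
    The minus sign reflects that the logistic model predicts the log-odds of the event (bad)."""
    return offset / n_features - factor * (woe * coef + intercept / n_features)

# e.g. an attribute with WoE -0.40 on a feature with coefficient 0.85,
# model intercept -2.1, and a 6-feature scorecard
pts = attribute_points(-0.40, 0.85, -2.1, 6)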

Built With

(back to top)

Installation

You can simply install the module using pip:

pip install woe-credit-scoring

(back to top)

Usage

Dependencies

import pandas as pd 
from CreditScoringToolkit.frequency_table import frequency_table
from CreditScoringToolkit.DiscreteNormalizer import DiscreteNormalizer
from CreditScoringToolkit.WoeEncoder import WoeEncoder
from CreditScoringToolkit.WoeContinuousFeatureSelector import WoeContinuousFeatureSelector
from CreditScoringToolkit.WoeDiscreteFeatureSelector import WoeDiscreteFeatureSelector
from CreditScoringToolkit.CreditScoring import CreditScoring
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

Reading example data

#  Read example data for train and validation (loan applications)
train = pd.read_csv('train.csv')
valid = pd.read_csv('valid.csv')   

Defining feature type

#  Assign feature lists by type; column names use the "C_" prefix for continuous features and "D_" for discrete ones.
vard = [v for v in train.columns if v[:2]=='D_']
varc = [v for v in train.columns if v[:2]=='C_']

Normalize Discrete Features

#  In this example, we aggregate categories with less than 10% relative frequency
#  into a new category called 'SMALL CATEGORIES'; if the newly created category doesn't reach
#  the given relative frequency threshold (10%), then the most frequent category is imputed.
#  All missing values are treated as the separate category MISSING

dn = DiscreteNormalizer(normalization_threshold=0.1,default_category='SMALL CATEGORIES')
dn.fit(train[vard])
Xt = dn.transform(train[vard])
frequency_table(Xt,vard)
****Frequency Table  D_OCCUPATION_TYPE  ***


                  Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
Laborers                 166       0.166               166             0.166
MISSING                  325       0.325               491             0.491
SMALL CATEGORIES         395       0.395               886             0.886
Sales staff              114       0.114              1000             1.000




****Frequency Table  D_NAME_CONTRACT_TYPE  ***


                 Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
Cash loans              897       0.897               897             0.897
Revolving loans         103       0.103              1000             1.000




****Frequency Table  D_CODE_GENDER  ***


   Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
F         659       0.659               659             0.659
M         341       0.341              1000             1.000




****Frequency Table  D_FLAG_OWN_CAR  ***


   Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
N         668       0.668               668             0.668
Y         332       0.332              1000             1.000




****Frequency Table  D_FLAG_OWN_REALTY  ***


   Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
N         287       0.287               287             0.287
Y         713       0.713              1000             1.000




****Frequency Table  D_NAME_INCOME_TYPE  ***


                      Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
Commercial associate         225       0.225               225             0.225
Pensioner                    179       0.179               404             0.404
Working                      596       0.596              1000             1.000




****Frequency Table  D_NAME_EDUCATION_TYPE  ***


                               Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
Higher education                      243       0.243               243             0.243
Secondary / secondary special         757       0.757              1000             1.000




****Frequency Table  D_NAME_FAMILY_STATUS  ***


                      Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
Civil marriage               102       0.102               102             0.102
Married                      620       0.620               722             0.722
SMALL CATEGORIES             117       0.117               839             0.839
Single / not married         161       0.161              1000             1.000




****Frequency Table  D_NAME_HOUSING_TYPE  ***


                   Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
House / apartment         878       0.878               878             0.878
SMALL CATEGORIES          122       0.122              1000             1.000




****Frequency Table  D_FLAG_PHONE  ***


   Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
0         721       0.721               721             0.721
1         279       0.279              1000             1.000




****Frequency Table  D_WEEKDAY_APPR_PROCESS_START  ***


           Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
FRIDAY            145       0.145               145             0.145
MONDAY            170       0.170               315             0.315
SATURDAY          106       0.106               421             0.421
THURSDAY          235       0.235               656             0.656
TUESDAY           177       0.177               833             0.833
WEDNESDAY         167       0.167              1000             1.000




****Frequency Table  D_NAME_TYPE_SUITE  ***


               Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
Family                126       0.126               126             0.126
Unaccompanied         874       0.874              1000             1.000




****Frequency Table  D_HOUSETYPE_MODE  ***


                Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
MISSING                490        0.49               490              0.49
block of flats         510        0.51              1000              1.00




****Frequency Table  D_WALLSMATERIAL_MODE  ***


              Abs. Freq.  Rel. Freq.  Cumm. Abs. Freq.  Cumm. Rel. Freq.
MISSING              560       0.560               560             0.560
Panel                227       0.227               787             0.787
Stone, brick         213       0.213              1000             1.000

Check that the normalization process didn't produce unary features

unary = [v for v in vard if Xt[v].nunique()==1]
unary
[]

WoE Based Best Feature Selection

#   Now we proceed with feature selection; there are dedicated classes for each type of feature (discrete, continuous).
#   The discrete feature selector uses the given iv_threshold to keep only the best features.
#   For the continuous feature selector, a variety of methods are available for selecting the best features, namely:
#    -uniform: only uses equal-width discretized bins, selects the number of bins with the best IV value.
#    -quantile: only uses equal-frequency discretized bins, selects the number of bins with the best IV value.
#    -kmeans: only uses discretized bins created by K-Means clustering, selects the number of bins with the best IV value.
#    -gaussian: only uses discretized bins created by a Gaussian Mixture, selects the number of bins with the best IV value.
#    -dcc: stands for Discrete Competitive Combination; creates segments for all individual methods and then
#          selects the best method and its corresponding best number of bins for each feature.
#    -dec: stands for Discrete Exhaustive Combination; creates segments for all individual methods and then
#          selects the best number of bins for each feature, including every feasible method.
#
#   One can configure the IV threshold, the minimum/maximum number of discretization bins, whether or not to keep only
#   strictly monotonic segments, and the number of pooling threads used to speed up computations.


Xt = pd.concat([Xt,train[varc]],axis=1)  # Merge the continuous features matrix with the normalized discrete predictors matrix

wcf = WoeContinuousFeatureSelector()
wdf = WoeDiscreteFeatureSelector()

#  Perform feature selection
wcf.fit(Xt[varc],train['TARGET'],
        max_bins=6,
        strictly_monotonic=True,
        iv_threshold=0.05,
        method='dcc',
        n_threads=20)

wdf.fit(Xt[vard],train['TARGET'],iv_threshold=0.1)

#  Create new matrix with discrete and discretized best features 
Xt = pd.concat([wdf.transform(Xt[vard]),wcf.transform(Xt[varc])],axis=1)

features = list(Xt.columns)

#  Print selection results
print("Best continuous features: ", wcf.selected_features)
print("Best discrete features: ",wdf.selected_features)
print("Best Features selected: ",features)
Best continuous features:  [{'feature': 'disc_C_AMT_CREDIT_4_kmeans', 'iv': 0.08944178361036469, 'root_feature': 'C_AMT_CREDIT', 'nbins': '4', 'method': 'kmeans'}, {'feature': 'disc_C_AMT_GOODS_PRICE_3_quantile', 'iv': 0.09335492422758512, 'root_feature': 'C_AMT_GOODS_PRICE', 'nbins': '3', 'method': 'quantile'}, {'feature': 'disc_C_AMT_INCOME_TOTAL_3_gaussian', 'iv': 0.05045866117534799, 'root_feature': 'C_AMT_INCOME_TOTAL', 'nbins': '3', 'method': 'gaussian'}, {'feature': 'disc_C_OWN_CAR_AGE_3_kmeans', 'iv': 0.13493592841524896, 'root_feature': 'C_OWN_CAR_AGE', 'nbins': '3', 'method': 'kmeans'}, {'feature': 'disc_C_TOTALAREA_MODE_3_quantile', 'iv': 0.1259702243075047, 'root_feature': 'C_TOTALAREA_MODE', 'nbins': '3', 'method': 'quantile'}]
Best discrete features:  {'D_CODE_GENDER': 0.10698023203218116}
Best Features selected:  ['D_CODE_GENDER', 'disc_C_AMT_CREDIT_4_kmeans', 'disc_C_AMT_GOODS_PRICE_3_quantile', 'disc_C_AMT_INCOME_TOTAL_3_gaussian', 'disc_C_OWN_CAR_AGE_3_kmeans', 'disc_C_TOTALAREA_MODE_3_quantile']
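
For reference, the strictly_monotonic option keeps a candidate binning only when the WoE trend across the ordered bins never changes direction. A rough illustration of such a check (not the selector's internal code):

import pandas as pd

def is_strictly_monotonic(woe_by_bin: pd.Series) -> bool:
    """True if WoE strictly increases or strictly decreases across ordered bins."""
    diffs = woe_by_bin.diff().dropna()
    return bool((diffs > 0).all() or (diffs < 0).all())

# e.g. hypothetical WoE values for 4 ordered bins of a discretized feature
is_strictly_monotonic(pd.Series([-0.35, -0.10, 0.12, 0.40]))   # True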

WoE Transformation

#  Weight of Evidence Transformation
we = WoeEncoder()
we.fit(Xt[features],train['TARGET'])
Xwt = we.transform(Xt[features])
Xwt.head()
D_CODE_GENDER disc_C_AMT_CREDIT_4_kmeans disc_C_AMT_GOODS_PRICE_3_quantile disc_C_AMT_INCOME_TOTAL_3_gaussian disc_C_OWN_CAR_AGE_3_kmeans disc_C_TOTALAREA_MODE_3_quantile
0 -0.397247 0.092313 0.185600 -0.095193 -0.099648 -0.230676
1 0.271721 0.092313 -0.309368 -0.095193 -0.099648 -0.035932
2 0.271721 -0.303484 -0.309368 -0.095193 -0.099648 -0.126945
3 0.271721 -0.303484 -0.309368 -0.095193 -0.099648 -0.126945
4 0.271721 0.092313 0.185600 -0.095193 -0.099648 -0.230676

Logistic Regression Parameter Learning

lr = LogisticRegression()
lr.fit(Xwt,train['TARGET'])
print("AUC for training: ",roc_auc_score(y_score=lr.predict_proba(Xwt)[:,1],y_true=train['TARGET']))
AUC for training:  0.6938732132419364

Scoring

#  In order to perform the scoring transformation, we need the WoE encoded data,
#  the fitted WoeEncoder object, and the fitted logistic regression object
#  to produce a nicely formatted scorecard
cs = CreditScoring()
cs.fit(Xwt,we,lr)
cs.scorecard
feature                             attribute                   points
D_CODE_GENDER                       F                               63
                                    M                               42
disc_C_AMT_CREDIT_4_kmeans          (1381796.72, 2370559.5]         48
                                    (456025.115, 869398.644]        49
                                    (47969.999, 456025.115]         57
                                    (869398.644, 1381796.72]        67
disc_C_AMT_GOODS_PRICE_3_quantile   (276000.0, 630000.0]            49
                                    (44999.999, 276000.0]           58
                                    (630000.0, 2254500.0]           59
                                    MISSING                         12
disc_C_AMT_INCOME_TOTAL_3_gaussian  (173250.0, 337500.0]            62
                                    (28403.999, 173250.0]           53
                                    (337500.0, 810000.0]            43
disc_C_OWN_CAR_AGE_3_kmeans         (-0.001, 12.582]                74
                                    (12.582, 41.679]                43
                                    (41.679, 65.0]                  41
                                    MISSING                         52
disc_C_TOTALAREA_MODE_3_quantile    (-0.001, 0.053]                 49
                                    (0.053, 0.111]                  54
                                    (0.111, 0.661]                  76
                                    MISSING                         52

Validation

Model Generalization

#  Applying all transformations to the validation data is now easy and straightforward;
#  we can compute the AUC to check for model overfitting
Xv = pd.concat([wdf.transform(dn.transform(valid[vard])),wcf.transform(valid[varc])],axis=1)
Xwv = we.transform(Xv)
print("AUC for validation: ",roc_auc_score(y_score=lr.predict_proba(Xwv)[:,1],y_true=valid['TARGET']))
AUC for validation:  0.6971411870981454

Scoring Distributions

#  We can check the score transformation distributions for training and validation
score = pd.concat([pd.concat([cs.transform(we.inverse_transform(Xwv))[['score']].assign(sample='validation'),valid['TARGET']],axis=1),
pd.concat([cs.transform(we.inverse_transform(Xwt))[['score']].assign(sample='train'),train['TARGET']],axis=1)
                  ],ignore_index=True)

for s,d in score.groupby('sample'):
    plt.figure()
    plt.title(s)
    sns.histplot(d['score'],legend=True,fill=True,bins=8)

(Histograms of the score distribution for the train and validation samples)

Event rates

#   Finally, we can observe that the greater the score, the lower the probability of being a
#   bad customer (label=1) for both samples. All of the transformation complexity is absorbed by the fitted objects.
score['score_range'] = pd.cut(score['score'],bins=6,include_lowest=True).astype(str)
for s,d in score.groupby('sample'):
    aux = d.pivot_table(index='TARGET',
                        columns='score_range',
                        values='score',
                        aggfunc='count',
                        fill_value=0)
    aux/=aux.sum()
    aux = aux.T
    plt.figure()
    ax = aux.plot(kind='bar',stacked=True,color=['purple','black'])
    plt.title(s)
(Stacked bar charts of the event rate by score range for the train and validation samples)

(back to top)

Contributing

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".

Don't forget to give the project a star! Thanks again!

  1. Fork the Project

  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)

  3. Commit your Changes (git commit -m 'Add some AmazingFeature')

  4. Push to the Branch (git push origin feature/AmazingFeature)

  5. Open a Pull Request

(back to top)

License

Distributed under the GNU General Public License v3.0. See LICENSE for more information.

(back to top)

Contact

José G Fuentes - @jgusteacher - [email protected]

Project Link: https://github.com/JGFuentesC/woe_credit_scoring

(back to top)

Citing

If you use this software in scientific publications, we would appreciate citations to the following paper:

Combination of Unsupervised Discretization Methods for Credit Risk. José G. Fuentes Cabrera, Hugo A. Pérez Vicente, Sebastián Maldonado, Jonás Velasco.

(back to top)

Acknowledgments

(back to top)
