
Predict_Credit_Risk

Built a machine learning model for a company named LendingClub to predict whether a loan will become high risk. LendingClub is a peer-to-peer lending service that allows individual investors to partially fund personal loans, and to buy and sell notes backing those loans on a secondary market. LendingClub provides its historical data through an API. We used this data to build machine learning models that classify the risk level of loans, comparing a Logistic Regression model and a Random Forest Classifier using sklearn and pandas.

Machine Learning Steps

In the given 2019 and 2020 Q1 data frames, not all of the column values were numeric. Non-numeric columns lead to issues because machine learning models require numeric values. In the training data frame, we identified seven columns that contain non-numeric values.

2019 Train Table

2020 Q1 Test Table

Both data frames have over fifty columns. The following code will help identify the columns containing object values, which are the same as string values.
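A minimal sketch of that step, assuming the 2019 and 2020 Q1 data frames have already been loaded as train_df and test_df (hypothetical names):

import pandas as pd

# Hypothetical names: train_df is the 2019 data, test_df is the 2020 Q1 data.
# select_dtypes(include='object') keeps only the string (object) columns,
# so listing those columns shows which ones still need to be converted.
print(train_df.select_dtypes(include='object').columns.tolist())
print(test_df.select_dtypes(include='object').columns.tolist())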

To convert the non-numeric columns, apply the get_dummies() method.

pd.get_dummies()

Here is an example of how pd.get_dummies() works.

Get_Dummies() Part 1

The following data frame, named preview_get_dummies, displays two non-numeric columns.

Then apply the get_dummies() code: pd.get_dummies(preview_get_dummies)

Get_Dummies() Part 2

Now the preview_get_dummies data frame has only numeric columns, which means it meets the requirements of machine learning models.
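As a self-contained illustration (with made-up column names rather than the actual LendingClub fields), this is roughly what the two screenshots show:

import pandas as pd

# A small preview frame with two non-numeric (object) columns.
preview_get_dummies = pd.DataFrame({
    'home_ownership': ['RENT', 'OWN', 'MORTGAGE'],
    'verification_status': ['Verified', 'Not Verified', 'Verified']
})

# get_dummies() replaces each object column with one 0/1 indicator column per category.
print(pd.get_dummies(preview_get_dummies))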

After converting the non-numeric columns into numeric columns, the data frame needs to be split into an X value (the features) and a y value (the target), as every supervised machine learning model requires. Both values come from the cleaned-up data frames. The y value holds the 'loan_status' labels. The X value holds the rest of the data frame with numeric values only; it must drop the 'loan_status' column, or the model will fail.

Y_values List

X values List
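A minimal sketch of this split, assuming the loaded data frames are named train_df and test_df (hypothetical) and that get_dummies() is applied to the feature columns only:

# y holds the 'loan_status' labels; X holds every other column, converted to numeric.
y_train = train_df['loan_status']
X_train = pd.get_dummies(train_df.drop(columns='loan_status'))

y_test = test_df['loan_status']
X_test = pd.get_dummies(test_df.drop(columns='loan_status'))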

Once the X and y values are assigned for train and test, the X data frames must have the same shape. If the train and test shapes are not equal, the whole model fails. Here the test data frame had 91 columns while the train data frame had 92 columns, so the two data frames could not be used for the models as-is. The fix is to run a for loop over the train columns: any column not found in the test data frame is added to it and filled with the numeric value 0. The image below displays the code.

Code Fill Missing Columns
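The loop described above could look roughly like this (a sketch; the exact code is in the screenshot):

# Any column present in the train features but missing from the test features
# is added to the test frame and filled with zeros, so both frames end up
# with the same number of columns.
for col in X_train.columns:
    if col not in X_test.columns:
        X_test[col] = 0

# Keep the test columns in the same order as the train columns.
X_test = X_test[X_train.columns]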

LogisticRegression() and RandomForestClassifier() without StandardScaler

The logistic regression model needs four values: two for fit and two for score. The fit values are the training set, and the score values are the test set. The solver is set to 'lbfgs' and random_state to 1. The random state seeds the random number generator so the results are reproducible; if random_state is not set, each run can produce a different outcome. LBFGS stands for "Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm" and is one of the solvers available in the Scikit-Learn library.

Logistic Regression Model Code

After creating the logistic regression model, the training values are passed to the fit method. Fitting is the same as training: the solver searches for the coefficients that best fit the equation defined by the algorithm being used. Once trained, the model can be used with the predict method.

The code after the fit method is the scoring method. The score method takes a feature matrix X_test and the expected target values y_test; it predicts on X_test, compares the predictions with y_test, and returns the accuracy. For a classifier, score returns the mean accuracy (a regression estimator would instead return the R squared score). The accuracy outcome is 0.5168013611229264. An accuracy of about 0.52 is poor; the model is barely better than guessing.
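A minimal sketch of the fit-and-score steps, using the X_train, y_train, X_test, and y_test values assumed in the previous section:

from sklearn.linear_model import LogisticRegression

# random_state=1 seeds the random number generator; 'lbfgs' is the solver.
classifier = LogisticRegression(solver='lbfgs', random_state=1)
classifier.fit(X_train, y_train)           # training step
print(classifier.score(X_test, y_test))    # mean accuracy on the test set (about 0.52 here)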

Random Forest Classifier Code

The random forest classifier is another model; it builds many decision trees on random selections of data samples, gets a prediction from every tree, and selects the most promising answer by majority vote. It also uses the fit and score methods. The outcome of the score is 0.6424925563589962. The random forest classifier's accuracy is higher than the logistic regression model's, but still weak.
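The random forest follows the same fit-and-score pattern; a sketch under the same assumptions:

from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(random_state=1)
rf_classifier.fit(X_train, y_train)
print(rf_classifier.score(X_test, y_test))    # mean accuracy on the test set (about 0.64 here)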

LogisticRegression() and RandomForestClassifier() with StandardScaler

How does the Standard scaler work and what is it doing to the data?

StandardScaler uses the formula z = (x - u) / s, where z is the scaled value, x is the original value, u is the mean of the training samples, and s is their standard deviation. The mean is the sum of all data points divided by the number of data points. The standard deviation is found by subtracting the mean from each value, squaring each difference, averaging the squared differences (that average is the variance), and taking the square root of the variance.

Example of Standard Scaler

To demonstrate how the algorithm functions, consider the data set {1, 2, 3, 4, 5}. It consists of five one-dimensional data points, each with a single feature. Applying StandardScaler() to the data turns it into {−1.41, −0.71, 0, 0.71, 1.41}.

This is how the math works behind the Standard Scaler.
 The steps to apply StandardScaler to the data set [1, 2, 3, 4, 5]:

  1. Calculate the mean of the feature vector X:
     mean = (1 + 2 + 3 + 4 + 5) / 5 = 3

  2. Calculate the standard deviation of the feature vector X:
     std_dev = sqrt( ((1 - 3)^2 + (2 - 3)^2 + (3 - 3)^2 + (4 - 3)^2 + (5 - 3)^2) / 5 ) = 1.4142

  3. Subtract the mean from each element of X:
     X - mean = [1 - 3, 2 - 3, 3 - 3, 4 - 3, 5 - 3] = [-2, -1, 0, 1, 2]

  4. Divide each element by the standard deviation:
     Z = [-2 / 1.4142, -1 / 1.4142, 0 / 1.4142, 1 / 1.4142, 2 / 1.4142] = [-1.4142, -0.7071, 0, 0.7071, 1.4142]

So the standardized feature vector Z is [-1.4142, -0.7071, 0, 0.7071, 1.4142].

Note that when applying StandardScaler to a data set, the fit method is used to calculate the mean and standard deviation, and the transform method is used to apply the scaling to the data. In this case, since we are fitting and transforming the same data, we can use the fit_transform method to combine these two steps:

from sklearn.preprocessing import StandardScaler
import numpy as np

# One feature with five samples; reshape(-1, 1) turns the 1-D array into a single column.
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)

# fit_transform learns the mean and standard deviation, then scales the data in one step.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
print(X_std)

This will produce the output:

[[-1.41421356]
[-0.70710678]
[ 0. ]
[ 0.70710678]
[ 1.41421356]]

The general formulation below shows what StandardScaler does: it takes all of the data points and converts them to a common scale centered at 0 with a standard deviation of 1. Scaling keeps the features on comparable scales rather than leaving large gaps between them, and it reduces the influence of outliers caused by very different feature ranges.

Given a feature vector X of size n:
  X = [x1, x2, ..., xn]

The StandardScaler calculates the standardized feature vector Z as follows:

  1. Calculate the mean of the feature vector X:
     mean = (x1 + x2 + ... + xn) / n

  2. Calculate the standard deviation of the feature vector X:
     std_dev = sqrt( ((x1 - mean)^2 + (x2 - mean)^2 + ... + (xn - mean)^2) / n )

  3. Subtract the mean from each element of X:
     X - mean = [x1 - mean, x2 - mean, ..., xn - mean]

  4. Divide each element by the standard deviation:
     Z = [ (x1 - mean) / std_dev, (x2 - mean) / std_dev, ..., (xn - mean) / std_dev ]

The resulting standardized feature vector Z has a mean of zero and a standard deviation of one, which is useful for ensuring that different features are on the same scale and for improving the performance of certain machine learning algorithms.

Standard Scaler Code

To get better accuracy from our models, apply the scaler. The point of StandardScaler is to transform the data to have a mean of 0 and a standard deviation of 1. Applying StandardScaler should improve the accuracy of the LogisticRegression() and RandomForestClassifier() models.
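A sketch of how the scaler is applied to the train and test features, continuing with the variable names assumed earlier:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only, then apply the same
# transformation to both the train and the test features.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)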

Logistic Regression Model with Scaler

The LogisticRegression() model uses the same code except that X_train and X_test are replaced by X_train_scaled and X_test_scaled. After replacing the train and test values, the accuracy improved from 0.5168013611229264 to 0.767333049766057, an increase of about 25 percentage points.

Random Forest Classifier with Scaler

The RandomForestClassifier() model uses the same code except that X_train and X_test are replaced by X_train_scaled and X_test_scaled. After replacing the train and test values, the accuracy decreased from 0.6424925563589962 to 0.6339855380689069. The StandardScaler did not improve the RandomForestClassifier.

LogisticRegression() and RandomForestClassifier() Comparison

In this specific case, the standard scaler improves the logistic regression model, raising its accuracy from 0.5168013611229264 to 0.767333049766057. Scaling helps here because the raw features have outliers and large gaps between data points, and those outliers and gaps decrease the accuracy of the model. However, applying the standard scaler does not always help a model's accuracy.

  StandardScaler can potentially improve or weaken the accuracy of a model, depending on the nature of the data and the modeling algorithm being used.

In general, StandardScaler can help improve model accuracy by ensuring that all features are on the same scale. This is important for modeling algorithms that are sensitive to the scale of the features, such as those based on distances or gradients (e.g., k-nearest neighbors, gradient descent-based algorithms). By scaling the features to have zero mean and unit variance, StandardScaler can help these algorithms converge more quickly and produce better results.

On the other hand, StandardScaler may not be helpful or may even hurt model accuracy if the data is already well-scaled and the algorithm being used is not sensitive to feature scaling. In some cases, scaling the data can introduce noise or outliers that can negatively impact model performance. Additionally, if the data has a highly skewed distribution, scaling may not be appropriate and other techniques such as normalization or log transformation may be more effective.

Ultimately, the effect of StandardScaler on model accuracy will depend on the specific dataset and algorithm being used, so it's important to evaluate the impact of feature scaling on model performance in a systematic way.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

Versioning

We use SemVer for versioning. For the versions available, see the tags on this repository.

Authors

  • Samuel Roiz - Data cleaning, data analysis, math model - Profile

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

  • LendingClub
  • USC Data Visualization
  • CSUN Mathematics
