- Table of contents
- Introduction to Regressions
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Linear Regression
- Simple Linear Regression :
y = b0 + b1*x1
- Multiple Linear Regression :
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
- Polynomial Linear Regression:
y = b0 + b1*x1 + b2*x1^(2) + ... + bn*x1^(n)
- Importing libraries and datasets
- Splitting the dataset
- Training the simple Linear Regression model on the Training set
- Predicting and visualizing the test set results
- Visualizing the training set results
- Making a single prediction
- Getting the final linear regression equation (with values of the coefficients)
y = b0 + b1 * x1
- y: Dependent Variable (DV)
- x1: Independent Variable (IV)
- b0: Intercept Coefficient
- b1: Slope of Line Coefficient
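- A minimal sketch of how b0 and b1 can be estimated, assuming small illustrative x and y arrays (not the course dataset):
import numpy as np
#Hypothetical example data: years of experience vs. salary
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([40000, 45000, 52000, 58000, 63000], dtype=float)
#np.polyfit with degree 1 returns [b1 (slope), b0 (intercept)]
b1, b0 = np.polyfit(x, y, 1)
print(f"b0 (intercept): {b0:.2f}, b1 (slope): {b1:.2f}")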
- Using the `LinearRegression` model from `sklearn.linear_model`
from sklearn.linear_model import LinearRegression
#Create an instance of the simple linear regression model
regressor = LinearRegression()
#Fit the model on X_train and y_train
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Important note: the "predict" method always expects a 2D array as its input.
- Putting 12 into a double pair of square brackets makes the input exactly a 2D array:
regressor.predict([[12]])
print(f"Predicted Salary of Employee with 12 years of EXP: {regressor.predict([[12]])}" )
#Output: Predicted Salary of Employee with 12 years of EXP: [137605.23485427]
#Plot the actual test set values
plt.scatter(X_test, y_test, color = 'red', label = 'Actual Value')
#Plot the regression line
plt.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Linear Regression')
#Label the Plot
plt.title('Salary vs Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
#Show the plot
plt.show()
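- The "Visualizing the training set results" step from the list above follows the same pattern; a sketch assuming the same X_train, y_train and regressor:
#Plot the actual training set values
plt.scatter(X_train, y_train, color = 'red', label = 'Actual Value')
#Plot the regression line (the same line as on the test set plot)
plt.plot(X_train, regressor.predict(X_train), color = 'blue', label = 'Linear Regression')
#Label the Plot
plt.title('Salary vs Experience (Training Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
#Show the plot
plt.show()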
- General Formula:
y_pred = model.intercept_ + model.coef_ * x
print(f"b0 : {regressor.intercept_}")
print(f"b1 : {regressor.coef_}")
b0 : 25609.89799835482
b1 : [9332.94473799]
Linear Regression Equation: Salary = 25609.90 + 9332.94 × YearsExperience
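- A quick check that the general formula reproduces the earlier single prediction, assuming the same fitted regressor:
#Manual prediction using y = b0 + b1 * x1
manual_pred = regressor.intercept_ + regressor.coef_[0] * 12
print(manual_pred)  #~137605.23, matching regressor.predict([[12]])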
- Evaluation metrics are used to compare how well different algorithms perform on a particular dataset.
- For regression algorithms, three evaluation metrics are commonly used:
- R Square / Adjusted R Square => percentage of the output variability explained by the model
- Mean Square Error (MSE) / Root Mean Square Error (RMSE) => to compare performance between different regression models
- Mean Absolute Error (MAE) => to compare performance between different regression models
- R Square measures how much of the variability in the dependent variable can be explained by the model.
- Variance is a statistical measure defined as the average of the squared differences between each individual point and the expected value.
- R Square value: between 0 and 1, and a bigger value indicates a better fit between prediction and actual value.
- However, it does not take the overfitting problem into consideration.
- If your regression model has many independent variables, the model may be too complicated: it may fit the training data very well
- but perform badly on the testing data.
- Solution: Adjusted R Square
- It is introduced because R Square can be increased simply by adding more variables, which may lead to over-fitting of the model.
- It penalises additional independent variables added to the model and adjusts the metric to prevent the overfitting issue (see the sketch below).
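- A minimal sketch of the adjustment, assuming r2 is the ordinary R Square, n the number of observations and k the number of independent variables (the sample values below are only illustrative):
def adjusted_r2(r2, n, k):
    #Penalise R Square for each additional independent variable
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adjusted_r2(r2=0.79, n=50, k=5))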
- In Python, you can calculate R Square using the `statsmodels` or `sklearn` package
import statsmodels.api as sm
#Add a constant column to X so the model includes an intercept term
X_addC = sm.add_constant(X)
#Fit an Ordinary Least Squares model and read off R Square / Adjusted R Square
result = sm.OLS(Y, X_addC).fit()
print(result.rsquared, result.rsquared_adj)
# 0.79180307318 0.790545085707
- Around 79% of the variability of the dependent variable can be explained by the model, and the Adjusted R Square is roughly the same as the R Square, meaning the model is quite robust.
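- The same R Square can be obtained with the sklearn package; a minimal sketch assuming the Y_test and Y_predicted vectors used below:
from sklearn.metrics import r2_score
print(r2_score(Y_test, Y_predicted))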
- While R Square is a relative measure of how well the model fits the dependent variable, Mean Square Error (MSE) is an absolute measure of the goodness of fit.
- Root Mean Square Error (RMSE) is the square root of MSE.
- RMSE is used more commonly than MSE because, firstly, the MSE value can sometimes be too big to compare easily.
- Secondly, MSE is calculated from the square of the errors, so taking the square root brings the metric back to the same scale as the prediction error and makes it easier to interpret.
from sklearn.metrics import mean_squared_error
import math
print(mean_squared_error(Y_test, Y_predicted))
print(math.sqrt(mean_squared_error(Y_test, Y_predicted)))
# MSE: 2017904593.23
# RMSE: 44921.092965684235
- Compared to MSE or RMSE, MAE is a more direct representation of the sum of the error terms.
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(Y_test, Y_predicted))
#MAE: 26745.1109986
Before choosing Linear Regression, you need to consider the assumptions below (a sketch for checking multicollinearity follows the list):
- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Lack of multicollinearity
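- As a hedged example of checking the last assumption, multicollinearity can be inspected with variance inflation factors from statsmodels, assuming X is already a purely numeric feature matrix:
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
#Add the intercept column, then compute one VIF per feature (skipping the constant at index 0)
X_const = sm.add_constant(np.array(X, dtype=float))
vifs = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]
print(vifs)  #Values well above ~10 usually indicate multicollinearity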
- Since `State` is a categorical variable => we need to convert it into dummy variables (see the sketch below)
- No need to include all dummy variables in our Regression Model => always omit one dummy variable (to avoid the dummy variable trap)
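- A minimal sketch of the dummy-variable conversion with scikit-learn, assuming State is the column at index 3 of X:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
#One-hot encode the categorical State column and keep the other columns unchanged
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))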
- H0: Null Hypothesis (Universe)
- H1: Alternative Hypothesis (Universe)
- For example:
- Assume the Null Hypothesis is true (or we are living in the Null Universe)
- Assume
- 5 methods of Building Models
- Throw in all variables in the dataset
- Usage:
- Prior knowledge about this problem; OR
- You have to (e.g. a company framework requires it)
- Prepare for Backward Elimination
- Step 1: Select a significance level (SL) to stay in the model (e.g: SL = 0.05)
# Building the optimal model using Backward Elimination
import statsmodels.api as sm
# Avoiding the Dummy Variable Trap by excluding the first column of the dummy variables
# Note: in general you don't have to remove a dummy variable column manually because Scikit-Learn takes care of it.
X = X[:, 1:]
#Prepend a column of "1"s as the first column of X using np.append (50 = number of observations)
#Since y = b0*(1) + b1 * x1 + b2 * x2 + ... + bn * xn, b0 is a constant and can be re-written as b0 * (1)
#np.append(arr = the array to append to, values = column(s) to be added, axis = 0 for rows / 1 for columns)
#np.ones((rows, columns)).astype(int) => .astype(int) converts the array of 1s to integer type to avoid a data type error
X = np.append(arr = np.ones((50,1)).astype(int), values = X, axis = 1)
#Initialize X_opt with the original X by including all the columns from #0 to #5
X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float)
#If you are writing your code in Google Colab, the datatype of the features is not set to float,
#hence this explicit cast is important: X_opt = np.array(X[:, [0, 1, 2, 3, 4, 5]], dtype=float)
- Step 2: Fit the full model with all possible predictors
#OrdinaryLeastSquares
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
- Step 3: Consider Predictor with Highest P-value
- If P > SL, go to Step 4, otherwise go to [FIN : Your Model Is Ready]
- Step 4: Remove the predictor
#Remove column index 2 from X_opt since it has the highest P-value (0.99), which is > SL (0.05)
X_opt = np.array(X[:, [0, 1, 3, 4, 5]], dtype=float)
#OrdinaryLeastSquares
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
- Step 5: Re-fit the model without this variable, then return to Step 3 (a sketch automating Steps 3 to 5 follows below)
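- A hedged sketch automating Steps 3 to 5, assuming X already contains the prepended column of 1s and SL = 0.05:
import numpy as np
import statsmodels.api as sm
def backward_elimination(X, y, sl=0.05):
    X_opt = np.array(X, dtype=float)
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
        p_values = np.asarray(regressor_OLS.pvalues)
        if p_values.max() <= sl:
            return X_opt, regressor_OLS  #[FIN : Your Model Is Ready]
        #Remove the predictor with the highest P-value and re-fit
        X_opt = np.delete(X_opt, p_values.argmax(), axis=1)
X_opt, regressor_OLS = backward_elimination(X, y)
print(regressor_OLS.summary())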
- Step 1: Select a significance level (SL) to enter in the model (e.g: SL = 0.05)
- Step 2: Fit all simple regression models (y ~ xn). Select the one with Lowest P-value for the independent variable.
- Step 3: Keep this variable and fit all possible regression models with one extra predictor added to the one(s) you already have.
- Step 4: Consider the predictor with the Lowest P-value. If P < SL (i.e. the model is good), go to Step 3 (to add a 3rd variable into the model, and so on with all the variables left), otherwise go to [FIN : Keep the previous model]. A sketch of this loop follows below.
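- A hedged sketch of the Forward Selection loop with statsmodels, assuming column 0 of X is the column of 1s and SL = 0.05:
import numpy as np
import statsmodels.api as sm
def forward_selection(X, y, sl_enter=0.05):
    X = np.array(X, dtype=float)
    selected = [0]  #start with the constant column only
    remaining = list(range(1, X.shape[1]))
    while remaining:
        best_p, best_col = None, None
        for col in remaining:
            model = sm.OLS(endog=y, exog=X[:, selected + [col]]).fit()
            p = np.asarray(model.pvalues)[-1]  #P-value of the candidate predictor
            if best_p is None or p < best_p:
                best_p, best_col = p, col
        if best_p < sl_enter:
            selected.append(best_col)  #keep the variable and try to add another
            remaining.remove(best_col)
        else:
            break  #[FIN : Keep the previous model]
    return selected
print(forward_selection(X, y))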
- Step 1: Select a significant level to enter and to stay in the model:
e.g: SLENTER = 0.05, SLSTAY = 0.05
- Step 2: Perform the next step of Forward Selection (new variables must have: P < SLENTER to enter)
- Step 3: Perform ALL steps of Backward Elimination (old variables must have P < SLSTAY to stay), then return to Step 2.
- Step 4: When no new variables can enter and no old variables can exit => [FIN : Your Model Is Ready]
- Step 1: Select a criterion of goodness of fit (e.g. the Akaike criterion)
- Step 2: Construct all possible regression models: 2^(N) - 1 total combinations, where N is the total number of variables
- Step 3: Select the one with the best criterion => [FIN : Your Model Is Ready] (see the sketch below)
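- A hedged sketch of this exhaustive search using the Akaike criterion in statsmodels, assuming a numeric X without the column of 1s (practical only for small N, since 2^(N) - 1 models are fitted):
import numpy as np
import statsmodels.api as sm
from itertools import combinations
def best_subset_by_aic(X, y):
    X = np.array(X, dtype=float)
    best_aic, best_cols = None, None
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            model = sm.OLS(endog=y, exog=sm.add_constant(X[:, cols])).fit()
            if best_aic is None or model.aic < best_aic:
                best_aic, best_cols = model.aic, cols
    return best_cols, best_aic
print(best_subset_by_aic(X, y))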
- Note: Backward Elimination is irrelevant in Python, because the Scikit-Learn library automatically takes care of selecting the statistically significant features when training the model to make accurate predictions.
#No need for Feature Scaling (FS) in the multiple regression model: y = b0 + b1 * x1 + b2 * x2 + b3 * x3,
#since the coefficients (b0, b1, b2, b3) compensate for the different scales of the features.
from sklearn.model_selection import train_test_split
# No need to remove a dummy variable column manually because Scikit-Learn takes care of it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#LinearRegression will take care of the "dummy variable trap" and feature selection
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
- Since this is multiple linear regression, we cannot visualize the results by drawing a simple graph.
#To display the y_pred vs y_test vectors side by side
np.set_printoptions(precision=2) #Display values rounded to 2 decimal places
#np.concatenate((tuple of rows/columns you want to concatenate), axis = 0 for rows and 1 for columns)
#y_pred.reshape(len(y_pred),1) : to convert y_pred to column vector by using .reshape()
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))
print(regressor.coef_)
print(regressor.intercept_)
[ 8.66e+01 -8.73e+02 7.86e+02 7.73e-01 3.29e-02 3.66e-02]
42467.52924853204
Equation: Profit = 86.6 x DummyState1 - 873 x DummyState2 + 786 x DummyState3 + 0.773 x R&D Spend + 0.0329 x Administration + 0.0366 x Marketing Spend + 42467.53
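- A single prediction works the same way as in the simple case, but the 2D input must list every feature in the training-column order; a sketch assuming the three dummy-state columns come first, followed by R&D Spend, Administration and Marketing Spend (the numbers are only illustrative):
print(regressor.predict([[1, 0, 0, 160000, 130000, 300000]]))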
- Polynomial Linear Regression:
y = b0 + b1*x1 + b2*x1^(2) + ... + bn*x1^(n)
- Used for datasets with a non-linear relationship that can still be modeled as a polynomial in a single feature (e.g. a salary scale); see the sketch below.
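- A minimal sketch of fitting a Polynomial Linear Regression with scikit-learn, assuming a single-feature X (e.g. position level), a target y (e.g. salary) and an illustrative degree of 4:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
#Expand x1 into [1, x1, x1^2, x1^3, x1^4]
poly_features = PolynomialFeatures(degree=4)
X_poly = poly_features.fit_transform(X)
#The model is still linear in the coefficients b0..bn
poly_regressor = LinearRegression()
poly_regressor.fit(X_poly, y)
#Predict for an in-between value, e.g. level 6.5
print(poly_regressor.predict(poly_features.transform([[6.5]])))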