
Demo of linear regression without complex validations added to it. DISCLAIMER: This script is for templating purposes and should not be used in production without the proper statistical knowledge.


hriva/sample-linear-regression


Description

A small example of a simple linear regression script.

DISCLAIMER: This script is for demonstrative purposes only. It should not be used in production without a proper analysis of the variables involved, including but not limited to their autocorrelation, multicollinearity, correlation, and seasonality.

Install

This project requires Python. It is recommended to create a virtual environment to use it.

git clone --depth 1 https://github.com/hriva/sample-linear-regression.git 
cd sample-linear-regression

# Create the virtual environment inside the cloned repository
virtualenv -p python3 sample-linear-regression
source sample-linear-regression/bin/activate
# Stay in the repository root: requirements.txt and src/ live here
pip3 install -r requirements.txt

# Run
chmod +x src/bitcoin-ethereum.py
./src/bitcoin-ethereum.py

Explanation

Linear Regression

A small sample implementation of linear regression. In this example, the monthly Bitcoin price is the dependent variable, i.e., the variable we want to predict. The monthly Ethereum price is the independent variable, i.e., the variable we use to predict the Bitcoin price.

DISCLAIMER: This example is simplified because Yahoo Finance offers good-quality data. Usually more steps are needed to clean and wrangle the data.

1. Import libraries

import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

2. Load the data from csv files.

# Ingest
## Data paths.
DATA = "src/assets/BTC-USD.csv" 
DATA1 = "src/assets/ETH-USD.csv"

## Make pandas read the data.
df1 = pd.read_csv(DATA, index_col="Date", parse_dates=True).sort_index(ascending=True)
df2 = pd.read_csv(DATA1, index_col="Date", parse_dates=True).sort_index(ascending=True)

This ingests the data fetched from Yahoo Finance (the data has no blank values). During the import, the data is formatted as a time series by setting the dates as the index. The index is then sorted in ascending order, from oldest to newest, so that the positional split in step 4 trains on the older observations and tests on the newer ones.
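To illustrate what index_col, parse_dates, and sort_index do, here is a minimal sketch that reads a tiny made-up CSV from memory instead of the repository files. The sample rows and the names raw and demo are illustrative assumptions, not data from the project.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, deliberately out of chronological order.
raw = io.StringIO(
    "Date,Adj Close\n"
    "2019-03-01,3.0\n"
    "2019-01-01,1.0\n"
    "2019-02-01,2.0\n"
)

# Same read pattern as above: the Date column becomes the index,
# is parsed as dates, and is then sorted from oldest to newest.
demo = pd.read_csv(raw, index_col="Date", parse_dates=True).sort_index(ascending=True)

print(type(demo.index).__name__)
print(demo["Adj Close"].tolist())
```

After sorting, the rows run from the January entry to the March entry, regardless of the order in the file.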

df1.head()
df2.head()

Date        Open        High        Low         Close       Adj Close   Volume
2019-02-01  107.147682  165.549622  102.934563  136.746246  136.746246  101430995445
2019-03-01  136.836243  149.613235  125.402702  141.514099  141.514099  138882123600
2019-04-01  141.465485  184.377853  140.737564  162.166031  162.166031  204556824026
2019-05-01  162.186554  287.201630  159.660217  268.113556  268.113556  314349041886
2019-06-01  268.433350  361.398682  229.257431  290.695984  290.695984  270589672710

3. Preprocess the data for the regression.

## Get the correlation
df1["Adj Close"].corr(df2["Adj Close"])
# Output: 0.9187847614681434
## Get the necessary predictive values only.
# Use double brackets to get a DataFrame (2-D) instead of a pandas Series (1-D)
dfx = df2[["Adj Close"]]  # Load the Ethereum price as x (independent)
dfy = df1[["Adj Close"]]  # Load the Bitcoin price as y (dependent)

Viewing the data shows several price measures for each month. We only need the "Adj Close" column, so we drop the rest.

Notice that, unlike in the correlation call, we select the columns with double brackets. Single brackets return a one-dimensional Series, while scikit-learn expects the predictors as a two-dimensional array (a matrix of samples by features), even when there is only one predictor.
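The difference between the two selection styles is easy to see on a small frame. This is a minimal sketch; the prices frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for one of the price tables.
prices = pd.DataFrame({"Adj Close": [1.0, 2.0, 3.0], "Volume": [10, 20, 30]})

as_series = prices["Adj Close"]   # single brackets: 1-D Series
as_frame = prices[["Adj Close"]]  # double brackets: 2-D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The Series has shape (3,), while the one-column DataFrame has shape (3, 1), which is the samples-by-features layout.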

4. Split the sets

train_size = 0.8  # use 80 percent of the rows to train the regression
if dfx.shape[0] != dfy.shape[0]:
    # A bare `exit` does nothing; raising SystemExit actually stops the script.
    raise SystemExit("Sample Sizes ERROR")
split_at = round(dfx.shape[0] * train_size)  # We only need the rows.

# Chronological split: the first rows train, the remaining rows test.
x_train, x_test = dfx.iloc[:split_at], dfx.iloc[split_at:]
y_train, y_test = dfy.iloc[:split_at], dfy.iloc[split_at:]

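Why not use the whole series for the regression? Holding out the last 20 percent leaves data the model has never seen, so we can check how well it predicts new values instead of how closely it matches the data it was fitted on (overfitting).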
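As a rough illustration of the positional split, the sketch below applies the same 80/20 cut to a made-up monthly series. The names toy, split_at, train, and test are assumptions for this example only. Because the rows are taken by position and never shuffled, every training date precedes every test date.

```python
import pandas as pd

# Hypothetical monthly series with 10 rows.
idx = pd.date_range("2020-01-01", periods=10, freq="MS")
toy = pd.DataFrame({"Adj Close": range(10)}, index=idx)

split_at = round(len(toy) * 0.8)  # row position where the test set starts
train, test = toy.iloc[:split_at], toy.iloc[split_at:]

print(len(train), len(test))
print(train.index.max() < test.index.min())
```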

5. Regression

# Regression
regressor = LinearRegression()
regressor.fit(X=x_train, y=y_train)

Create a LinearRegression instance and fit it to the training data.
x is the Ethereum price.
y is the Bitcoin price.
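Once fitted, the model stores the slope and intercept of the line in its coef_ and intercept_ attributes. The sketch below fits made-up data that follows y = 2x + 1 exactly, so the recovered values are easy to check. The arrays and the names model, slope, and intercept are illustrative assumptions, not values from the project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data that follows y = 2x + 1 exactly.
x = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2-D: samples by features
y = 2.0 * x + 1.0                           # 2-D target, shaped like dfy

model = LinearRegression().fit(x, y)

# With a 2-D target, coef_ has shape (1, 1) and intercept_ has shape (1,).
slope = model.coef_[0][0]
intercept = model.intercept_[0]
print(round(slope, 6), round(intercept, 6))
```

On the real data, the same two attributes of regressor describe how the predicted Bitcoin price changes per unit of Ethereum price.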

6. Predict

y_pred = regressor.predict(x_test)
print(y_pred)
[[29754.22137026]
 [30433.23886914]
 [30398.68553649]
 [31129.1065507 ]
 [30176.4599037 ]
 [27572.75954253]
 [27888.41823535]
 [29685.41814795]
 [32605.41128462]
 [35436.57757274]
 [35799.63595354]
 [34648.96926019]]

Create an array with the predictions for the test (Validation) set.
Print the predictions.
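The printed predictions are hard to judge on their own. One possible way to compare them with the held-out values is to use scikit-learn's metric functions, as in this minimal sketch. The numbers are made up and stand in for y_test and y_pred; the original script does not compute these metrics.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical held-out values and predictions, shaped like y_test and y_pred.
y_true = np.array([[100.0], [110.0], [120.0]])
y_hat = np.array([[102.0], [108.0], [123.0]])

mae = mean_absolute_error(y_true, y_hat)  # average absolute miss, in price units
r2 = r2_score(y_true, y_hat)              # 1.0 would be a perfect fit

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.3f}")
```

In the script itself, the equivalent calls would take y_test and y_pred as arguments.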

7. Plot.

# Training Sets
plt.scatter(x_train, y_train, color="red")
plt.plot(x_train, regressor.predict(x_train), color="blue")
plt.title("Bitcoin vs Ethereum")
plt.xlabel("Ethereum Closing Price")
plt.ylabel("Bitcoin Closing Price")
plt.show()


# Test set
plt.scatter(x_test, y_test, color="red")
# Same line as in the training plot: the model was fitted on the training set only.
plt.plot(x_train, regressor.predict(x_train), color="blue")
plt.title("Bitcoin vs Ethereum")
plt.xlabel("Ethereum Closing Price")
plt.ylabel("Bitcoin Closing Price")
plt.show()

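[Figure: Bitcoin vs Ethereum, training set. x-axis: Ethereum Closing Price; y-axis: Bitcoin Closing Price]

[Figure: Bitcoin vs Ethereum, test set. x-axis: Ethereum Closing Price; y-axis: Bitcoin Closing Price]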
