The data was gathered with the yfinance library. It covers 1,258 trading dates, each with open price, high price, low price, close price, volume, dividends, and stock splits. The dividends and stock splits columns contained only zeros, so both columns were dropped.
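
As a rough illustration of the column clean-up described above, the sketch below drops every column that contains only zeros. The helper name `drop_all_zero_columns` and the toy values are hypothetical, and the column names assume the usual yfinance naming (`Open`, `High`, `Low`, `Close`, `Volume`, `Dividends`, `Stock Splits`); the real download step is not shown here.

```python
import pandas as pd


def drop_all_zero_columns(df):
    """Return a copy of df without the columns whose values are all zero."""
    all_zero = [col for col in df.columns if (df[col] == 0).all()]
    return df.drop(columns=all_zero)


# Made-up rows standing in for the downloaded price history (not real data).
toy = pd.DataFrame(
    {
        "Open": [100.0, 101.5, 102.0],
        "High": [102.0, 103.0, 104.5],
        "Low": [99.0, 100.5, 101.0],
        "Close": [101.0, 102.5, 104.0],
        "Volume": [3_500_000, 3_600_000, 3_400_000],
        "Dividends": [0.0, 0.0, 0.0],
        "Stock Splits": [0.0, 0.0, 0.0],
    },
    index=pd.to_datetime(["2015-01-02", "2015-01-05", "2015-01-06"]),
)

cleaned = drop_all_zero_columns(toy)
print(list(cleaned.columns))
```

Checking the values rather than hard-coding the two column names keeps the step safe if a different date range does contain a dividend or a split.
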
A candlestick chart of all the data shows a general upward trend in open, high, low, and close prices. The lowest price was a low of $1,812.29 on January 20, 2016, and the highest price was a high of $3,247.93 on December 27, 2019.
The boxplot comparison of open, high, low, and close prices across the five years shows that most years are distinct across the four prices, although 2015 and 2016 are relatively similar. The upward trend in prices is stronger from 2017 to 2019. The spread of prices for 2019 is quite wide compared to the other years and more centered.
The distribution of trading volume is slightly right-skewed, with most daily volumes around 3.5 billion shares. Volume in November and December, at the end of the year, spans a wider range than in other months. The LOWESS (locally weighted scatterplot smoothing) line shows a weak general pattern across years in which the most shares are traded at the beginning and the end of the year.
From 2015 to 2019, there appears to be a linear relationship among the three prices.
All input data (open, high, and low prices) and output data (close prices) were scaled to the range 0 to 1 using MinMaxScaler. The function from https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/ was used to transform the data into the shape an RNN expects: (batch size, time steps, input size). The train shape was (956, 90, 3) and the test shape was (202, 90, 3). The train data was taken from the first 80% of the data, and the remaining 20% was used as the test dataset. A validation set was designated as 20% of the training data.
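
The preprocessing steps above can be sketched roughly as follows. This is not the exact function from the linked tutorial: `min_max_scale` and `split_sequences` are illustrative re-implementations in NumPy, the data is a synthetic random walk, and the sizes (100 rows, 10 time steps) are small toy values chosen for readability, so the resulting shapes differ from the ones reported above.

```python
import numpy as np


def min_max_scale(values):
    """Scale each column to [0, 1], the same formula MinMaxScaler uses by default."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(axis=0), values.max(axis=0)
    return (values - lo) / (hi - lo)


def split_sequences(features, target, n_steps):
    """Build sliding windows of n_steps rows; each window is paired with the
    target value on the window's last row."""
    X, y = [], []
    for end in range(n_steps, len(features) + 1):
        X.append(features[end - n_steps:end])
        y.append(target[end - 1])
    return np.array(X), np.array(y)


# Synthetic stand-in for the price table: columns are open, high, low, close.
rng = np.random.default_rng(0)
n_rows, n_steps = 100, 10  # toy sizes, not the values used in the project
prices = 100 + np.cumsum(rng.normal(size=(n_rows, 4)), axis=0)

features = min_max_scale(prices[:, :3])  # inputs: open, high, low
target = min_max_scale(prices[:, 3])     # output: close

# Chronological split: first 80% of rows for training, last 20% for testing.
split = int(n_rows * 0.8)
X_train, y_train = split_sequences(features[:split], target[:split], n_steps)
X_test, y_test = split_sequences(features[split:], target[split:], n_steps)

print(X_train.shape, X_test.shape)
```

Splitting by position rather than at random keeps every test window later in time than the training windows, which matters for time series.
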
All three RNN models were compiled with the adam optimizer and MSE loss, and were trained for up to 50 epochs with an EarlyStopping(monitor='loss') callback. The rnn1 model has only 3 SimpleRNN layers with 40 hidden neurons. The rnn2 model has 4 LSTM layers with 3 hidden neurons each and one SimpleRNN layer. The rnn3 model has two LSTM layers with relu activation and 100 hidden neurons, plus one dense layer.
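
For illustration, a model along the lines of rnn3 might be built in Keras as sketched below. Several details are assumptions rather than facts from the description: that both LSTM layers have 100 units, that the first layer returns full sequences so the second can consume them, that the dense layer has a single output for the close price, and that the input shape is 90 time steps by 3 features as in the reported train shape. The names `build_rnn3` and `train` are hypothetical.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_rnn3(n_steps=90, n_features=3, units=100):
    """Two stacked LSTM layers with relu activation, then one dense output."""
    model = keras.Sequential([
        keras.Input(shape=(n_steps, n_features)),
        # return_sequences=True passes every time step to the next LSTM layer.
        layers.LSTM(units, activation="relu", return_sequences=True),
        layers.LSTM(units, activation="relu"),
        layers.Dense(1),  # assumed: one predicted close price per sample
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def train(model, X_train, y_train):
    """Fit for up to 50 epochs, stopping early when training loss stops improving."""
    early_stop = keras.callbacks.EarlyStopping(monitor="loss")
    return model.fit(
        X_train,
        y_train,
        epochs=50,
        validation_split=0.2,  # 20% of the training data held out for validation
        callbacks=[early_stop],
        verbose=0,
    )


model = build_rnn3()
```

Monitoring `loss` stops on the training loss; monitoring `val_loss` instead would stop on the validation split, which is a different design choice from the one described.
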
Predicting future close prices, or any other stock prices, helps gauge whether to hold or sell stocks. Determining the appropriate time step, or number of days to include in each sample, is a key part of creating a better RNN model. The ideal timestep depends on the breadth of training data and
None of the models had very promising MSE scores. I chose to evaluate the models with a 50-day time step, which may have negatively impacted the models' ability to predict day-to-day prices. I would also try increasing the number of epochs in order to bring down the MSE loss.
| model | train/test | MSE |
|---|---|---|
| rnn1 | train | 1120.32 |
| rnn1 | test | 2541.25 |
| rnn2 | train | 1547.56 |
| rnn2 | test | 3992.39 |
| rnn3 | train | 859.57 |
| rnn3 | test | 2111.09 |
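
To make the scores in the table easier to read, here is a minimal sketch of how MSE is computed and how its square root (RMSE) puts the error back into the units of the target. The sample prices are made up, and treating the table values as squared price units is an assumption: the write-up does not say whether predictions were converted back from the 0 to 1 scale before scoring.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up actual and predicted close prices (illustrative only).
actual = [3000.0, 3100.0, 3200.0]
predicted = [3010.0, 3080.0, 3230.0]

toy_mse = mse(actual, predicted)
toy_rmse = math.sqrt(toy_mse)

# RMSE of the rnn3 test score from the table, under the price-unit assumption.
rnn3_test_rmse = math.sqrt(2111.09)
print(toy_mse, toy_rmse, rnn3_test_rmse)
```

Under that assumption, the rnn3 test MSE of 2111.09 corresponds to an RMSE of roughly 46 price units.
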