SC1015 - Mini Project

📃 About

This is our mini project for the module SC1015 Introduction to Data Science and Artificial Intelligence. We use the S&P 500 dataset, which we chose in order to compare the accuracy of a machine learning model against statistical analysis models in predicting stock prices.

📈 Problem Statement

Is a machine learning model or a statistical model better at predicting the S&P 500 index? Can we show that the S&P 500 is a reliable index to buy? Do the predictive models support the reliability of the S&P 500?

👥 Our Team

Name	Parts Done	GitHub ID
Brandon Jang Jin Tian	Data Preparation, LSTM, Information Presentation	@BrandonJang
Chung Zhi Xuan	Exploratory Analysis, ARIMA, SARIMA, GitHub	@spaceman03
Tee Qin Tong Bettina	Exploratory Analysis, ARIMA, Ethical Consideration	@BettinaTee03

😎 Data Science Pipeline: Information Presentation

Predictive models can predict S&P 500 index prices with some degree of accuracy. We compared a machine learning model (LSTM) with statistical models (ARIMA/SARIMA).

We found that the machine learning model (LSTM) seems to give the best predictions of S&P 500 index prices compared with the statistical models (ARIMA/SARIMA), as LSTM has the lowest Root Mean Square Error (RMSE). Between the statistical models, ARIMA seems to be more accurate than SARIMA, as it has the lower RMSE. However, the ARIMA forecast seems unrealistic because its predicted values show no seasonality.
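As a minimal sketch of the metric behind this comparison, RMSE can be computed as shown here. The function name `rmse`, the forecast names and all the numbers are illustrative assumptions, not values or code from our notebooks.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical index values and two hypothetical forecasts (not project data).
actual = [100.0, 102.0, 104.0, 106.0]
forecast_a = [101.0, 101.0, 105.0, 105.0]  # every point is off by 1
forecast_b = [98.0, 104.0, 102.0, 108.0]   # every point is off by 2

scores = {
    "forecast_a": rmse(actual, forecast_a),
    "forecast_b": rmse(actual, forecast_b),
}

# The forecast with the lowest RMSE is treated as the most accurate.
best = min(scores, key=scores.get)
print(scores, best)
```

A lower RMSE means the predictions sit closer to the actual values on average, which is the sense in which one model is called more accurate than another here.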

Comparing the datasets with and without outliers, the models generally give a lower RMSE when the outliers are excluded. However, excluding them does not give an accurate and realistic representation of the actual index prices. The outliers identified are continuous over a common period, so they are not anomalies but significant values that will influence future predictions of the index prices. Therefore, the dataset that includes the outliers is the more accurate and realistic representation of the actual index prices.
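The reasoning about continuous outliers can be sketched roughly as follows. This assumes an interquartile-range (IQR) rule for flagging outliers, which may differ from the rule used in our notebooks. The helper names `flag_outliers` and `longest_run` and the sample series are hypothetical.

```python
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


def longest_run(flags):
    """Length of the longest run of consecutive True flags."""
    best = current = 0
    for flagged in flags:
        current = current + 1 if flagged else 0
        best = max(best, current)
    return best


# Hypothetical time-ordered series: a stable stretch, then a sustained higher level.
prices = [float(p) for p in range(1, 21)] + [100.0, 101.0, 102.0]
flags = flag_outliers(prices)

# If every flagged point sits in one consecutive run, the "outliers" look like
# a sustained period rather than isolated anomalies.
is_single_block = sum(flags) > 0 and longest_run(flags) == sum(flags)
print(sum(flags), longest_run(flags), is_single_block)
```

When the flagged points form a single block like this, dropping them removes a whole period of the series instead of a few stray readings, which is why we keep them.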

🌐 Data Science Pipeline: Ethical Considerations

1. Possibility of Reinforced Human Bias
A predictive model uses past data to predict possible results in the long run. The past data may reflect human decisions or human-led economic downturns, so the data used by the algorithm could carry some of these biases.

2. Lack of Transparency
External parties need to understand how we gather and store the data and how we create the algorithm in order to use or own it. A lack of published information would be dishonest to them and might not give them a tailored experience.

3. Over-Reliance on Model
People looking to invest in the S&P 500 index may consider our model and rely on it when buying. If our model gives a prediction that is not close to the real data, investors might suffer a loss, especially since the models do not take into account real-life circumstances such as disease outbreaks, which might lead to unexpected changes in index prices. Hence, if potential investors over-rely on our model to predict future index prices, it might lead to undesirable outcomes for them.

📑 Conclusion

The dataset that includes the outliers is the more accurate and realistic representation of the actual index prices. The ARIMA forecast seems unrealistic because its predicted values show no seasonality. We conclude that the LSTM model is the most accurate of the three models we analysed. The LSTM model predicts a steady upward trend for S&P 500 index prices when the indicators, namely GDP, GNP, Real GDP, GNI, Consumer Spending and Private Domestic Investment, are on a positive upward trend. Hence, based on the data we analysed, the LSTM model supports the view that the S&P 500 is a reliable index to purchase.

💡 What We Have Learnt

Plotting interactive graphs using Plotly
Using the pmdarima library to build ARIMA and SARIMA models
Selecting relevant features or variables is essential to building accurate predictive models. Factors such as macroeconomic and commercial economies-of-scale indicators can be important inputs.
The performance of predictive models depends on the accuracy, completeness, and consistency of the training data, which underlines the crucial role of data quality.

Disclaimer

Plotly graphs cannot be viewed on GitHub. To see them, download the Jupyter notebooks and re-run them.

References

S&P 500 Historical Data: https://www.kaggle.com/datasets/henryhan117/sp-500-historical-data?select=SPX.csv
Python API for FRED: https://github.com/mortada/fredapi
Plotly: https://plotly.com/python/
LSTM Predictive Model: https://medium.com/the-handbook-of-coding-in-finance/stock-prices-prediction-using-long-short-term-memory-lstm-model-in-python-734dd1ed6827
ARIMA & SARIMA Predictive Models: https://github.com/0xpranjal/Stock-Prediction-using-different-models; https://towardsdatascience.com/time-series-forecasting-with-arima-sarima-and-sarimax-ee61099e78f6
