Merge pull request #3 from JeremyBrent/jb/price_prediction_model
Jb/price prediction model
JeremyBrent authored Sep 27, 2024
2 parents 154f906 + b9e4a82 commit 07793c3
Showing 12 changed files with 763 additions and 120 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
venv
.idea
__pycache__
102 changes: 87 additions & 15 deletions README.md
@@ -18,11 +18,11 @@ I didn't require approvers on the branch protection rule due to the fact that th
to review my code .... this would not be the case in a production environment and that would be
a rule in said production environment.

With more time, some things I would build upon would be:
1. Added a comprehensive logging functionality, this is critical to production-worthy code
2. Expanding unittest portfolio would need to build out
3. Further developing the Github actions if we were deploying this model as a service

With more time, I would:
1. Add comprehensive logging functionality; this is critical to production-worthy code
2. Expand the unittest portfolio
3. Further develop the GitHub Actions if we were deploying this model as a service
4. Complete any todos noted throughout the codebase

# FSA

@@ -37,19 +37,17 @@ a future user to need to obtain in order to run this code base
### Future Directions
Ground truth data should be augmented with datasets found
[here](https://dl.acm.org/doi/10.1145/3649451#sec-4-2).
Most notably, Financial PhraseBank is one primary dataset for financial area
Most notably, Financial PhraseBank is one of the primary datasets used for financial
sentiment analysis ([Ding et al., 2022](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-15);
[Ye, Lin & Ren, 2021](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-50)),
which was created by [Malo et al. (2014)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-33).
Financial PhraseBank contains 4,845 news sentences found on the LexisNexis database and marked
Financial PhraseBank contains 4,845 news sentences found on the LexisNexis database, annotated
by 16 people with finance backgrounds. Annotators were required to label the sentence as positive,
negative, or neutral market sentiment
[Malo et al. (2014)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-33).
All 4845 sentences were kept with higher than 50% agreement

It is also critical that we further analyze the ground truth data to assert that it is accurate.
The 4,845 sentences in the dataset all had higher than 50% inter-annotator agreement.

To construct a more robust system, it's critical that we move away for csv files in Github
To construct a more robust system, it's critical that we move away from csv files in Github
to a database. I contemplated implementing a local postgres DB to store the ground truth data,
but determined that that would be out of scope of this project.

@@ -59,12 +57,12 @@ Models tested were derived from [this literature review](https://dl.acm.org/doi/
For example, FinBert was directly mentioned [here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4-5)
and VADER was discussed [here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4-4). Finbert and Roberta
were two of the top performing models discussed in this literature review [Du et al. (2024)](https://dl.acm.org/doi/10.1145/3649451#tab3),
and used as a top performer in this research [Xiao et al. (2023)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/).
and Finbert was used as a top performer in this research [Xiao et al. (2023)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/).


### Run
Running experiments for choosing the most accurate FSA model on the ground truth data, defined
[here](#data), can be triggered using the following:
To run experiments that determine the most accurate FSA model on the ground truth data, defined
[here](#data), run the following:
```python
from src.experiment import Experiment

@@ -84,8 +82,82 @@ More FSA models can be experimented on. To include more models in the `Experimen
add the model to `experimenter.models` and any new key-value pairs that are needed to run
inference with the new model.

Any new models should be replicated based on existing research found
Any new models should be replicated based on existing research. Some of these models can be found
[here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4).

We should also implement more sophisticated metrics for
measuring the performance of the FSA models. Currently, we are only using raw accuracy.
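For illustration, a macro-averaged F1 score could be reported alongside raw accuracy. The snippet below is a minimal sketch with placeholder labels, not code from this repository:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder gold labels and predictions, for illustration only
y_true = ["positive", "neutral", "negative", "neutral", "positive"]
y_pred = ["positive", "neutral", "positive", "neutral", "negative"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro F1 weights each class equally, which matters if neutral dominates the data
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```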

# Price Prediction Model (PP)

## Model

### Description
The Price Prediction model is trained to perform a binary classification that determines whether
the price will end higher or lower for a given day.
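A minimal sketch of how such a binary target could be constructed, assuming "higher or lower" means the close relative to the open for the same day (the repo's actual target definition may differ):

```python
import yfinance as yf

# Download daily OHLC data and label each day 1 if it closed above its open, else 0.
# This is an illustrative assumption, not the repo's own target construction.
df = yf.download("AAPL", period="5y", interval="1d")
df["target"] = (df["Close"] > df["Open"]).astype(int)
```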

### Performance
<p id="suspect-data">Our highest performing model was a RandomForestClassifier with a test accuracy score around 72%.
A pretty decent score consdering the scope of this project. However, this model performed
significantly better on the test set, almost 20% better, this can be seen in
`./experiments/experiments.csv`, which is suspect ... This
will need to be investigated for data leakage, changes in data distributions between the test
set and the train set, etc.</p>

### Features
The current features of the model (feature extraction is discussed further [below](#feature-extract))
include a [50 and 200 day Simple Moving Average](https://www.investopedia.com/terms/s/sma.asp),
the [Relative Strength Index](https://www.investopedia.com/terms/r/rsi.asp),
[On Balance Volume](https://www.investopedia.com/terms/o/onbalancevolume.asp),
the [Upper, Middle & Lower Bollinger Bands](https://www.investopedia.com/terms/b/bollingerbands.asp) and
a Normalized Sentiment Score of news data.
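A rough sketch of how these indicators could be computed with `pandas_ta` (pinned in `requirements.txt`); the column names and parameters below are illustrative assumptions, not the repo's implementation in `src.Data._price_feature_extraction()`:

```python
import pandas_ta as ta
import yfinance as yf

df = yf.download("AAPL", period="5y", interval="1d")

df["sma_50"] = ta.sma(df["Close"], length=50)     # 50 day Simple Moving Average
df["sma_200"] = ta.sma(df["Close"], length=200)   # 200 day Simple Moving Average
df["rsi"] = ta.rsi(df["Close"], length=14)        # Relative Strength Index
df["obv"] = ta.obv(df["Close"], df["Volume"])     # On Balance Volume
bbands = ta.bbands(df["Close"], length=20)        # lower, middle and upper Bollinger Bands
df = df.join(bbands)
```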

The Normalized Sentiment Score equals
`sentiment / (original date - effective date + 1)`, where the original date is the date on which
the news was published and the effective date is the theoretical date when this news will next
affect the market. Sentiment is a value between `-1.0` and `1.0` depending on the category (negative or positive)
and is normalized to a value between `-.05` and `.05` if the most salient category is neutral.
Calculation of the effective date can be found at `src.Utils.convert_datetime()` and extraction of
the sentiment can be found at `src.Model._fsa_extract_results()`.
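A minimal sketch of this calculation, interpreting the denominator as the day gap between publication and the effective date plus one; the function below is hypothetical and is not the code in `src.Utils.convert_datetime()` or `src.Model._fsa_extract_results()`:

```python
from datetime import datetime


def normalized_sentiment(sentiment: float, original_date: datetime, effective_date: datetime) -> float:
    """Decay the sentiment score by the number of days until the news can affect the market."""
    day_gap = abs((effective_date - original_date).days)
    return sentiment / (day_gap + 1)


# News published Friday evening first affects Monday's session (illustrative dates)
print(normalized_sentiment(0.8, datetime(2024, 9, 20, 18, 0), datetime(2024, 9, 23, 9, 30)))
```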

### Run

#### Inference
TODO: ADD SECTION HERE

#### Experiment
To run experiments that find the best-performing Price Prediction model, run the following code:
```python
from src.experiment import Experiment
exp = Experiment()
exp.pp_experiment(ticker='AAPL', period='5y')
```
The code above will perform a grid search for hyperparameter tuning over various models, select the
best model with the best hyperparameters, and save that model to disk.
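For illustration, the grid search could look roughly like the sketch below. The parameter grid mirrors values recorded in `experiments/experiments.csv`, but the placeholder data and the exact grid are assumptions, not the contents of `exp.pp_models`:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder features and binary target; in practice these come from the feature extraction code
rng = np.random.default_rng(42)
X_train = rng.random((100, 8))
y_train = rng.integers(0, 2, size=100)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3], "max_features": [5, "sqrt"]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Persist the best estimator, as the README describes (path is illustrative)
with open("models/RandomForestClassifier.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```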

### Future Directions

1. <p id="feature-extract">We will need to run more through experimentation on our features to determine if any need to
added or removed. </p>
- Some things that need to be determined are correlations between features. For
testing numerical features, Pearson Correlation Coefficient or Spearman or Kendall Correlation
(for Non-linear Relationships) can be used. For categorical data, a Chi Square test can be used.
Tree based models are less sensitive to Multicollinearity compared to a logistic regression, but we
should still have a sense of the distributions of our training data. Multicollienarity can
result in unstable coefficients, where changes to the correlated features can have significant
impacts on model performance, or over-fitting, where the model simply learns the same pattern from
many of the features.
- To add more or change features in our model, you will need to update the following code:
`src.Data._price_feature_extraction()`, `src.Data._news_feature_extraction()`
and `src.Model.pp_extract_features()`

2. We should test more models; the current solution uses no neural network architectures.
- To add more models to our experimentation class, simply add the model and its respective key-value pairs
to `exp.pp_models`.

3. Investigate the significantly higher performance of the RandomForest model on the test set compared to the train set,
[mentioned above](#suspect-data).

4. We had issues running GridSearch on XGBoost and LightGBM, where we were getting the error
`Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)`. This would need to be more
thoroughly debugged. These models are currently commented out in `exp.pp_models` due to this.
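As a rough sketch of the correlation checks mentioned in item 1 above (using synthetic placeholder data, not the repo's feature set):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr, spearmanr

# Synthetic numerical features, with one deliberately correlated pair
rng = np.random.default_rng(42)
df = pd.DataFrame({"sma_50": rng.normal(size=200), "rsi": rng.normal(size=200)})
df["obv"] = 0.7 * df["sma_50"] + rng.normal(scale=0.5, size=200)

print(pearsonr(df["sma_50"], df["obv"]))   # linear correlation
print(spearmanr(df["rsi"], df["obv"]))     # rank-based (monotonic) correlation

# For two categorical features, a chi-square test of independence
cat_a = rng.integers(0, 2, size=200)
cat_b = rng.integers(0, 3, size=200)
chi2, p_value, dof, _ = chi2_contingency(pd.crosstab(cat_a, cat_b))
print(chi2, p_value)
```
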
Binary file added data/AAPL_news_training_data.pkl
16 changes: 16 additions & 0 deletions experiments/experiments.csv
@@ -0,0 +1,16 @@
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:36:52,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:36:52,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:36:52,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:39:31,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:39:31,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:39:31,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:42:04,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:42:04,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:42:04,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:43:06,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:43:06,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:43:06,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
Binary file added models/RandomForestClassifier.pkl
10 changes: 7 additions & 3 deletions requirements.txt
@@ -1,9 +1,13 @@
yfinance==0.2.3
nltk
textblob
nltk==3.9.1
textblob==0.18.0.post0
transformers
torch==2.2.2
scipy==1.13.1
numpy==1.26.4
ipython==8.18.1
pandas_ta==0.3.14b
pandas_ta==0.3.14b
urllib3==1.25.11
scikit-learn==1.5.2
xgboost==2.1.1
lightgbm==4.5.0
2 changes: 2 additions & 0 deletions src/consts.py
@@ -5,3 +5,5 @@
CPU_COUNT: int = 8
PARALLEL_CHUNK_SIZE: int = 5
DATE_FORMAT: str = '%Y-%m-%d %H:%M:%S'

RANDOM_STATE: int = 42