Merge pull request #3 from JeremyBrent/jb/price_prediction_model
Jb/price prediction model
JeremyBrent authored Sep 27, 2024
2 parents 154f906 + b9e4a82 commit 07793c3
Showing 12 changed files with 763 additions and 120 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,2 +1,3 @@
venv
.idea
__pycache__
102 changes: 87 additions & 15 deletions README.md
@@ -18,11 +18,11 @@ I didn't require approvers on the branch protection rule due to the fact that th
to review my code .... this would not be the case in a production environment and that would be
a rule in said production environment.

With more time, some things I would build upon would be:
1. Added a comprehensive logging functionality, this is critical to production-worthy code
2. Expanding unittest portfolio would need to build out
3. Further developing the Github actions if we were deploying this model as a service

With more time, I would:
1. Add comprehensive logging functionality; this is critical to production-worthy code
2. Expand the unittest portfolio
3. Further develop the GitHub Actions if we were deploying this model as a service
4. Complete any todos noted throughout the codebase

# FSA

@@ -37,19 +37,17 @@ a future user to need to obtain in order to run this code base
### Future Directions
Ground truth data should be augmented with datasets found
[here](https://dl.acm.org/doi/10.1145/3649451#sec-4-2).
Most notably, Financial PhraseBank is one primary dataset for financial area
Most notably, Financial PhraseBank is one of the primary datasets used for financial
sentiment analysis ([Ding et al., 2022](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-15);
[Ye, Lin & Ren, 2021](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-50)),
which was created by [Malo et al. (2014)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-33).
Financial PhraseBank contains 4,845 news sentences found on the LexisNexis database and marked
Financial PhraseBank contains 4,845 news sentences found on the LexisNexis database, annotated
by 16 people with finance backgrounds. Annotators were required to label the sentence as positive,
negative, or neutral market sentiment
[Malo et al. (2014)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/#ref-33).
All 4845 sentences were kept with higher than 50% agreement

It is also critical that we further analyze the ground truth data to assert that it is accurate.
The 4,845 sentences in the dataset all had higher than 50% inter-annotator agreement.

To construct a more robust system, it's critical that we move away for csv files in Github
To construct a more robust system, it's critical that we move away from csv files in Github
to a database. I contemplated implementing a local postgres DB to store the ground truth data,
but determined that that would be out of scope of this project.

@@ -59,12 +57,12 @@ Models tested were derived from [this literature review](https://dl.acm.org/doi/
For example, FinBert was directly mentioned [here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4-5)
and VADER was discussed [here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4-4). Finbert and Roberta
were two of the top performing models discussed in this literature review [Du et al. (2024)](https://dl.acm.org/doi/10.1145/3649451#tab3),
and used as a top performer in this research [Xiao et al. (2023)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/).
and Finbert was used as a top performer in this research [Xiao et al. (2023)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/).


### Run
Running experiments for choosing the most accurate FSA model on the ground truth data, defined
[here](#data), can be triggered using the following:
To run experiments that determine the most accurate FSA model on the ground truth data, defined
[here](#data), run the following:
```python
from src.experiment import Experiment

@@ -84,8 +82,82 @@ More FSA models can be experimented on. To include more models in the `Experimen
add the model to `experimenter.models` and any new key-value pairs that are needed to run
inference with the new model.

Any new models should be replicated based on existing research found
Any new models should be replicated based on existing research. Some of these models can be found
[here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4).

We should also implement more sophisticated metrics for
measuring the performance of the FSA models. Currently, we are only using raw accuracy.
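For illustration, a macro-averaged F1 score could be reported alongside raw accuracy. The snippet below is a minimal sketch with placeholder labels, not code from this repository:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder gold labels and predictions, for illustration only
y_true = ["positive", "neutral", "negative", "neutral", "positive"]
y_pred = ["positive", "neutral", "positive", "neutral", "negative"]

print("accuracy:", accuracy_score(y_true, y_pred))
# Macro F1 weights each class equally, which matters if neutral dominates the data
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```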

# Price Prediction Model (PP)

## Model

### Description
The Price Prediction model is trained to perform a binary classification that determines whether
the price will end higher or lower for a given day.
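A minimal sketch of how such a binary target could be constructed, assuming "higher or lower" means the close relative to the open for the same day (the repo's actual target definition may differ):

```python
import yfinance as yf

# Download daily OHLC data and label each day 1 if it closed above its open, else 0.
# This is an illustrative assumption, not the repo's own target construction.
df = yf.download("AAPL", period="5y", interval="1d")
df["target"] = (df["Close"] > df["Open"]).astype(int)
```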

### Performance
<p id="suspect-data">Our highest performing model was a RandomForestClassifier with a test accuracy score around 72%.
A pretty decent score consdering the scope of this project. However, this model performed
significantly better on the test set, almost 20% better, this can be seen in
`./experiments/experiments.csv`, which is suspect ... This
will need to be investigated for data leakage, changes in data distributions between the test
set and the train set, etc.</p>

### Features
The current features of the model (feature extraction is discussed further [below](#feature-extract))
include a [50 and 200 day Simple Moving Average](https://www.investopedia.com/terms/s/sma.asp),
the [Relative Strength Index](https://www.investopedia.com/terms/r/rsi.asp),
[On Balance Volume](https://www.investopedia.com/terms/o/onbalancevolume.asp),
the [Upper, Middle & Lower Bollinger Bands](https://www.investopedia.com/terms/b/bollingerbands.asp) and
a Normalized Sentiment Score of news data.
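A rough sketch of how these indicators could be computed with `pandas_ta` (pinned in `requirements.txt`); the column names and parameters below are illustrative assumptions, not the repo's implementation in `src.Data._price_feature_extraction()`:

```python
import pandas_ta as ta
import yfinance as yf

df = yf.download("AAPL", period="5y", interval="1d")

df["sma_50"] = ta.sma(df["Close"], length=50)     # 50 day Simple Moving Average
df["sma_200"] = ta.sma(df["Close"], length=200)   # 200 day Simple Moving Average
df["rsi"] = ta.rsi(df["Close"], length=14)        # Relative Strength Index
df["obv"] = ta.obv(df["Close"], df["Volume"])     # On Balance Volume
bbands = ta.bbands(df["Close"], length=20)        # lower, middle and upper Bollinger Bands
df = df.join(bbands)
```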

The Normalized Sentiment Score equals
`sentiment / (original date - effective date + 1)`, where the original date is the date on which
the news was published and the effective date is the theoretical date when this news will next
affect the market. Sentiment is a value between `-1.0` and `1.0` depending on the category (negative or positive)
and is normalized to a value between `-.05` and `.05` if the most salient category is neutral.
Calculation of the effective date can be found at `src.Utils.convert_datetime()` and extraction of
the sentiment can be found at `src.Model._fsa_extract_results()`.
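A minimal sketch of this calculation, interpreting the denominator as the day gap between publication and the effective date plus one; the function below is hypothetical and is not the code in `src.Utils.convert_datetime()` or `src.Model._fsa_extract_results()`:

```python
from datetime import datetime


def normalized_sentiment(sentiment: float, original_date: datetime, effective_date: datetime) -> float:
    """Decay the sentiment score by the number of days until the news can affect the market."""
    day_gap = abs((effective_date - original_date).days)
    return sentiment / (day_gap + 1)


# News published Friday evening first affects Monday's session (illustrative dates)
print(normalized_sentiment(0.8, datetime(2024, 9, 20, 18, 0), datetime(2024, 9, 23, 9, 30)))
```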

### Run

#### Inference
TODO: ADD SECTION HERE

#### Experiment
To run experiments that find the best-performing Price Prediction model, run the following code:
```python
from src.experiment import Experiment
exp = Experiment()
exp.pp_experiment(ticker='AAPL', period='5y')
```
The code above will perform a grid search for hyperparameter tuning over various models, select the
best model with the best hyperparameters, and save that model to disk.
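For illustration, the grid search could look roughly like the sketch below. The parameter grid mirrors values recorded in `experiments/experiments.csv`, but the placeholder data and the exact grid are assumptions, not the contents of `exp.pp_models`:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder features and binary target; in practice these come from the feature extraction code
rng = np.random.default_rng(42)
X_train = rng.random((100, 8))
y_train = rng.integers(0, 2, size=100)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3], "max_features": [5, "sqrt"]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Persist the best estimator, as the README describes (path is illustrative)
with open("models/RandomForestClassifier.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```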

### Future Directions

1. <p id="feature-extract">We will need to run more through experimentation on our features to determine if any need to
added or removed. </p>
- Some things that need to be determined are correlations between features. For
testing numerical features, Pearson Correlation Coefficient or Spearman or Kendall Correlation
(for Non-linear Relationships) can be used. For categorical data, a Chi Square test can be used.
Tree based models are less sensitive to Multicollinearity compared to a logistic regression, but we
should still have a sense of the distributions of our training data. Multicollienarity can
result in unstable coefficients, where changes to the correlated features can have significant
impacts on model performance, or over-fitting, where the model simply learns the same pattern from
many of the features.
- To add more or change features in our model, you will need to update the following code:
`src.Data._price_feature_extraction()`, `src.Data._news_feature_extraction()`
and `src.Model.pp_extract_features()`

2. We should test more models; the current solution uses no neural network architectures.
- To add more models to our experimentation class, simply add the model and its respective key-value pairs
to `exp.pp_models`.

3. Investigate the significantly higher performance of the RandomForest model on the test set compared to the train set,
[mentioned above](#suspect-data).

4. We had issues running GridSearch on XGBoost and LightGBM, where we were getting the error
`Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)`. This would need to be more
thoroughly debugged. These models are currently commented out in `exp.pp_models` due to this.
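As a rough sketch of the correlation checks mentioned in item 1 above (using synthetic placeholder data, not the repo's feature set):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, pearsonr, spearmanr

# Synthetic numerical features, with one deliberately correlated pair
rng = np.random.default_rng(42)
df = pd.DataFrame({"sma_50": rng.normal(size=200), "rsi": rng.normal(size=200)})
df["obv"] = 0.7 * df["sma_50"] + rng.normal(scale=0.5, size=200)

print(pearsonr(df["sma_50"], df["obv"]))   # linear correlation
print(spearmanr(df["rsi"], df["obv"]))     # rank-based (monotonic) correlation

# For two categorical features, a chi-square test of independence
cat_a = rng.integers(0, 2, size=200)
cat_b = rng.integers(0, 3, size=200)
chi2, p_value, dof, _ = chi2_contingency(pd.crosstab(cat_a, cat_b))
print(chi2, p_value)
```
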
Binary file added data/AAPL_news_training_data.pkl
16 changes: 16 additions & 0 deletions experiments/experiments.csv
@@ -0,0 +1,16 @@
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:36:52,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:36:52,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:36:52,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:39:31,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:39:31,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:39:31,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:42:04,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:42:04,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:42:04,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
,date,model,train_accuracy,test_accuracy,params
0,2024-09-27 01:43:06,LogisitcRegression,0.5476190476190477,0.5454545454545454,"{'C': 0.001, 'penalty': 'l2'}"
0,2024-09-27 01:43:06,RandomForestClassifier,0.5238095238095237,0.7272727272727273,"{'bootstrap': True, 'max_depth': None, 'max_features': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}"
0,2024-09-27 01:43:06,GradientBoostingClassifier,0.5238095238095237,0.5454545454545454,"{'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50, 'subsample': 0.8}"
Binary file added models/RandomForestClassifier.pkl
10 changes: 7 additions & 3 deletions requirements.txt
@@ -1,9 +1,13 @@
yfinance==0.2.3
nltk
textblob
nltk==3.9.1
textblob==0.18.0.post0
transformers
torch==2.2.2
scipy==1.13.1
numpy==1.26.4
ipython==8.18.1
pandas_ta==0.3.14b
pandas_ta==0.3.14b
urllib3==1.25.11
scikit-learn==1.5.2
xgboost==2.1.1
lightgbm==4.5.0
2 changes: 2 additions & 0 deletions src/consts.py
@@ -5,3 +5,5 @@
CPU_COUNT: int = 8
PARALLEL_CHUNK_SIZE: int = 5
DATE_FORMAT: str = '%Y-%m-%d %H:%M:%S'

RANDOM_STATE: int = 42