Readme #5

Merged: 2 commits, Sep 27, 2024
12 changes: 7 additions & 5 deletions README.md
@@ -59,6 +59,9 @@ and VADER was discussed [here](https://dl.acm.org/doi/10.1145/3649451#sec-4-4-4)
were two of the top performing models discussed in this literature review [Du et al. (2024)](https://dl.acm.org/doi/10.1145/3649451#tab3),
and Finbert was used as a top performer in this research [Xiao et al. (2023)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10403218/).

### Performance
On a small test set, FinBert achieved 74% accuracy, FinRoberta 70%, Roberta 62%, TextBlob 52%, and NLTK 47%.
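The accuracy comparison above could be reproduced with a small helper like the following sketch; the `naive_predict` function is a hypothetical rule-based stand-in, since the real models (FinBert, TextBlob, NLTK, etc.) each expose different APIs:

```python
# Sketch of scoring a sentiment model against a small labeled test set.
# naive_predict is a hypothetical stand-in for any real FSA model.
texts = ["stocks rally on earnings", "shares tumble after report", "flat trading day"]
truth = ["positive", "negative", "neutral"]

def naive_predict(text):
    # toy rule-based classifier, for illustration only
    if "rally" in text:
        return "positive"
    if "tumble" in text:
        return "negative"
    return "neutral"

def accuracy(predict, texts, truth):
    """Fraction of texts the model labels correctly."""
    hits = sum(predict(t) == y for t, y in zip(texts, truth))
    return hits / len(truth)

print(accuracy(naive_predict, texts, truth))
```

Each real model would just need a thin wrapper that maps its raw output onto the same label set before being passed to `accuracy`.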

### Run
To run experiments to determine the most accurate FSA model on the ground truth data, defined
@@ -97,12 +100,11 @@ The Price Prediction model is trained to perform a binary classification to determine whether the
price will end higher or lower for the given day.

### Performance
Our highest performing model was a RandomForestClassifier with a test accuracy score around 72%,
a decent score considering the scope of this project. However, this model performed
significantly better on the test set than on the train set, by almost 20%, as can be seen in
`./experiments/experiments.csv`, which is suspect. This will need to be investigated for data
leakage, changes in data distributions between the test set and the train set, etc.
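A first step in investigating that suspect train/test gap could look like the sketch below: refit a RandomForestClassifier on a clean split, compare train and test scores directly, and check for rows that appear in both sets (exact duplicates are one common source of leakage). The features and labels here are synthetic stand-ins, not the project's actual data:

```python
# Sketch of a leakage check: compare train vs test accuracy on a fresh
# split and look for feature rows shared between the two sets.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))               # stand-in features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # stand-in up/down label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
print("train acc:", model.score(X_tr, y_tr))
print("test acc: ", model.score(X_te, y_te))

# Quick duplicate-row check between splits; any overlap is a red flag.
overlap = {tuple(row) for row in X_tr} & {tuple(row) for row in X_te}
print("overlapping rows:", len(overlap))
```

A test score well above the train score usually points to a split problem (duplicates, a shifted distribution, or labels computed from future information) rather than a genuinely better model.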

### Features
The current features of the model, and I talk more about feature extraction [below](#feature-extract),
@@ -156,7 +158,7 @@ and `src.Model.pp_extract_features()`
to `exp.pp_models`

3. Investigate significantly higher performance in RandomForest test set compared to train set
[mentioned above](#performance)

4. We had issues running GridSearch on XGBoost and LGBoost where we were getting the error:
`Process finished with exit code 139 (interrupted by signal 11: SIGSEGV).` This would need to be more
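One common way to narrow down a SIGSEGV during a grid search is to disable process-based parallelism and surface fit errors directly, so the crash becomes reproducible in a single process. The sketch below uses scikit-learn's GradientBoostingClassifier as a stand-in, since the original XGBoost/LGBoost configurations are not shown here:

```python
# Debugging sketch: run GridSearchCV single-process with errors raised,
# so a crashing fit is attributable to a specific parameter combination.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [10, 20]},
    n_jobs=1,             # single process: no worker subprocess to segfault silently
    error_score="raise",  # fail loudly instead of scoring a failed fit as NaN
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

If the crash disappears with `n_jobs=1`, the problem likely sits in the interaction between the booster's native threads and joblib's worker processes rather than in the model itself.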
1 change: 1 addition & 0 deletions src/experiment.py
@@ -66,6 +66,7 @@ def __init__(self):
'LogisitcRegression': {
'model': LogisticRegression(random_state=RANDOM_STATE),
'params': {
# TODO: add more comprehensive grid search params
"C": np.logspace(-3, 3, 7),
"penalty": ["l1", "l2"]
},
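One caveat with the grid above: scikit-learn's default `lbfgs` solver rejects `penalty="l1"`, so running it as written would error on half the grid. A sketch of a working search, assuming `RANDOM_STATE` is the repo's integer constant and using `liblinear` (which supports both penalties):

```python
# Sketch of the LogisticRegression grid search; liblinear is chosen
# because it supports both the l1 and l2 penalties in the param grid.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

RANDOM_STATE = 42  # assumption: stands in for the repo's constant

X = np.random.default_rng(0).normal(size=(120, 3))
y = (X[:, 0] > 0).astype(int)

grid = GridSearchCV(
    LogisticRegression(random_state=RANDOM_STATE, solver="liblinear"),
    {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

Alternatively, the solver itself could be added to the param grid, pairing each penalty only with solvers that support it.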