
Add files via upload #497

Merged
merged 5 commits into from
Mar 21, 2024

Conversation

@MosheWasserb (Collaborator) commented Mar 5, 2024

Adding a new notebook demonstrating zero-cost, zero-time, zero-shot Financial Sentiment Analysis: from GPT-4/Mixtral to MLP128K with SetFit.
@tomaarsen Could you also send it to Moritz Laurer for review?

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@tomaarsen (Member)

Looks promising at a glance! cc @MoritzLaurer

@@ -0,0 +1,2428 @@
{

@MoritzLaurer MoritzLaurer Mar 6, 2024


would recommend versioning the key libraries to avoid issues with breaking changes in the future
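For instance, a pinned requirements fragment could look like this (the version numbers are illustrative, not taken from the notebook):

```
setfit==1.0.3
datasets==2.18.0
scikit-learn==1.4.1.post1
```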



@MoritzLaurer MoritzLaurer Mar 6, 2024


provide a bit more context where this figure comes from and what it represents. i.e. zeroshot for the 3 generative LLMs and a fine-tuned RoBERTa based on the zeroshot synthetic data from Mixtral (CoT + SC) (1800~ data rows/texts)



@MoritzLaurer MoritzLaurer Mar 6, 2024


use consistent terminology: pseudo labels or synthetic data (or better: explain that pseudo labels are synthetic data)



@MoritzLaurer MoritzLaurer Mar 6, 2024


do you mean "download" instead of "upload"?

would also slightly reformulate to make it clear that by "skip the training step" you mean skipping the code two cells further down. (could maybe even add an if else to let people choose)
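The suggested if/else could be sketched like this (the flag name and returned strings are hypothetical; this only illustrates the control flow, not the actual notebook cells):

```python
# Hypothetical toggle letting readers skip the training cells further down.
TRAIN_FROM_SCRATCH = False  # set to True to run the fine-tuning yourself

def choose_path(train_from_scratch: bool) -> str:
    """Return which notebook path to follow."""
    if train_from_scratch:
        return "run the training cells below"
    return "download the pre-trained checkpoint instead"

print(choose_path(TRAIN_FROM_SCRATCH))
```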



@MoritzLaurer MoritzLaurer Mar 6, 2024


installs should probably all be at the very beginning



@MoritzLaurer MoritzLaurer Mar 6, 2024


interesting, didn't know the WOS approach/metric. I suppose that word order is only one thing that transformers take into account. Another aspect would be semantic (dis)similarity of different strings. With a countvectorizer you only capture the exact words that are in the training corpus, but it can't capture semantically similar words that are outside the training data distribution
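A quick illustration of that limitation, using a tiny made-up corpus: words outside the training vocabulary are silently dropped, however semantically close they are.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus: the vectorizer only learns these exact words
vec = CountVectorizer()
vec.fit(["profits rose sharply", "losses widened again"])

# "earnings soared" is semantically close to the corpus, but every word
# is out-of-vocabulary, so the resulting vector is all zeros
row = vec.transform(["earnings soared"]).toarray()[0]
print(row.sum())  # 0: unseen words contribute nothing
```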



@MoritzLaurer MoritzLaurer Mar 6, 2024


Line #11.    print('The WOS implies that in average {:0.1f}% of the sentences in the financial sentiment analysis (FSA) dataset are rather simple.\n'.format(100-100*WOS))



@MoritzLaurer MoritzLaurer Mar 6, 2024


typo "prbabilities".

I would maybe also add a note somewhere that this is less likely to work on more complex reasoning tasks (I'd assume e.g. that countvectorizers just can't represent complex semantics / classes well enough)
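One concrete way to see why a count vectorizer struggles with complex semantics (illustrative sentences, not from the notebook): two statements with opposite meanings can map to exactly the same bag-of-words vector.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two sentences with opposite meanings but the same bag of words
vec = CountVectorizer()
X = vec.fit_transform([
    "revenue beat estimates but guidance was cut",
    "guidance beat estimates but revenue was cut",
]).toarray()
print((X[0] == X[1]).all())  # True: the representations are identical
```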



@MoritzLaurer MoritzLaurer Mar 6, 2024


interesting! (worth noting that you are also increasing the size of the MLP here, in addition to adding more training data. maybe make that (small) increase in size explicit)


@MoritzLaurer

Looks interesting and good to me. Would assume that this works less well for more complex tasks and the additional step of distilling from the setfit model takes more developer time, but overall a cool approach for further compressing the model and making things much more efficient for inference

@MosheWasserb (Collaborator, Author)

Thanks @MoritzLaurer for the comments. We updated the notebook accordingly.
Would you be interested in a joint post/blog or expanding your original blog with the MLP example?

@MoritzLaurer

> Thanks @MoritzLaurer for the comments. We updated the notebook accordingly. Would you be interested in a joint post/blog or expanding your original blog with the MLP example?

Great! Don't have bandwidth for a joint blog atm unfortunately.
Notebook LGTM @tomaarsen

@MosheWasserb (Collaborator, Author)

Hi @tomaarsen I think we are good to go and merge into main.
Would be great if you could also promote via LinkedIn.

@MosheWasserb (Collaborator, Author)

Hi @tomaarsen Could you merge into main?

@MosheWasserb MosheWasserb self-assigned this Mar 21, 2024
@MosheWasserb MosheWasserb merged commit f97b3ad into main Mar 21, 2024
6 of 18 checks passed
@MosheWasserb MosheWasserb deleted the Sentiment_zeroshot branch March 21, 2024 08:30
@tomaarsen (Member)

@MosheWasserb My apologies for the radio silence here, I was very busy with https://github.com/UKPLab/sentence-transformers/releases/tag/v2.6.0
Very impressive performance on this work.

@MosheWasserb (Collaborator, Author)

MosheWasserb commented Mar 26, 2024

Hi @tomaarsen Sure, no problem.
Great work with the binary embeddings.
Did you know that for SetFit I was able to compress the 768-dim embedding vectors down to 2 dimensions with no accuracy loss?

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Encode with the fine-tuned SetFit model
X_train = model.encode(x_train)
X_eval = model.encode(x_eval)

# PCA: project the 768-dim embeddings down to 2 components
estimator = PCA(n_components=2)
estimator.fit(X_train)

# 2D vectors
X_train_em = estimator.transform(X_train)
X_eval_em = estimator.transform(X_eval)

# Logistic regression as the second-phase classifier
sgd = LogisticRegression()
sgd.fit(X_train_em, y_train)
y_pred_eval_sgd = sgd.predict(X_eval_em)
```

@tomaarsen (Member)

Thank you! PCA remains strong indeed, especially for classification. It doesn't work very well for retrieval, however; there I've had more luck with 1. Matryoshka models and 2. quantization to speed up the comparisons.
