Require some neural network construction guide to use Asym #671
Comments
@svjack Yes, some documentation will follow soon. I still have to figure out the best construction and initialization for asymmetric models. Until then, an example is in the release notes. |
Hi Nils, I used the Asym model based on the example in the release notes. On my custom task, EmbeddingSimilarityEvaluator performance (cosine similarity, Spearman) went from .7348 to .0136. Have you observed such a drastic performance drop from this switch on any of your tasks? Nothing else changed: same dataset, same training regime, etc. I used a custom pretrained xlm-roberta-large with mean pooling and fine-tuned it along with the added layer(s).
dense_model_A = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=output_embed_dim, activation_function=nn.Tanh())
Thanks. |
Hi @LeenaShekhar Both models share the same transformer network in that example. Hence, at the start, if a sentence for 'QRY' and 'DOC' is identical, both are mapped to the same point in vector space. However, as the dense layers are different, they are then moved to completely different locations in vector space. Backpropagation updates not only the dense layers but also the transformer layers, so the optimizer fails to find a good configuration for a shared transformer with different dense layers. Here are some options for solving it:
|
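To make the explanation above concrete, here is a minimal sketch in plain Python (no sentence-transformers required). It assumes a made-up 8-dimensional pooled embedding and two randomly initialized tanh dense heads standing in for the 'QRY' and 'DOC' routes; the sizes, the seed, and the helper names are illustrative assumptions, not the library's implementation.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical embedding size, chosen only for illustration


def random_matrix(rows, cols):
    """Randomly initialized weight matrix, like a fresh dense layer."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def dense_tanh(weights, vector):
    """One dense layer with tanh activation and no bias."""
    return [math.tanh(sum(w * x for w, x in zip(row, vector))) for row in weights]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in for the pooled output of the shared transformer for ONE sentence.
shared_embedding = [random.gauss(0.0, 1.0) for _ in range(DIM)]

head_qry = random_matrix(DIM, DIM)  # hypothetical 'QRY' dense layer
head_doc = random_matrix(DIM, DIM)  # hypothetical 'DOC' dense layer

# Same sentence through the same head: identical vectors.
same_head = cosine(dense_tanh(head_qry, shared_embedding),
                   dense_tanh(head_qry, shared_embedding))
# Same sentence through different heads: unrelated vectors.
cross_head = cosine(dense_tanh(head_qry, shared_embedding),
                    dense_tanh(head_doc, shared_embedding))

print(f"same head:  {same_head:.3f}")
print(f"cross head: {cross_head:.3f}")
```

Even for an identical input sentence, the cross-head similarity is essentially arbitrary at initialization, which is consistent with a similarity evaluator score collapsing right after the dense heads are added.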
Thanks Nils. These points make sense; I was thinking along the same lines and am running experiments similar to options 1 and 3. For option 2, where you talked about using two independent transformers, I made the code change as you mentioned but am getting the following error:
Let me know if you have an idea why this is happening; I will look at the code in detail too. |
@LeenaShekhar I added this method to the model, but it is not yet part of the PyPI release. You can install the package from source here |
Thank you so much. I have tested it, and it seems to work now. |
Did this work for training in the end? I tried out the functionality for encoding and querying, but I did not manage to train a model based on the example provided and got the following error: Is there an example already available, or are there any tips for training models in an asymmetric fashion? |
Looks like you pass an InputExample at the wrong place. For encoding, you don't have to wrap it into an InputExample, just pass a dict in the format {'doc': 'your text'} |
This is not an issue, encoding works very well. |
Instead of strings, you have to use dicts in the InputExamples |
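As a rough sketch of what "dicts instead of strings" could look like, the helper below builds one keyed text pair in plain Python. The function name `make_asym_pair` is hypothetical, and the keys 'QRY' and 'DOC' are assumed to match whatever keys the Asym module was constructed with (an earlier reply uses 'doc', so adjust accordingly).

```python
def make_asym_pair(query, document, label):
    """Return (texts, label); each text is a one-key dict naming its route.

    The keys 'QRY' and 'DOC' are assumptions: they must match the keys
    used when the Asym module was built.
    """
    texts = [{"QRY": query}, {"DOC": document}]
    return texts, float(label)


texts, label = make_asym_pair("my first sentence", "my second sentence", 0.8)
print(texts, label)

# With sentence-transformers installed, the pair would then be wrapped as
#   InputExample(texts=texts, label=label)
# for training, while for encoding the dicts are passed directly, without
# an InputExample:
#   model.encode(texts)
```

The point of the dict wrapper is only to tell the model which route a given text should take; the text itself is unchanged.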
To get this straight: the model is comparing 'my first sentence' to 'my second sentence'? It learns that the correlation between the two is 80%? |
Yes
Not the correlation but the cosine similarity. |
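To illustrate what a label of 0.8 means here, the sketch below computes the cosine similarity of two toy vectors and the squared error against the target. The vectors are invented for illustration, and using squared error as the training signal is an assumption in the spirit of a cosine-similarity regression loss, not a statement about the exact loss used in this thread.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for 'my first sentence' and 'my second sentence'.
emb_first = [1.0, 0.0]
emb_second = [0.8, 0.6]

target = 0.8  # the label from the training example
predicted = cosine_similarity(emb_first, emb_second)
squared_error = (predicted - target) ** 2

print(f"cosine similarity: {predicted:.4f}")
print(f"squared error vs. label: {squared_error:.6f}")
```

Training pushes the two embeddings so that their cosine similarity approaches the label; it does not estimate a statistical correlation between the sentences.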
What if I want to train the model on a large corpus of data without giving it labels (as I don't know the ground truth), with the goal of familiarizing it with the type of data I have? How should I approach the problem? |
I am trying to use the model to tackle an entity-resolution problem. My goal is to fine-tune the model on a very large dataset to make it familiar with the structure of the data (which is a concatenation of numerous text columns), then extract embedding vectors from the entire dataset, find the records with the highest cosine similarity scores, and group them together as one single entity. Is that approach feasible with sentence-transformers, or am I getting it entirely wrong? My concern is that, because the model relies on semantics, it will lose its power if I concatenate large sentences together. |
You say “concatenation of numerous text columns” |
Without labels, it will not work well. The best unsupervised approaches are often only on par with pre-trained models. We will soon release some code that allows training without labels. The improvement depends heavily on the domain. |
I hope this does not mean using a cross-encoder to generate labels. |
Which unsupervised method will you use? |
Here are two papers on unsupervised sentence embeddings learning: |
Sparse vector features such as BM25 or TF-IDF vectors seemingly cannot be used for search directly. |
BM25 and TF-IDF are perfect for search and are quite hard to beat in the general setup across all possible use cases. |
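For readers unfamiliar with BM25, here is a minimal sketch of lexical ranking in plain Python. It assumes whitespace tokenization, the commonly used parameters k1=1.5 and b=0.75, and a toy three-document corpus; it is a textbook-style illustration, not the implementation of any particular search library.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "asymmetric models use different encoders for queries and documents",
    "bm25 is a strong lexical baseline for search",
    "mean pooling averages token embeddings",
]
scores = bm25_scores("bm25 search baseline", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print("best match:", docs[best])
```

No dense vector index is needed here: the sparse term statistics are used directly to rank documents for a query.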
If my understanding is correct, can the asym model class be used to train a DPR model (different query/context encoder) using the MNRL loss (with different transformer backbones and without a dense model after the pooling layer)? |
@rc19 Yes. But note that a single model for query / context works better than the 2 models as used in DPR |
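As a sketch of what the MNRL objective (multiple negatives ranking loss) computes, the plain-Python example below treats document i as the positive for query i and every other document in the batch as a negative. The toy 2-dimensional vectors stand in for the outputs of the query and context encoders, and the scale of 20 with cosine similarity is an assumed setting for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnrl_loss(query_vecs, doc_vecs, scale=20.0):
    """Cross-entropy over in-batch similarities; doc i is the positive for query i."""
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own document
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong document

loss_aligned = mnrl_loss(queries, aligned)
loss_swapped = mnrl_loss(queries, swapped)
print(f"aligned batch loss: {loss_aligned:.4f}")
print(f"swapped batch loss: {loss_swapped:.4f}")
```

The loss only sees the two sets of embeddings, so it applies equally whether they come from one shared encoder or from two separate backbones.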
Off the top of your head, are you aware of any references to existing work on this? |
Tried it in this paper & found asym to be worse: |
So what I understand from this is that, although the DPR examples in the documentation use separate context and question encoders (facebook-dpr-ctx_encoder-single-nq-base and facebook-dpr-question_encoder-single-nq-base), we would be better off using a single model if we wanted to fine-tune a model for that purpose. Maybe a Multi-QA model would be a good base model to start from, using the MNRL loss. Is this correct? If so, I'm looking for a guide to help me do that |
I think the documentation should have a section discussing different ways to construct Asym models for different datasets and tasks.
Alternatively, you could just release a paper or some guideline material about the different constructions.