Require some neural network construction guide to use Asym #671

Open
svjack opened this issue Jan 7, 2021 · 26 comments

Comments

@svjack

svjack commented Jan 7, 2021

I think the documentation should have a section that discusses how to construct Asym models for different datasets and tasks.
Even just releasing a paper or some guideline material on the different constructions would help.

@nreimers
Member

nreimers commented Jan 7, 2021

@svjack Yes, some documentation will follow soon. I still have to figure out the best construction and initialization for asym. models. Until then, there is an example in the release notes.

@LeenaShekhar

LeenaShekhar commented Jan 12, 2021

Hi Nils,

I used the Asym model based on the example in the release notes. On my custom task, EmbeddingSimilarityEvaluator performance (cosine-similarity Spearman) dropped from 0.7348 to 0.0136. Have you observed such a drastic performance drop from this switch on any of your tasks? Nothing else changed (dataset, training regime, etc.). I used a custom pretrained xlm-roberta-large with mean pooling and fine-tuned it together with the added layer(s).

dense_model_A = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=output_embed_dim, activation_function=nn.Tanh())

dense_model_B = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=output_embed_dim, activation_function=nn.Tanh())

asym_model = models.Asym({'QRY': [dense_model_A], 'DOC': [dense_model_B]})
model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])

Thanks.

@nreimers
Member

Hi @LeenaShekhar
Training the asym. models can be tricky and I don't have a recommended solution yet.

Both models share the same transformer network in that example. Hence, at the start, if a sentence for 'QRY' and 'DOC' is identical, they are mapped to the same point in vector space.

However, as the dense layers are different, they are then moved to completely different locations in vector space. Backpropagation updates not only the dense layer, but also the transformer layers. So the optimizer fails to find a nice configuration for a shared transformer layer but different dense layers.

Here are some options for solving it:

  • Initialize dense_model_A and dense_model_B with the same weights, so that there is no difference between them at the beginning. You can initialize them either with the same (random) weights or with a torch.eye() matrix (i.e. the dense layer will not change the embedding at the beginning).
  • Instead of a shared transformer layer, use two independent transformer, pooling, and dense layers:
asym_model = models.Asym({'QRY': [word_a, pooling_a, dense_model_A], 'DOC': [word_b, pooling_b, dense_model_B]})
model = SentenceTransformer(modules=[asym_model])
  • You can also try freezing the shared transformer layer first so that the two Dense layers can learn a mapping, and then unfreeze it so that the transformer layer is updated as well.
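
The first option can be illustrated with a minimal, dependency-free sketch. It assumes a dense layer computes activation(W x + b); `DenseLayer` and `identity_dense` are hypothetical stand-ins written for this illustration and are not part of sentence-transformers. With the real library, the same idea would mean copying identical weights into both models.Dense modules before training.

```python
import math


class DenseLayer:
    """Toy dense layer: y = activation(W x + b). Hypothetical stand-in for models.Dense."""

    def __init__(self, weight, bias, activation=math.tanh):
        self.weight = [row[:] for row in weight]  # copy so the two towers do not share storage
        self.bias = bias[:]
        self.activation = activation

    def forward(self, x):
        out = []
        for row, b in zip(self.weight, self.bias):
            z = sum(w * xi for w, xi in zip(row, x)) + b
            out.append(self.activation(z))
        return out


def identity_dense(dim, activation=math.tanh):
    """Square layer with identity weights and zero bias (the torch.eye() idea)."""
    eye = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    return DenseLayer(eye, [0.0] * dim, activation)


pooled = [0.2, -0.5, 0.9]        # pretend output of the shared pooling layer
dense_qry = identity_dense(3)    # tower used for 'QRY'
dense_doc = identity_dense(3)    # tower used for 'DOC'

qry_vec = dense_qry.forward(pooled)
doc_vec = dense_doc.forward(pooled)
same_at_start = qry_vec == doc_vec  # identical text gives identical embeddings before training
```

One detail worth noticing: with a Tanh activation, as in the snippet earlier in the thread, identity weights produce tanh(x) rather than x itself, so the embedding is only left unchanged when the activation is the identity function.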

@LeenaShekhar

LeenaShekhar commented Jan 13, 2021

Thanks Nils. These points make sense; I was thinking along the same lines and am running experiments similar to options 1 and 3. For option 2, where you suggested using two independent transformers, I made the code change you mentioned but am getting the following error:

File "/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 524, in fit
data = next(data_iterator)
File "/home/string/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/home/string/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/string/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/ads-nfs-2/string//lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 394, in smart_batching_collate
tokenized = self.tokenize(texts[idx])
File "/ads-nfs-2/string//lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 325, in tokenize
return self._first_module().tokenize(text)

File "/home/string/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 779, in __getattr__
type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'Asym' object has no attribute 'tokenize'

Let me know if you have an idea why this is happening; I will look at the code in detail too.

@nreimers
Member

@LeenaShekhar
Thanks for pointing this out. The Asym model did not have a tokenize method.

I have added this method to the model, but it is not yet part of the PyPI release. You can install the package from source.

@LeenaShekhar

LeenaShekhar commented Jan 13, 2021

Thank you so much. I have tested it, and it seems to work now.

@kenkyusha

Did this work for training in the end? I tried out the functionality for encoding and querying, but I did not manage to train a model based on the example provided. I got the following error:
TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'sentence_transformers.readers.InputExample.InputExample'>

Is there an example already available, or any tips for training models in the asym fashion?

@nreimers
Member

It looks like you are passing an InputExample in the wrong place. For encoding, you don't have to wrap the text in an InputExample; just pass a dict in the format {'doc': 'your text'}.

@kenkyusha

> Looks like you pass an InputExample at the wrong place. For encoding, you don't have to wrap it into an InputExample, just pass a dict in the format {'doc': 'your text'}

This is not the issue; encoding works very well.
The problem occurs when I try to fit a model.
I have this list of input examples:
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8), InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
which I then pass into a DataLoader:
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
The error appears when calling the fit function:
model.fit(train_objectives=[(train_dataloader)], evaluator=evaluator, epochs=num_epochs, evaluation_steps=1000, warmup_steps=warmup_steps)

@nreimers
Member

Instead of strings, you have to use dicts in the InputExamples
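
A small sketch of what "use dicts" could look like, assuming the Asym module was built with the keys 'QRY' and 'DOC' as in the earlier example. `to_asym_texts` is a hypothetical helper written for this illustration; it only builds plain Python data, and each resulting pair would then be passed as the `texts` argument of an InputExample.

```python
def to_asym_texts(query, document, query_key="QRY", doc_key="DOC"):
    """Wrap each sentence in a dict whose key names the Asym branch that should encode it."""
    return [{query_key: query}, {doc_key: document}]


raw_pairs = [
    ("My first sentence", "My second sentence", 0.8),
    ("Another pair", "Unrelated sentence", 0.3),
]

# Each entry is (texts, label). With sentence-transformers installed this would become
# InputExample(texts=texts, label=label).
asym_pairs = [(to_asym_texts(q, d), label) for q, d, label in raw_pairs]
```

The keys have to match the ones the Asym module was constructed with. Separately, fit expects each training objective as a (dataloader, loss) tuple, so the dataloader above would also need to be paired with a loss such as losses.CosineSimilarityLoss(model).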

@youcefjd

To get this straight: the model is comparing 'my first sentence' to 'my second sentence'? It learns that the correlation between the two is 80%?

@nreimers
Member

nreimers commented Mar 30, 2021

> the model is comparing 'my first sentence' to 'my second sentence'?

Yes

> It learns that the correlation between the two is 80%?

Not the correlation but the cosine similarity.
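
For readers unsure of the distinction, here is a minimal sketch of cosine similarity on two toy vectors. The vectors are made up for illustration and are not real sentence embeddings; the label in an InputExample (0.8 in the example above) plays the role of the target value for this quantity.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


emb_first = [1.0, 0.0]    # toy embedding for 'My first sentence'
emb_second = [0.8, 0.6]   # toy embedding for 'My second sentence'
score = cosine_similarity(emb_first, emb_second)  # approximately 0.8
```

Cosine similarity depends only on the angle between two vectors, whereas a correlation coefficient is computed over many paired observations, which is why the two terms are not interchangeable here.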

@youcefjd

What if I want to train the model on a large corpus of data w/o giving it a label (as I don't know the ground truth) - with the goal to familiarize it with the type of data I have, how should I approach the problem?

@youcefjd

I am trying to use the model to tackle an entity-resolution problem - my goal is to fine-tune the model on a very large dataset, with the goal of making it familiar with the structure of the data (which is a concatenation of numerous text columns) and then, extract embedding vectors from the entire dataset and find those records with the highest cosine similarity score, and group them together as one single entity.

Is that approach feasible with sentence-transformers or am I getting it entirely wrong? My concern is that, because it relies on the semantics, if I concatenate large sentences together, the model will lose its power.
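
The grouping step described above could be sketched as follows, under stated assumptions: the embeddings are toy vectors, the 0.95 threshold is arbitrary, and `group_by_similarity` is a hypothetical helper. It links any two records whose cosine similarity reaches the threshold and returns the connected groups.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def group_by_similarity(embeddings, threshold):
    """Union-find over all pairs: records above the threshold end up in the same group."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0]]
entities = group_by_similarity(toy_embeddings, threshold=0.95)
```

The pairwise loop is quadratic in the number of records, so for a very large dataset it would likely need to be replaced by some form of nearest-neighbour search.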

@svjack
Author

svjack commented Mar 31, 2021

> I am trying to use the model to tackle an entity-resolution problem - my goal is to fine-tune the model on a very large dataset, with the goal of making it familiar with the structure of the data (which is a concatenation of numerous text columns) and then, extract embedding vectors from the entire dataset and find those records with the highest cosine similarity score, and group them together as one single entity.
>
> Is that approach feasible with sentence-transformers or am I getting it entirely wrong? My concern is that, because it relies on the semantics, if I concatenate large sentences together, the model will lose its power.

You mention a “concatenation of numerous text columns”.
Is this the structure of a database table? Can you point me to a concrete project or a paper based on this kind of data structure?

@nreimers
Member

> What if I want to train the model on a large corpus of data w/o giving it a label (as I don't know the ground truth) - with the goal to familiarize it with the type of data I have, how should I approach the problem?

Without labels, it will not work well. The best unsupervised approaches are often only on par with pre-trained models.

We will soon release some code that allows training without labels. How much it improves results depends heavily on the domain.

@svjack
Author

svjack commented Mar 31, 2021

> > What if I want to train the model on a large corpus of data w/o giving it a label (as I don't know the ground truth) - with the goal to familiarize it with the type of data I have, how should I approach the problem?
>
> Without labels, it will not work well. The best unsupervised approaches are often only on-par to pre-trained models.
>
> We will soon release some code that allows to train without labels. The improvement depends extremely on the domain.

I hope this does not just mean using a cross-encoder to generate labels.

@svjack
Author

svjack commented Mar 31, 2021

> > What if I want to train the model on a large corpus of data w/o giving it a label (as I don't know the ground truth) - with the goal to familiarize it with the type of data I have, how should I approach the problem?
>
> Without labels, it will not work well. The best unsupervised approaches are often only on-par to pre-trained models.
>
> We will soon release some code that allows to train without labels. The improvement depends extremely on the domain.

Which unsupervised method will you use?
Can you point me to some papers?
Thanks

@nreimers
Member

Here are two papers on unsupervised sentence embedding learning:
https://openreview.net/forum?id=Ov_sMNau-PF
https://arxiv.org/abs/2006.03659

@svjack
Author

svjack commented Apr 1, 2021

> Here are two papers on unsupervised sentence embeddings learning:
> https://openreview.net/forum?id=Ov_sMNau-PF
> https://arxiv.org/abs/2006.03659

Sparse vector features such as BM25 or TF-IDF vectors do not seem to be usable for search directly.
With SVD (or an autoencoder) the vectors make some sense, but they only retain topic features.
These unsupervised sparse features seem to be worse than using the BM25 score from a search engine.
Is there a method to produce sparse features that can beat the BM25 score?
Or is there a project that learns to transform sparse features into dense ones that work like SBERT?
Not simply by a weighted average of word embeddings, but in a way that retains some sequence information and is trained in a supervised way?
Maybe a metric-to-metric learning model (aligning the sparse metric to the dense metric produced by SBERT)?
Does this kind of metric-supervised model exist?

@nreimers
Member

nreimers commented Apr 2, 2021

BM25 and TF-IDF work very well for search and are quite hard to beat in a general setup across all possible use cases.
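
As a reference point for the BM25 score discussed here, this is a compact sketch of Okapi BM25 scoring in plain Python. The parameter values k1=1.5 and b=0.75 are commonly used defaults assumed for illustration, the IDF uses the smoothed "+1 inside the log" variant, and the tiny whitespace-tokenized corpus is made up.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


corpus = [
    "asym models use separate dense layers".split(),
    "bm25 is a strong lexical search baseline".split(),
    "dense retrieval encodes queries and documents".split(),
]
scores = bm25_scores("bm25 search".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)  # index of the top-ranked document
```

The score is a relevance value between a query and a document computed from term statistics, which is a different kind of object from a fixed-size sparse vector per document.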

@rc19

rc19 commented May 17, 2022

> Hi @LeenaShekhar Training the asym. models can be tricky and I don't have a recommended solution yet.
>
> Both models share the same transformer network in that example. Hence, at the start, if a sentence for 'QRY' and 'DOC' is identical, they are mapped to the same point in vector space.
>
> However, as the dense layers are different, they are then moved to completely different locations in vector space. Backpropagation updates not only the dense layer, but also the transformer layers. So the optimizer fails to find a nice configuration for a shared transformer layer but different dense layers.
>
> Here are some options how to solve it:
>
>   • Initialize dense_model_A and dense_model_B with the same weights. Hence, in the beginning, there is no difference. You can initialize it either with the same (random) weights or with a torch.eye() matrix (i.e. dense layer will not change the embedding at the beginning)
>   • Instead of shared transformer layer, use two independent transformer, pooling, and dense layer:
> asym_model = models.Asym({'QRY': [word_a, pooling_a, dense_model_A], 'DOC': [word_b, pooling_b, dense_model_B]})
> model = SentenceTransformer(modules=[asym_model])
>   • You can also try to first freeze the shared transformer layer so that the two Dense layers can learn a mapping. Then allow to also update the transformer layer.

If my understanding is correct, can the asym model class be used to train a DPR model (different query/context encoder) using the MNRL loss (with different transformer backbones and without a dense model after the pooling layer)?

@nreimers
Member

@rc19 Yes. But note that a single model for query / context works better than the 2 models as used in DPR
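
For reference, the idea behind the MNRL (MultipleNegativesRankingLoss) objective mentioned in the question can be sketched without any library. This is a simplified illustration rather than the library's implementation: it assumes cosine similarity as the scoring function and a scale of 20, and the toy embeddings are made up. For query i, document i is treated as the positive and every other document in the batch as a negative.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mnrl_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy over the batch with in-batch negatives."""
    total = 0.0
    for i, q in enumerate(query_embs):
        logits = [scale * cosine_similarity(q, d) for d in doc_embs]
        max_logit = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the positive pair
    return total / len(query_embs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own document
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong document

loss_aligned = mnrl_loss(queries, aligned_docs)
loss_swapped = mnrl_loss(queries, swapped_docs)
```

The loss is close to zero when each query already ranks its own document first and grows large when the pairing is wrong. Nothing in the objective itself requires separate encoders, so it applies equally to a single shared model or to two Asym branches.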

@rc19

rc19 commented May 17, 2022

> @rc19 Yes. But note that a single model for query / context works better than the 2 models as used in DPR

Are you aware of any references to existing work on this, off the top of your head?

@Muennighoff
Contributor

Muennighoff commented May 17, 2022

> > @rc19 Yes. But note that a single model for query / context works better than the 2 models as used in DPR
>
> Are you aware of any references for some existing work on it on the top of your mind?

Tried it in this paper & found asym to be worse:
[Image: Screenshot 2022-05-17 at 19 41 47]

@Permafacture

So what I'm understanding from this is that, although the DPR examples in the documentation use separate context and question encoders (facebook-dpr-ctx_encoder-single-nq-base and facebook-dpr-question_encoder-single-nq-base), if we wanted to fine-tune a model for that purpose we'd be better off using a single model. Maybe a Multi-QA model would be a good base model to start from, using the MNRL loss.

Is this correct?

If that's the case, I'm looking for a guide to help me do that


8 participants