Long sentence embedding #364
The pre-trained models have the max sequence length set to 128 word pieces, but this can be increased if needed. BERT in general has a limit of 510 word pieces. Inputs longer than this will be truncated.
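A minimal sketch of checking the current limit and how many word pieces an input uses, to see whether it would be truncated (the model name and text below are only placeholders):

from sentence_transformers import SentenceTransformer

# Placeholder model; any pre-trained sentence-transformers model works the same way
model = SentenceTransformer('bert-base-nli-mean-tokens')
print(model.max_seq_length)  # 128 for most pre-trained models

# Count word pieces with the underlying tokenizer; anything beyond
# model.max_seq_length is cut off when encoding
tokenizer = model._first_module().tokenizer
print(len(tokenizer.tokenize("A very long input text ...")))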
Thank you. And how can I increase the max sequence length?
Not sure if it is in the last build already, but you can try:
model.max_seq_length = 510
Otherwise, the following works:
model._first_module().max_seq_length = 510
Thank you very much; it helps me a lot.
@nreimers After setting model.max_seq_length = 510, an error occurs. But if I don't set anything about max_seq_length, there is no problem.
Try a smaller value like model.max_seq_length = 500. Some models might add more than 2 special tokens.
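A hedged sketch of computing a safe value from the tokenizer itself instead of guessing (model name is a placeholder; assumes the underlying transformer has 512 positions):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model
tokenizer = model._first_module().tokenizer

# How many special tokens ([CLS], [SEP], ...) are added around a single sequence
n_special = tokenizer.num_special_tokens_to_add(pair=False)

# Leave room for the special tokens within the 512 available positions
model.max_seq_length = 512 - n_special
print(model.max_seq_length)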
@nreimers Great, thank you
@nreimers When I train a sentence embedding like here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py, do you think it makes a difference when I decrease the sequence length of the language model (to 128) that I use for training?
Hi @PhilipMay How many sentences are longer than 128? If this is a small fraction, increasing or decreasing the limit will not change anything. If most sentences are longer than 128, then changing the value can have an impact. The model then just trains on e.g. the first 128 word pieces of the respective sentences.
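A rough sketch of checking what fraction of a training set actually exceeds 128 word pieces (the tokenizer name and train_sentences are placeholders for your own setup):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # placeholder tokenizer
train_sentences = ["first training sentence ...", "second training sentence ..."]  # your data

# Count how many sentences would be truncated at a limit of 128 word pieces
n_long = sum(len(tokenizer.tokenize(s)) > 128 for s in train_sentences)
print(f"{n_long / len(train_sentences):.1%} of sentences exceed 128 word pieces")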
@nreimers I use the model mentioned above. An exception occurs when I encode a long text of 2000 words if I set a larger max sequence length.
Hi @thesby
I got it, thank you.
(Correct me if I'm wrong UKPLab) you could also use a transformers model that handles larger sequences, like Longformer:
word_embedding_model = models.Transformer('allenai/longformer-base-4096')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
The limit is 4096 in that case. Or you could do some smart batching, where you embed larger paragraphs (e.g. with Longformer) and mean them all together, e.g. if you want the embedding of a document or multiple documents. See gpu/nlp.py for batching over multiple paragraphs/documents.
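A rough sketch of the chunk-and-average idea described above, assuming chunks are paragraphs split on blank lines (the model name and document are placeholders):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # placeholder model
document = "First paragraph ...\n\nSecond paragraph ...\n\nThird paragraph ..."

# Embed each chunk separately, then average the chunk embeddings
# into a single document-level vector
chunks = [c for c in document.split("\n\n") if c.strip()]
chunk_embeddings = model.encode(chunks)
doc_embedding = np.mean(chunk_embeddings, axis=0)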
I too was looking for something like Longformer. I basically want document embeddings. I have tried averaging sentence embeddings (using sentence-transformers), but it seems like a very naive approach.
@jtank38 I think
@lefnire I am using this model too. But the output embeddings are very similar. So I want to convert it, but I get stuck at the first step: how to convert the model to be a Longformer model? Any suggestion?
@thesby Not sure how to do that. You would need to create a Longformer structure similar to XLM-R, but then change the attention mechanism so that it does not do full attention but uses the attention from Longformer. It does not sound simple to do this.
@nreimers I tried with the tutorial, but the config.json exists. I found that the format of the config.json from the sentence-transformers model is different from a normal transformers config.
Check the 0_Transformer folder; this contains the XLM-R model. The config.json in the top folder is for sentence-transformers and stores the information about which modules are included in the model (transformer model, pooling layer, etc.).
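For illustration, loading the plain transformer weights from that sub-folder might look like this (the path is just a placeholder for wherever the model was downloaded to):

from transformers import AutoModel, AutoTokenizer

# The 0_Transformer sub-folder holds the Hugging Face model weights and its own config.json
path = "/path/to/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer"
model = AutoModel.from_pretrained(path)
tokenizer = AutoTokenizer.from_pretrained(path)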
Using 0_Transformer doesn't work. The same error occurs.
@nreimers Yes, you are right. I got the error because Jupyter does not recognize the path "~/.cache/xxx". When I use an absolute path, there is no problem.
import logging
import os
import math
from dataclasses import dataclass, field
from transformers import AutoTokenizer, AutoModelForMaskedLM, RobertaForMaskedLM, RobertaTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

import torch
import numpy as np


class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)


model_base_name = "xlm-r-100langs-bert-base-nli-stsb-mean-tokens"


def create_long_model(save_model_to, attention_window, max_pos):
    model = AutoModelForMaskedLM.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer")
    tokenizer = AutoTokenizer.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer", model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos

    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)

    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos - 2
    print("k", k, "step", step, "weight.shape", model.roberta.embeddings.position_embeddings.weight.shape)
    while k < max_pos - 1:
        print("k", k, new_pos_embed.shape)
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
    model.roberta.embeddings.position_ids = torch.from_numpy(np.arange(new_pos_embed.shape[0], dtype=np.int32)[np.newaxis, :])

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})


parser = HfArgumentParser((TrainingArguments, ModelArgs,))
training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

# Choose GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

model_path = f'{training_args.output_dir}/{model_base_name}-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting roberta-base into {model_base_name}-{model_args.max_pos}')
create_long_model(save_model_to=model_path, attention_window=model_args.attention_window,
                  max_pos=model_args.max_pos)
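A follow-up sketch, mirroring the Longformer conversion notebook this script is based on: the saved checkpoint would be loaded back with the long-attention class defined above, since a plain AutoModel would not use the LongformerSelfAttention modules.

# Reload the converted checkpoint with the custom class so the Longformer
# self-attention (including the *_global projections) is actually used
long_model = RobertaLongForMaskedLM.from_pretrained(model_path)
long_tokenizer = AutoTokenizer.from_pretrained(model_path)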
Hi @nreimers, which dataset do you think would be good for fine-tuning, either a base model at the full length of 512 or a large model (1024)?
I am sadly not aware of any good datasets. Maybe some summarization datasets could work?
Is there a limit on sentence length? I get the same result when using a very long sentence either way. Thanks