
Long sentence embedding #364

Open · leleyi opened this issue Aug 17, 2020 · 23 comments

@leleyi commented Aug 17, 2020

Is there a limit on sentence length? I get the same result when I use a very long sentence. Is there a way around this? Thanks

@nreimers (Member)

The pre-trained models have the max sequence length set to 128 word pieces, but this can be increased if needed. BERT in general has a limit of 510 word pieces.

Inputs longer than this will be truncated.

@leleyi (Author) commented Aug 17, 2020 via email

@nreimers (Member)

Not sure if it is in the latest build already, but you can try:

model.max_seq_length = 510

Otherwise, the following works:

model._first_module().max_seq_length = 510
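
For context, a minimal sketch of both options (the model name is just an example; whether the model.max_seq_length setter is available depends on the installed version, as noted above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # example model

# If your build already has the property setter:
model.max_seq_length = 510

# Otherwise, set it on the underlying Transformer module directly:
model._first_module().max_seq_length = 510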

@leleyi (Author) commented Aug 18, 2020 via email

@thesby commented Sep 23, 2020

@nreimers After setting model.max_seq_length = 510, I get an exception when I try to encode a text of about 2000 words:

   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815
   1816

IndexError: index out of range in self

But if I don't set model.max_seq_length at all, there is no exception with the same long text.

@nreimers (Member)

Try a smaller value like model.max_seq_length = 500

Some models might add more than 2 special tokens.
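
If you prefer to compute the exact headroom instead of guessing, a hedged sketch using the underlying Hugging Face tokenizer (attribute paths may differ between versions):

# Tokenizer of the underlying transformer module
tokenizer = model._first_module().tokenizer

# Number of special tokens added around a single sequence ([CLS]/[SEP], <s>/</s>, ...)
num_special = tokenizer.num_special_tokens_to_add(pair=False)

# Largest usable input length for this model
model.max_seq_length = tokenizer.model_max_length - num_special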

@thesby commented Sep 23, 2020

@nreimers Great, thank you

@PhilipMay (Contributor)

@nreimers when I train a sentence embedding model as done here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py

Do you think it makes a difference if I decrease the max sequence length of the language model (to 128) that I use for training?

@nreimers (Member)

Hi @PhilipMay
It depends on your train dataset.

How many sentences are longer than 128 word pieces? If this is only a small fraction, increasing or decreasing the limit will not change anything.

If most sentences are longer than 128, then changing the value can have an impact. The model then just trains on, e.g., the first 128 word pieces of the respective sentences.
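
To check that fraction on your own data, a rough sketch (train_sentences stands in for your list of training sentences; note that tokenizer.tokenize does not count the special tokens added on top):

tokenizer = model._first_module().tokenizer

limit = 128
num_longer = sum(len(tokenizer.tokenize(s)) > limit for s in train_sentences)
print(f"{num_longer / len(train_sentences):.1%} of sentences exceed {limit} word pieces")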

@thesby commented Sep 24, 2020

@nreimers I use the model xlm-r-100langs-bert-base-nli-stsb-mean-tokens, which supports 512 tokens. Can I raise the max to 1024 by setting model.max_seq_length = 1024?

An exception occurs when I encode a long text of 2000 words if I set model.max_seq_length = 1024:

   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815
   1816

IndexError: index out of range in self

@nreimers (Member)

Hi @thesby
BERT is limited to 512 tokens (some tokens are reserved for special tokens like [CLS] and [SEP]). Same for XLM-R. Setting max_seq_length to values larger than 509 / 510 will not work.
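
You can also read these numbers off the loaded model instead of hard-coding them; a hedged sketch (it assumes the first module wraps the Hugging Face model as auto_model, which may differ between versions):

transformer = model._first_module()

# 512 for BERT; XLM-R/RoBERTa-style models have 514 position embeddings, of which two are reserved
print(transformer.auto_model.config.max_position_embeddings)

# Special tokens added per single sequence (usually 2)
print(transformer.tokenizer.num_special_tokens_to_add(pair=False))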

@thesby commented Sep 24, 2020

I got it, thank you

@lefnire commented Oct 6, 2020

(Correct me if I'm wrong UKPLab) you could also use a transformers model that handles larger sequences, like Longformer:

word_embedding_model = models.Transformer('allenai/longformer-base-4096')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

4096 in that case. Or you could do some smart batching, where you embed larger chunks such as paragraphs (e.g. with Longformer) and average them all together, e.g. if you want the embedding of a document or of multiple documents. See gpu/nlp.py for batching over multiple paragraphs/documents, then just np.mean(embeddings, axis=0); a sketch follows below.
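
A rough sketch of that chunk-and-average idea (the paragraph splitting is only an example, and document stands in for your own text; gpu/nlp.py refers to lefnire's repository, not to this one):

import numpy as np

# Split the document into chunks that fit the model's limit, e.g. paragraphs
paragraphs = [p for p in document.split("\n\n") if p.strip()]

# Encode all chunks in one batch, then average into a single document vector
embeddings = model.encode(paragraphs)
doc_embedding = np.mean(embeddings, axis=0)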

@jtank38 commented Oct 8, 2020

I was also looking for something like Longformer. I basically want document embeddings; I have tried averaging sentence embeddings (using sentence-transformers), but it seems like a very naive approach.

@lefnire commented Oct 11, 2020

@jtank38 I think embeddings.mean() isn't naive - it's used in the UKPLab examples. But doing it over sentences will probably dilute a lot, IMO; it seems better to average longer chunks like paragraphs?

@thesby commented Oct 28, 2020

@lefnire I am using allenai/longformer-base-4096 too, but the output embeddings are all very similar. So I want to convert xlm-r-bert-base-nli-stsb-mean-tokens into a Longformer model and then load the Longformer with sentence_transformers.

But I am stuck at the first step: how do I convert the model into a Longformer model? Any suggestions?

@nreimers (Member)

@thesby Not sure how to do that. You would need to create a Longformer structure similar to XLM-R, but then change the attention mechanism so that it does not do full attention and instead uses the attention from Longformer.

This does not sound simple to do.

@thesby commented Oct 28, 2020

@nreimers I tried the tutorial https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb, but got an error asking me to make sure that '/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/' is the correct path to a directory containing a config.json file.

But the config.json exists. I found that the format of the config.json from sentence-transformers is very different from the original transformers one, so it's difficult to convert this model.

@nreimers (Member)

Check the 0_Transformer folder; it contains the XLM-R model.

The config.json in the top folder is for the SentenceTransformer and stores the information about which modules are included in the model (transformer model, pooling layer, etc.).
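
For illustration, the saved model folder looks roughly like this (exact file names vary by version), and conversion scripts should be pointed at the 0_Transformer subfolder:

xlm-r-100langs-bert-base-nli-stsb-mean-tokens/
    config.json          <- sentence-transformers config (lists the modules), not a HF model config
    0_Transformer/       <- the actual XLM-R model and tokenizer in HF format
        config.json      <- Hugging Face model config
        pytorch_model.bin
        tokenizer files
    1_Pooling/
        config.json      <- pooling settings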

@thesby commented Oct 28, 2020

Using 0_Transformer doesn't work. The same error occurs.

@thesby commented Oct 29, 2020

@nreimers Yes, you are right. I got the error because Jupyter does not recognize the path "~/.cache/xxx". When I use an absolute path, there is no problem.

import logging
import os
import math
from dataclasses import dataclass, field
from transformers import AutoTokenizer, AutoModelForMaskedLM, RobertaForMaskedLM, RobertaTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
import torch
import numpy as np

class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

model_base_name = "xlm-r-100langs-bert-base-nli-stsb-mean-tokens"
def create_long_model(save_model_to, attention_window, max_pos):
    model = AutoModelForMaskedLM.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer")
    tokenizer = AutoTokenizer.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer", model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos-2
    print("k", k, "step", step, "weight.shape", model.roberta.embeddings.position_embeddings.weight.shape)
    while k < max_pos - 1:
        print("k", k, new_pos_embed.shape)
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
    model.roberta.embeddings.position_ids = torch.from_numpy(np.arange(new_pos_embed.shape[0], dtype=np.int64)[np.newaxis, :])  # position ids must be int64 (LongTensor)

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))


training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

# Choose GPU
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_path = f'{training_args.output_dir}/{model_base_name}-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting {model_base_name} into {model_base_name}-{model_args.max_pos}')
create_long_model(save_model_to=model_path, attention_window=model_args.attention_window, 
                  max_pos=model_args.max_pos)
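
Once the converted model has been saved, a hedged sketch of loading it back as a SentenceTransformer (mean pooling is an assumption here; you may want to copy the settings from the original 1_Pooling/config.json instead):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer(model_path, max_seq_length=model_args.max_pos)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
sbert_long = SentenceTransformer(modules=[word_embedding_model, pooling_model])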

@sadakmed (Contributor)

Hi @nreimers,

Which dataset do you think would be good for fine-tuning either a base model on the full length of 512 or a large model (1024)?
In my case, increasing model.max_seq_length to cover long texts resulted in low performance; I ended up averaging embeddings with max_seq_length=128.

@nreimers (Member)

I am sadly not aware of any good datasets. Maybe some summarization datasets could work?
