
Long sentence embedding #364

Open · leleyi opened this issue Aug 17, 2020 · 23 comments

@leleyi commented Aug 17, 2020

Is there a limit on sentence length? I get the same result when I use a very long sentence. Is there a way around this? Thanks

@nreimers (Member)

The pre-trained models have the max sequence length set to 128 word pieces, but this can be increased if needed. BERT in general has a limit of 510 word pieces.

Inputs longer than this will be truncated.

@leleyi (Author) commented Aug 17, 2020 via email

@nreimers (Member)

Not sure if it is in the latest build already, but you can try:

model.max_seq_length = 510

Otherwise, the following works:

model._first_module().max_seq_length = 510
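
For context, a minimal sketch of both options (the model name is just an example; whether the model.max_seq_length setter is available depends on the installed version, as noted above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')  # example model

# If your build already has the property setter:
model.max_seq_length = 510

# Otherwise, set it on the underlying Transformer module directly:
model._first_module().max_seq_length = 510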

@leleyi (Author) commented Aug 18, 2020 via email

@thesby commented Sep 23, 2020

@nreimers After setting model.max_seq_length = 510, I get an exception when I try to encode a text of about 2000 words:

   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815
   1816

IndexError: index out of range in self

But if I don't set model.max_seq_length at all, there is no exception with the same long text.

@nreimers (Member)

Try a smaller value like model.max_seq_length = 500

Some models might add more than 2 special tokens.
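
If you prefer to compute the exact headroom instead of guessing, a hedged sketch using the underlying Hugging Face tokenizer (attribute paths may differ between versions):

# Tokenizer of the underlying transformer module
tokenizer = model._first_module().tokenizer

# Number of special tokens added around a single sequence ([CLS]/[SEP], <s>/</s>, ...)
num_special = tokenizer.num_special_tokens_to_add(pair=False)

# Largest usable input length for this model
model.max_seq_length = tokenizer.model_max_length - num_special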

@thesby commented Sep 23, 2020

@nreimers Great, thank you

@PhilipMay (Contributor)

@nreimers when I train a sentence embedding model as done here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/sts/training_stsbenchmark.py

Do you think it makes a difference if I decrease the max sequence length of the language model (to 128) that I use for training?

@nreimers (Member)

Hi @PhilipMay
It depends on your train dataset.

How many sentences are longer than 128 word pieces? If this is only a small fraction, increasing or decreasing the limit will not change anything.

If most sentences are longer than 128, then changing the value can have an impact. The model then just trains on, e.g., the first 128 word pieces of the respective sentences.
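
To check that fraction on your own data, a rough sketch (train_sentences stands in for your list of training sentences; note that tokenizer.tokenize does not count the special tokens added on top):

tokenizer = model._first_module().tokenizer

limit = 128
num_longer = sum(len(tokenizer.tokenize(s)) > limit for s in train_sentences)
print(f"{num_longer / len(train_sentences):.1%} of sentences exceed {limit} word pieces")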

@thesby commented Sep 24, 2020

@nreimers I use the model xlm-r-100langs-bert-base-nli-stsb-mean-tokens, which supports 512 tokens. Can I raise the max to 1024 by setting model.max_seq_length = 1024?

An exception occurs when I encode a long text of 2000 words if I set model.max_seq_length = 1024:

   1812         # remove once script supports set_grad_enabled
   1813         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1814     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   1815
   1816

IndexError: index out of range in self

@nreimers (Member)

Hi @thesby
BERT is limited to 512 tokens (some tokens are reserved for special tokens like [CLS] and [SEP]). Same for XLM-R. Setting max_seq_length to values larger than 509 / 510 will not work.
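
You can also read these numbers off the loaded model instead of hard-coding them; a hedged sketch (it assumes the first module wraps the Hugging Face model as auto_model, which may differ between versions):

transformer = model._first_module()

# 512 for BERT; XLM-R/RoBERTa-style models have 514 position embeddings, of which two are reserved
print(transformer.auto_model.config.max_position_embeddings)

# Special tokens added per single sequence (usually 2)
print(transformer.tokenizer.num_special_tokens_to_add(pair=False))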

@thesby commented Sep 24, 2020

I got it, thank you

@lefnire commented Oct 6, 2020

(Correct me if I'm wrong UKPLab) you could also use a transformers model that handles larger sequences, like Longformer:

word_embedding_model = models.Transformer('allenai/longformer-base-4096')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

4096 in that case. Or you could do some smart batching, where you embed larger chunks such as paragraphs (e.g. with Longformer) and average them all together, e.g. if you want the embedding of a document or of multiple documents. See gpu/nlp.py for batching over multiple paragraphs/documents, then just np.mean(embeddings, axis=0); a sketch follows below.
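
A rough sketch of that chunk-and-average idea (the paragraph splitting is only an example, and document stands in for your own text; gpu/nlp.py refers to lefnire's repository, not to this one):

import numpy as np

# Split the document into chunks that fit the model's limit, e.g. paragraphs
paragraphs = [p for p in document.split("\n\n") if p.strip()]

# Encode all chunks in one batch, then average into a single document vector
embeddings = model.encode(paragraphs)
doc_embedding = np.mean(embeddings, axis=0)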

@jtank38 commented Oct 8, 2020

I was also looking for something like Longformer. I basically want document embeddings; I have tried averaging sentence embeddings (using sentence-transformers), but it seems like a very naive approach.

@lefnire commented Oct 11, 2020

@jtank38 I think embeddings.mean() isn't naive - it's used in the UKPLab examples. But doing it over sentences will probably dilute a lot, IMO; it seems better to average longer chunks like paragraphs?

@thesby commented Oct 28, 2020

@lefnire I am using allenai/longformer-base-4096 too, but the output embeddings are all very similar. So I want to convert xlm-r-bert-base-nli-stsb-mean-tokens into a Longformer model and then load the Longformer with sentence_transformers.

But I am stuck at the first step: how do I convert the model into a Longformer model? Any suggestions?

@nreimers (Member)

@thesby Not sure how to do that. You would need to create a Longformer structure similar to XLM-R, but then change the attention mechanism so that it does not do full attention and instead uses the attention from Longformer.

This does not sound simple to do.

@thesby commented Oct 28, 2020

@nreimers I tried the tutorial https://github.com/allenai/longformer/blob/master/scripts/convert_model_to_long.ipynb, but got an error asking me to make sure that '/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/' is the correct path to a directory containing a config.json file.

But the config.json exists. I found that the format of the config.json from sentence-transformers is very different from the original transformers one, so it's difficult to convert this model.

@nreimers (Member)

Check the 0_Transformer folder; it contains the XLM-R model.

The config.json in the top folder is for the SentenceTransformer and stores the information about which modules are included in the model (transformer model, pooling layer, etc.).
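
For illustration, the saved model folder looks roughly like this (exact file names vary by version), and conversion scripts should be pointed at the 0_Transformer subfolder:

xlm-r-100langs-bert-base-nli-stsb-mean-tokens/
    config.json          <- sentence-transformers config (lists the modules), not a HF model config
    0_Transformer/       <- the actual XLM-R model and tokenizer in HF format
        config.json      <- Hugging Face model config
        pytorch_model.bin
        tokenizer files
    1_Pooling/
        config.json      <- pooling settings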

@thesby commented Oct 28, 2020

Using 0_Transformer doesn't work. The same error occurs.

@thesby commented Oct 29, 2020

@nreimers Yes, you are right. I got the error because Jupyter does not recognize the path "~/.cache/xxx". When I use an absolute path, there is no problem.

import logging
import os
import math
from dataclasses import dataclass, field
from transformers import AutoTokenizer, AutoModelForMaskedLM, RobertaForMaskedLM, RobertaTokenizerFast, TextDataset, DataCollatorForLanguageModeling, Trainer
from transformers import TrainingArguments, HfArgumentParser
from transformers.modeling_longformer import LongformerSelfAttention
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
import torch
import numpy as np

class RobertaLongSelfAttention(LongformerSelfAttention):
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
    ):
        return super().forward(hidden_states, attention_mask=attention_mask, output_attentions=output_attentions)


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        for i, layer in enumerate(self.roberta.encoder.layer):
            # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

model_base_name = "xlm-r-100langs-bert-base-nli-stsb-mean-tokens"
def create_long_model(save_model_to, attention_window, max_pos):
    model = AutoModelForMaskedLM.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer")
    tokenizer = AutoTokenizer.from_pretrained("/Users/thesby/.cache/torch/sentence_transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens/0_Transformer", model_max_length=max_pos)
    config = model.config

    # extend position embeddings
    tokenizer.model_max_length = max_pos
    tokenizer.init_kwargs['model_max_length'] = max_pos
    current_max_pos, embed_size = model.roberta.embeddings.position_embeddings.weight.shape
    max_pos += 2  # NOTE: RoBERTa has positions 0,1 reserved, so embedding size is max position + 2
    config.max_position_embeddings = max_pos
    assert max_pos > current_max_pos
    # allocate a larger position embedding matrix
    new_pos_embed = model.roberta.embeddings.position_embeddings.weight.new_empty(max_pos, embed_size)
    # copy position embeddings over and over to initialize the new position embeddings
    k = 2
    step = current_max_pos-2
    print("k", k, "step", step, "weight.shape", model.roberta.embeddings.position_embeddings.weight.shape)
    while k < max_pos - 1:
        print("k", k, new_pos_embed.shape)
        new_pos_embed[k:(k + step)] = model.roberta.embeddings.position_embeddings.weight[2:]
        k += step
    model.roberta.embeddings.position_embeddings.weight.data = new_pos_embed
    model.roberta.embeddings.position_ids = torch.from_numpy(np.arange(new_pos_embed.shape[0], dtype=np.int64)[np.newaxis, :])  # position ids must be int64 (LongTensor)

    # replace the `modeling_bert.BertSelfAttention` object with `LongformerSelfAttention`
    config.attention_window = [attention_window] * config.num_hidden_layers
    for i, layer in enumerate(model.roberta.encoder.layer):
        longformer_self_attn = LongformerSelfAttention(config, layer_id=i)
        longformer_self_attn.query = layer.attention.self.query
        longformer_self_attn.key = layer.attention.self.key
        longformer_self_attn.value = layer.attention.self.value

        longformer_self_attn.query_global = layer.attention.self.query
        longformer_self_attn.key_global = layer.attention.self.key
        longformer_self_attn.value_global = layer.attention.self.value

        layer.attention.self = longformer_self_attn

    logger.info(f'saving model to {save_model_to}')
    model.save_pretrained(save_model_to)
    tokenizer.save_pretrained(save_model_to)
    return model, tokenizer


@dataclass
class ModelArgs:
    attention_window: int = field(default=512, metadata={"help": "Size of attention window"})
    max_pos: int = field(default=4096, metadata={"help": "Maximum position"})

parser = HfArgumentParser((TrainingArguments, ModelArgs,))


training_args, model_args = parser.parse_args_into_dataclasses(look_for_args_file=False, args=[
    '--output_dir', 'tmp',
    '--warmup_steps', '500',
    '--learning_rate', '0.00003',
    '--weight_decay', '0.01',
    '--adam_epsilon', '1e-6',
    '--max_steps', '3000',
    '--logging_steps', '500',
    '--save_steps', '500',
    '--max_grad_norm', '5.0',
    '--per_gpu_eval_batch_size', '8',
    '--per_gpu_train_batch_size', '2',  # 32GB gpu with fp32
    '--gradient_accumulation_steps', '32',
    '--evaluate_during_training',
    '--do_train',
    '--do_eval',
])
training_args.val_datapath = 'wikitext-103-raw/wiki.valid.raw'
training_args.train_datapath = 'wikitext-103-raw/wiki.train.raw'

# Choose GPU
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_path = f'{training_args.output_dir}/{model_base_name}-{model_args.max_pos}'
if not os.path.exists(model_path):
    os.makedirs(model_path)

logger.info(f'Converting {model_base_name} into {model_base_name}-{model_args.max_pos}')
create_long_model(save_model_to=model_path, attention_window=model_args.attention_window, 
                  max_pos=model_args.max_pos)
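
Once the converted model has been saved, a hedged sketch of loading it back as a SentenceTransformer (mean pooling is an assumption here; you may want to copy the settings from the original 1_Pooling/config.json instead):

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer(model_path, max_seq_length=model_args.max_pos)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
sbert_long = SentenceTransformer(modules=[word_embedding_model, pooling_model])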

@sadakmed (Contributor)

Hi @nreimers,

Which dataset do you think would be good for fine-tuning either a base model on the full length of 512 or a large model (1024)?
In my case, increasing model.max_seq_length to cover long texts resulted in low performance; I ended up averaging embeddings with max_seq_length=128.

@nreimers (Member)

I am sadly not aware of any good datasets. Maybe some summarization datasets could work?
