System Info
transformers version: 4.40.0

Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
1. Add a breakpoint at transformers/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py, line 625 (commit e655029).
2. Run the example script of EncoderDecoderModel at transformers/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py, lines 562 to 577 (commit e655029).
3. Examine the content of decoder_input_ids. You will see that the bos_token has been added twice (a self-contained sketch of these steps follows below).
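For convenience, here is a minimal sketch of these steps. The bert-base-uncased checkpoint and the example sentence are just placeholders, and it assumes shift_tokens_right from modeling_encoder_decoder.py, i.e. the helper the forward pass uses to build decoder_input_ids from labels:

```python
from transformers import BertTokenizer, EncoderDecoderModel
from transformers.models.encoder_decoder.modeling_encoder_decoder import shift_tokens_right

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id  # BERT's [CLS] plays the role of bos_token
model.config.pad_token_id = tokenizer.pad_token_id

# BertTokenizer already wraps the labels in [CLS] ... [SEP]
labels = tokenizer("This is the corresponding summary", return_tensors="pt").input_ids
print(tokenizer.convert_ids_to_tokens(labels[0].tolist()))

# Same construction the forward pass performs when only `labels` is passed:
decoder_input_ids = shift_tokens_right(
    labels, model.config.pad_token_id, model.config.decoder_start_token_id
)
print(tokenizer.convert_ids_to_tokens(decoder_input_ids[0].tolist()))
# -> ['[CLS]', '[CLS]', 'this', ...]: the bos_token appears twice
```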
Expected behavior
The bos_token should only be prepended once to the input to the decoder.
The reason this bug occurs is that BertTokenizer already wraps the text in a bos_token and an eos_token. Then, during the EncoderDecoderModel forward pass (specifically during construction of the decoder_input_ids from the labels), another bos_token is added.

This means that, in order to use EncoderDecoderModel correctly, one has to tokenize the decoder inputs so that an eos_token is added at the end of the text but a bos_token is NOT added at the beginning. This is awkward and clearly not the intended behavior; a sketch of the workaround follows below.
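A rough sketch of that workaround (checkpoint name and sentence are placeholders): tokenize the decoder targets without special tokens and append only the [SEP]/eos token, so the forward pass prepends the single decoder_start_token_id itself.

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "This is the corresponding summary"
# No [CLS]/bos at the start, only a [SEP]/eos appended at the end.
label_ids = tokenizer(text, add_special_tokens=False).input_ids + [tokenizer.sep_token_id]
labels = torch.tensor([label_ids])
# Passing these labels to the EncoderDecoderModel forward lets its shift-right logic
# prepend the single decoder_start_token_id, avoiding the duplicated bos_token.
```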
The decoder part of an EncoderDecoderModel is rarely initialized from an encoder-only Transformer. Since the tokenizers of decoder-only Transformers add neither bos_tokens nor eos_tokens, I think the correct fix would be for the EncoderDecoderModel forward pass to add both the bos_token and the eos_token itself. To clear up the confusion, we should also change the official example to construct an EncoderDecoderModel from an encoder-only Transformer (encoder) and a decoder-only Transformer (decoder).
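To make that concrete, here is a minimal sketch of the proposed behavior; build_decoder_tensors is a hypothetical helper, not part of transformers, and only illustrates where the special tokens would be added.

```python
import torch

def build_decoder_tensors(label_ids, bos_token_id, eos_token_id):
    """Illustrative only: wrap raw label ids (tokenized WITHOUT special tokens).

    The decoder input starts with bos and the target ends with eos, so the two
    stay aligned for teacher forcing and no tokenizer has to add special tokens.
    """
    decoder_input_ids = torch.tensor([[bos_token_id] + label_ids])
    labels = torch.tensor([label_ids + [eos_token_id]])
    return decoder_input_ids, labels
```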