eos token is truncated when max_length is shorter than the total input seq length #1145
-
I use the Transformers Trainer to fine-tune Qwen2.5-7B-Instruct with LoRA. I apply the chat template when preprocessing the data, and the format is <im_start> input <im_start> + <im_start> output <im_start>. However, when the total input sequence is longer than max_length, the eos_token (i.e. <im_end>) is truncated. Is this normal for fine-tuning? I'd like to know whether it is necessary to fix it.
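For reference, a minimal sketch of how this shows up with the tokenizer (the messages and the max_length value are illustrative placeholders, not from the original post):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "user", "content": "a long question ..."},
    {"role": "assistant", "content": "a long answer ..."},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(messages, tokenize=False)

# With right-side truncation, everything past max_length is dropped,
# including the closing end-of-turn token when the sequence is too long.
enc = tokenizer(text, truncation=True, max_length=512)
print(tokenizer.decode(enc["input_ids"][-5:]))  # may no longer end with the eos token
```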
-
I believe the safest approach is to cut off entire messages if they exceed the context length, ensuring there are no incomplete messages in the sequence. However, as long as the truncation is done correctly (meaning the target for the last token in the truncated sequence is the token that followed it before truncation), it shouldn't matter. Additionally, since you are effectively using <im_start> as an end-of-turn token, I don't think you need to worry about truncation affecting <im_end>.
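A hedged sketch of the whole-message truncation idea (the helper name and signature are my own, not from any library; `tokenizer` is assumed to be a chat-template-aware Hugging Face tokenizer):

```python
def truncate_to_whole_messages(messages, tokenizer, max_length):
    """Keep prepending-order messages only while the templated
    conversation still fits in max_length tokens."""
    kept = []
    for msg in messages:
        candidate = kept + [msg]
        # apply_chat_template with tokenize=True returns token ids.
        n_tokens = len(tokenizer.apply_chat_template(candidate, tokenize=True))
        if n_tokens > max_length:
            break  # stop before an incomplete message would enter the batch
        kept = candidate
    return kept

# Example usage:
# kept = truncate_to_whole_messages(messages, tokenizer, max_length=1024)
```

Note also that with the default Hugging Face causal-LM loss, labels are a copy of input_ids and the shift happens inside the model, so a plain right truncation cannot create a misaligned target; the final truncated token simply receives no loss.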