eos token is truncated when max_length is shorter than the total input seq length #1145
-
I use the Transformers Trainer to fine-tune Qwen2.5-7B-Instruct with LoRA. I apply the chat template when preprocessing the data, and the format is <im_start> input <im_start> + <im_start> output <im_start>. However, when the total input sequence is longer than max_length, the eos_token (i.e. <im_end>) is truncated. Is this normal for fine-tuning? I'd like to know whether it is necessary to fix it.
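For reference, a minimal sketch of how this shows up with the tokenizer (the messages and the max_length value are illustrative placeholders, not from the original post):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "user", "content": "a long question ..."},
    {"role": "assistant", "content": "a long answer ..."},
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(messages, tokenize=False)

# With right-side truncation, everything past max_length is dropped,
# including the closing end-of-turn token when the sequence is too long.
enc = tokenizer(text, truncation=True, max_length=512)
print(tokenizer.decode(enc["input_ids"][-5:]))  # may no longer end with the eos token
```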
-
I believe the safest approach is to cut off entire messages if they exceed the context length, ensuring there are no incomplete messages in the sequence. However, as long as the truncation is done correctly (meaning the target for the last token in the truncated sequence is the token that followed it before truncation), it shouldn't matter. Additionally, since you are effectively using <im_start> as an end-of-turn token, I don't think you need to worry about truncation affecting <im_end>.
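A hedged sketch of the whole-message truncation idea (the helper name and signature are my own, not from any library; `tokenizer` is assumed to be a chat-template-aware Hugging Face tokenizer):

```python
def truncate_to_whole_messages(messages, tokenizer, max_length):
    """Keep prepending-order messages only while the templated
    conversation still fits in max_length tokens."""
    kept = []
    for msg in messages:
        candidate = kept + [msg]
        # apply_chat_template with tokenize=True returns token ids.
        n_tokens = len(tokenizer.apply_chat_template(candidate, tokenize=True))
        if n_tokens > max_length:
            break  # stop before an incomplete message would enter the batch
        kept = candidate
    return kept

# Example usage:
# kept = truncate_to_whole_messages(messages, tokenizer, max_length=1024)
```

Note also that with the default Hugging Face causal-LM loss, labels are a copy of input_ids and the shift happens inside the model, so a plain right truncation cannot create a misaligned target; the final truncated token simply receives no loss.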