feat: Long Text Fine-Tuning Support #5532
base: main
Conversation
_register_template glm4_long
# TODO long-eos task: you need to set your own default_system when the dataset already provides a 'system' key
# build inputs with the format `<bos> X Y <eos>`
# but labels with the format ` Y ` without `<eos>` for the long-eos task
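As a rough illustration of that labeling scheme (a sketch only, not the PR's actual code; the function name and the `IGNORE_INDEX = -100` masking convention are assumptions):

# Sketch: build input_ids as <bos> X Y <eos>, but supervise only Y in the
# labels so the model is not trained to emit <eos> on long-eos samples.
# Names here are illustrative, not taken from the PR.
IGNORE_INDEX = -100  # the usual "ignore" label id for cross-entropy loss

def build_long_eos_example(bos_id, eos_id, prompt_ids, response_ids):
    input_ids = [bos_id] + prompt_ids + response_ids + [eos_id]
    labels = (
        [IGNORE_INDEX] * (1 + len(prompt_ids))  # mask <bos> and the prompt X
        + list(response_ids)                     # supervise the response Y
        + [IGNORE_INDEX]                         # do not supervise <eos>
    )
    return input_ids, labels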
- Implemented the `pack_data_preprocess` parameter to control input handling during training.
- When set to `True`, it disables `cutoff_len` truncation of inputs, raising an error if the input exceeds the specified length.
- Updated the frontend to reflect changes in the parameter's behavior.
- Completed training of the full `longwriter-glm4-9b` model with the new configuration.
- Included testing with the specified dataset to validate the implementation.
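A minimal sketch of the intended behavior (hypothetical helper name; only `pack_data_preprocess` and `cutoff_len` come from the PR):

def check_packed_turn(source_ids, target_ids, cutoff_len, pack_data_preprocess):
    # Sketch only: with pack_data_preprocess enabled, an over-long turn is a
    # hard error instead of being truncated to cutoff_len.
    total_len = len(source_ids) + len(target_ids)
    if pack_data_preprocess and total_len >= cutoff_len:
        raise ValueError(
            f"turn length {total_len} exceeds cutoff_len {cutoff_len}; "
            "increase cutoff_len or disable pack_data_preprocess"
        )
    return source_ids, target_ids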
We are still verifying this; the distributed repository does not yet contain the relevant code. Use the files in this compressed package to overwrite the existing ones.
logger.warning(f"""cutoff_len {cutoff_len} is too small for the input turn_idx: {turn_idx}, drop it.
eg: The eos_indice is exactly one less than the bubble length, causing the last one to be discarded.
""")
break
L59 raises an exception; curious why L66 just breaks?
When pack_data_preprocess is true, cutoff_len is not used for truncating the input
pack_data_preprocess and len(source_ids)+len(target_ids) >= cutoff_len:
This check verifies maximum packing of long texts: for example, when the message length is >= 21, it should raise an error instead of silently discarding data that does not form a complete training pack.
Cases where an error should be reported:
preprocess_packed_supervised_dataset receives batched data from dataset.map. When the number of preprocessing workers is 1, only one process handles the data. The diagram above is a simplification; normally the data is split into batch_size chunks and distributed across all worker processes, as in the standalone example after the snippet below.
dataset = dataset.map(
preprocess_func,
batched=True,
batch_size=data_args.preprocessing_batch_size,
remove_columns=column_names,
**kwargs,
)
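For reference, a small self-contained `datasets` example of the batched-map behavior described above (toy column and function names, unrelated to the PR's code):

from datasets import Dataset

ds = Dataset.from_dict({"text": [f"example {i}" for i in range(10)]})

def preprocess_batch(examples):
    # 'examples' holds up to batch_size rows per column; with num_proc > 1
    # these batches are distributed across worker processes.
    return {"n_chars": [len(t) for t in examples["text"]]}

ds = ds.map(preprocess_batch, batched=True, batch_size=4, num_proc=1)
print(ds["n_chars"])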
src/llamafactory/data/template.py
format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]),
format_tools=ToolFormatter(tool_format="glm4"),
stop_words=["<|user|>", "<|observation|>"],
# default_system= "",
Better to remove commented-out code.
okay
feat: Implement pack_data_preprocess parameter and integrate with frontend
Before submitting