
feat: Long Text Fine-Tuning Support #5532

Open · wants to merge 6 commits into main
Conversation

glide-the

feat: Implement pack_data_preprocess parameter and integrate with frontend

- Added a `pack_data_preprocess` parameter to control input handling during training.
  - When set to `True`, it disables `cutoff_len` truncation and instead raises an error if the input exceeds the specified length (see the sketch after this list).
- Updated the frontend to reflect the parameter's behavior.
- Completed full training of the `longwriter-glm4-9b` model with the new configuration.
- Included testing with the specified dataset to validate the implementation.
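
A minimal sketch of the intended behavior. `check_or_truncate` is a hypothetical helper name; only `pack_data_preprocess`, `cutoff_len`, `source_ids`, and `target_ids` are names from this PR:

    # Minimal sketch, not the PR's exact code.
    def check_or_truncate(source_ids, target_ids, cutoff_len, pack_data_preprocess):
        total_len = len(source_ids) + len(target_ids)
        if pack_data_preprocess:
            # Packing mode: never truncate; fail loudly so a long-text pack
            # is never silently corrupted by a clipped example.
            if total_len >= cutoff_len:
                raise ValueError(f"input length {total_len} exceeds cutoff_len {cutoff_len}")
            return source_ids, target_ids
        # Default mode: truncate to cutoff_len as before.
        source_ids = source_ids[:cutoff_len]
        target_ids = target_ids[: max(cutoff_len - len(source_ids), 0)]
        return source_ids, target_ids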

Before submitting

_register_template glm4_long
# TODO: the long-eos task needs your `default_system` when the dataset already provides 'system' keys
# build inputs with format `<bos> X Y <eos>`
# but build labels with format `Y` (no `<eos>`) for the long-eos task
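
To illustrate the comments above, a hedged sketch assuming the usual `IGNORE_INDEX = -100` masking convention; `build_example` is a hypothetical helper, not this PR's code:

    IGNORE_INDEX = -100  # standard label-masking value in HF-style trainers (assumption)

    def build_example(bos_id, x_ids, y_ids, eos_id, long_eos_task=True):
        # inputs: <bos> X Y <eos>
        input_ids = [bos_id] + x_ids + y_ids + [eos_id]
        # labels: mask <bos> and X, supervise Y
        labels = [IGNORE_INDEX] * (1 + len(x_ids)) + list(y_ids)
        # long-eos task: do not supervise <eos>; standard SFT would append eos_id
        labels.append(IGNORE_INDEX if long_eos_task else eos_id)
        return input_ids, labels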
@glide-the (Author)

We are still verifying that the distributed repository does not include the relevant code; use the files in this archive to overwrite the existing ones:

https://huggingface.co/THUDM/LongWriter-glm4-9b

glm_long.zip

logger.warning(f"""cutoff_len {cutoff_len} is too small for the input turn_idx {turn_idx}; dropping it.
e.g. the eos_indice is exactly one less than the bubble length, causing the last one to be discarded.
""")
break
Contributor

L59 raises an exception; curious why L66 just breaks?

@glide-the (Author) · Sep 27, 2024

When `pack_data_preprocess` is true, `cutoff_len` is not used to truncate the input.

The check `pack_data_preprocess and len(source_ids) + len(target_ids) >= cutoff_len` verifies the maximum packing of long texts. For example, when the message length is >= 21 but does not form a complete training pack, it should raise an error instead of discarding the data.

Ideal situation: [image]

Cases where an error should be reported: [image]
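
Putting the two branches side by side (an illustrative sketch only; `pack_turns` and its loop variables are hypothetical names, not the PR's exact code):

    import logging

    logger = logging.getLogger(__name__)

    def pack_turns(turns, cutoff_len, pack_data_preprocess):
        packed_ids = []
        for turn_idx, (source_ids, target_ids) in enumerate(turns):
            turn_len = len(source_ids) + len(target_ids)
            if pack_data_preprocess and turn_len >= cutoff_len:
                # packing mode: the example must fit; raise instead of
                # silently producing an incomplete training pack
                raise ValueError(f"turn {turn_idx} ({turn_len} tokens) exceeds cutoff_len {cutoff_len}")
            if len(packed_ids) + turn_len > cutoff_len:
                # truncation mode: the remaining turns do not fit in this
                # pack, so warn and drop them (the `break` asked about above)
                logger.warning(f"cutoff_len {cutoff_len} is too small for turn {turn_idx}, drop it.")
                break
            packed_ids += source_ids + target_ids
        return packed_ids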

`preprocess_packed_supervised_dataset` receives batched data from `dataset.map`. When the number of processing threads is 1, a single process handles all the data. The diagram above is simplified; normally the data would be split into `batch_size` pieces and distributed across all worker processes.


    dataset = dataset.map(
        preprocess_func,
        batched=True,
        batch_size=data_args.preprocessing_batch_size,
        remove_columns=column_names,
        **kwargs,
    )
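
With `batched=True`, the preprocessing function receives a dict of column lists (up to `batch_size` examples per call) and must return a dict of lists. A schematic signature; the column names here are assumptions, not necessarily the project's exact schema:

    def preprocess_packed_supervised_dataset(examples):
        # `examples` maps each column name to a list of values for the batch
        model_inputs = {"input_ids": [], "attention_mask": [], "labels": []}
        for prompt, response in zip(examples["prompt"], examples["response"]):
            pass  # tokenize, pack across examples, and append to model_inputs
        return model_inputs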

@hiyouga

    format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]),
    format_tools=ToolFormatter(tool_format="glm4"),
    stop_words=["<|user|>", "<|observation|>"],
    # default_system= "",
Contributor

Better to remove the commented-out code.

@glide-the (Author)

okay
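
For context, with the commented line removed, the full `glm4_long` registration presumably resembles the repository's stock glm4 template. Every line below that is not quoted in the fragment above is an assumption modeled on that template, not this PR's exact diff:

    _register_template(
        name="glm4_long",
        # unquoted lines below are assumptions modeled on the stock glm4 template
        format_user=StringFormatter(slots=["<|user|>\n{{content}}<|assistant|>"]),
        format_assistant=StringFormatter(slots=["\n{{content}}"]),
        format_system=StringFormatter(slots=["<|system|>\n{{content}}"]),
        format_observation=StringFormatter(slots=["<|observation|>\n{{content}}<|assistant|>"]),
        format_tools=ToolFormatter(tool_format="glm4"),
        stop_words=["<|user|>", "<|observation|>"],
        efficient_eos=True,
    )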

@hiyouga added the pending (This problem is yet to be addressed) and in-progress (The related features are in progress) labels on Nov 2, 2024