
Inquiry on Support for Qwen2.5 Models and Large Model Training Capabilities #43

Open · ArcherShirou opened this issue Oct 21, 2024 · 3 comments

Comments

@ArcherShirou

ArcherShirou commented Oct 21, 2024

I would like to ask whether there are plans to support the Qwen2.5 and Qwen2 series, or other popular open-source models such as Yi. Will the framework support merging large models, like the 72B version, similar to MergeKit? Given that running a 72B model requires a significant amount of memory, will the training phase support quantization and LoRA so that it can run on a single machine with 8 A800 GPUs? Additionally, will DeepSpeed be supported for distributed training? Support for merging and training common model sizes such as 72B, 70B, 34B, 14B, and 7B would greatly broaden the applicability of the method.

@SolshineCode

I've opened a PR adding the Qwen2.5 and Qwen2 series, as a first step toward integrating them: #45

I was thinking along similar lines about quantization and LoRA. I don't think LoRA would work here, though, because the DAM method uses the logits directly.

@shamanez
Member

Thanks, @ArcherShirou, for exploring our codebase.

> I would like to ask whether there are plans to support the Qwen2.5 and Qwen2 series, or other popular open-source models such as Yi. Will the framework support merging large models, like the 72B version, similar to MergeKit?

Definitely, we could do this, as I mentioned in #45.

> Given that running a 72B model requires a significant amount of memory, will the training phase support quantization and LoRA so that it can run on a single machine with 8 A800 GPUs?

Actually, quantized models can only be trained with adapter methods like LoRA. But in our method we only train the merging coefficients, which is a very small number of parameters, so the real constraint is the VRAM consumed when loading the model.
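For illustration, here is a minimal sketch of that idea: load a quantized base model, freeze all of its weights, and train only a small tensor of merging coefficients. The model name, coefficient shape, and optimizer settings below are assumptions for the example, not the actual DAM implementation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load a base model in 4-bit to cut VRAM during loading (requires bitsandbytes).
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",              # hypothetical choice of base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze every weight of the quantized model.
for p in model.parameters():
    p.requires_grad = False

# The only trainable parameters: one merging coefficient per layer per source
# model (here 3 source models), i.e. a few hundred scalars rather than billions.
num_layers = model.config.num_hidden_layers
merging_coefficients = torch.nn.Parameter(torch.ones(num_layers, 3) / 3)

optimizer = torch.optim.AdamW([merging_coefficients], lr=1e-3)
```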

We have already tried DeepSpeed; you can check it out in the "legacy" folder. But in our experiments, DeepSpeed gave us an OOM issue when we tried to merge three different 7B models, where the merged model had around 22B parameters. As far as I remember, the number of trainable parameters was around 3 million.
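For anyone who wants to experiment further, below is a hedged sketch of how a coefficient-only setup could be wired into DeepSpeed ZeRO-3 so that the frozen weights are sharded across the 8 GPUs. The wrapper class, model name, script name, and config values are assumptions for the example (not the repo's legacy script), and it may still hit OOM as described above.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

class CoefficientOnlyWrapper(torch.nn.Module):
    """Hypothetical wrapper: a frozen base model plus per-layer merging coefficients."""
    def __init__(self, base, num_sources=3):
        super().__init__()
        self.base = base
        # One coefficient per layer per source model: a few hundred trainable scalars.
        self.coeffs = torch.nn.Parameter(
            torch.ones(base.config.num_hidden_layers, num_sources) / num_sources
        )

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",                  # hypothetical base model
    torch_dtype=torch.bfloat16,
)
for p in base.parameters():             # freeze everything except the coefficients
    p.requires_grad = False

model = CoefficientOnlyWrapper(base)
optimizer = torch.optim.AdamW([model.coeffs], lr=1e-3)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                             # shard the frozen weights across GPUs
        "offload_param": {"device": "cpu"},     # optional: spill params to CPU RAM
    },
}

# Launch with something like: deepspeed --num_gpus 8 train_coeffs.py
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)
```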

@ArcherShirou
Author

Thank you for your response. I'm looking forward to the updates to the framework; it's really fantastic work!
