[Topic] The impact of data mixing ratios on training #7
Reference notes. The problem to solve: pretraining draws on data from different sources, and those sources can reinforce one another, conflict with one another, or be entirely unrelated. The questions are then how to evaluate the effect each data source has on model quality, how to adjust the mixing ratios between sources so that capabilities across domains stay balanced, and how to avoid conflicting data while letting related data reinforce each other, so as to get the most out of the model. I stopped reading the rest: neither the problem definition nor the experiments are very detailed; for example, the paper never makes clear how to judge whether two data sources conflict or reinforce each other. The one takeaway from this paper: the data mixing ratio matters. A sketch of what a fixed mixing ratio means in practice follows below.
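To make "mixing ratio" concrete, here is a minimal sketch, not taken from any of the papers above; the domain names, corpora, and weights are all illustrative. It samples each pretraining example from a source chosen according to fixed mixture weights:

```python
import random

# Hypothetical domain corpora; in practice these would be token streams.
domains = {
    "web":  ["web doc 1", "web doc 2"],
    "code": ["code file 1", "code file 2"],
    "math": ["math doc 1", "math doc 2"],
}

# Illustrative mixture weights (must sum to 1); choosing these well is
# exactly the open question raised above.
weights = {"web": 0.6, "code": 0.3, "math": 0.1}

def sample_batch(batch_size: int) -> list[str]:
    """Draw each example from a domain chosen by the mixture weights."""
    names = list(domains)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        d = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(domains[d]))
    return batch

print(sample_batch(8))
```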
Reference: DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
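As background on how DoReMi tunes the mixture automatically: it trains a small proxy model and repeatedly reweights domains via an exponentiated-gradient step on the proxy's excess loss over a reference model. Below is a minimal sketch of that weight update only; the variable names are ours, and the step size and smoothing constant are illustrative, not the paper's settings:

```python
import numpy as np

def doremi_update(weights, proxy_loss, ref_loss, eta=1.0, smooth=1e-3):
    """One DoReMi-style domain-weight update (sketch).

    weights:    current mixture weights over domains (sums to 1)
    proxy_loss: per-domain loss of the small proxy model
    ref_loss:   per-domain loss of a reference model trained on the
                baseline mixture
    """
    excess = np.maximum(proxy_loss - ref_loss, 0.0)  # ignore domains already beating the reference
    w = weights * np.exp(eta * excess)               # exponentiated-gradient step
    w = w / w.sum()                                  # renormalize to a distribution
    u = np.ones_like(w) / len(w)                     # mix in a little uniform mass for stability
    return (1 - smooth) * w + smooth * u

# Illustrative three-domain example: weight shifts toward the domain
# with the largest excess loss.
w = np.array([0.6, 0.3, 0.1])
w = doremi_update(w,
                  proxy_loss=np.array([2.1, 3.0, 2.5]),
                  ref_loss=np.array([2.0, 2.4, 2.6]))
print(w)
```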
Reference: An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Reference: Data Mixing Made Efficient: A Bivariate Scaling Law for Language Model Pretraining. To study how the categories and mixing ratios of training data affect training outcomes, the authors develop a low-cost data-mixing strategy for validating the effect of different data ratios on model quality. Datasets such as Pile and ROOTS both fix the range and proportions of the data distribution by hand. A sketch of the scaling-law idea follows below.
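The low-cost angle is predicting loss as a function of two variables, a domain's mixing proportion and the training scale, from a handful of cheap runs. As an illustration only: the functional form below is a generic bivariate power law of my own choosing, not necessarily the exact law in the paper, and the measurements are made up:

```python
import numpy as np
from scipy.optimize import curve_fit

def bivariate_law(X, A, alpha, beta, C):
    """Generic bivariate power law: loss as a function of a domain's
    mixing proportion r and training steps s. Illustrative form only."""
    r, s = X
    return A / (r**alpha * s**beta) + C

# Hypothetical measurements from a few small, cheap training runs:
# (proportion of the domain, training steps, observed validation loss).
r = np.array([0.1, 0.3, 0.5, 0.1, 0.3, 0.5])
s = np.array([1e3, 1e3, 1e3, 1e4, 1e4, 1e4])
loss = np.array([3.9, 3.6, 3.5, 3.3, 3.0, 2.9])

params, _ = curve_fit(bivariate_law, (r, s), loss,
                      p0=[1.0, 0.1, 0.1, 2.0], maxfev=10_000)

# Extrapolate: predicted loss at a new proportion and a larger scale.
print(bivariate_law((np.array([0.4]), np.array([1e5])), *params))
```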
Reference: Datasets for Large Language Models: A Comprehensive Survey
[8.22-8.30] I want to spend this stretch studying this sub-direction.