Question about sentence splitting in build_instances during pretraining #214

Open
ShadowTeamCN opened this issue Oct 18, 2021 · 1 comment

@ShadowTeamCN
Contributor

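Take the simplest case in MlmDataset as an example: character-level tokenization, with the full-sentence switch turned off.
When a sample is longer than max_length, the sample gets split.
However, at that point there is still only one [CLS] token and one [SEP] token. They come from the document that was passed in earlier, and splitting the sample does not produce any additional head or tail tokens.
Is this behavior expected? In theory, every individual sample should have its own [CLS] at the head and [SEP] at the tail.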
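
To make the behavior concrete, here is a minimal sketch of what I mean. It is not the repository's actual build_instances code: `split_naive` and `split_with_special_tokens` are hypothetical helpers written only for illustration, and the toy document and max_length are made up. The first function slices a document that was already wrapped in [CLS] ... [SEP], the second shows what I would expect instead.

```python
CLS, SEP = "[CLS]", "[SEP]"


def split_naive(document, max_length):
    """Slice an already-wrapped document into windows of max_length tokens.

    Hypothetical helper: mimics the behavior described above, where the
    special tokens are added once per document, before splitting.
    """
    return [document[i:i + max_length]
            for i in range(0, len(document), max_length)]


def split_with_special_tokens(tokens, max_length):
    """Split raw tokens so that every chunk gets its own [CLS] and [SEP].

    Hypothetical helper: two positions per chunk are reserved for the
    special tokens, so each chunk carries at most max_length - 2 tokens.
    """
    body = max_length - 2
    if body <= 0:
        raise ValueError("max_length must be greater than 2")
    return [[CLS] + tokens[i:i + body] + [SEP]
            for i in range(0, len(tokens), body)]


# Toy character-level "document" (assumed data, for illustration only).
tokens = list("abcdefghij")
document = [CLS] + tokens + [SEP]

naive = split_naive(document, max_length=6)
fixed = split_with_special_tokens(tokens, max_length=6)

for chunk in naive:
    print("naive:", chunk)
for chunk in fixed:
    print("fixed:", chunk)
```

With the naive split, only the first chunk starts with [CLS] and only the last chunk ends with [SEP]. With the second version every chunk is wrapped, at the cost of two token positions per chunk.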

@ydli-ai
Collaborator

ydli-ai commented Oct 20, 2021

That makes sense. Let me look into this and confirm.
