Hi,
after looking at the current README of the CLM tokenizer training example, I noticed something strange in the model configuration:
The config.json file looks like this:
GPT2Config { "_name_or_path": "./", "activation_function": "gelu_new", "architectures": [ "GPT2LMHeadModel" ], "attn_pdrop": 0.0, "bos_token_id": 50256, "embd_pdrop": 0.0, "eos_token_id": 50256, "initializer_range": 0.02, "layer_norm_epsilon": 1e-05, "model_type": "gpt2", "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_inner": null, "n_layer": 12, "n_positions": 1024, "reorder_and_upcast_attn": false, "resid_pdrop": 0.0, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": { "text-generation": { "do_sample": true, "max_length": 50 } }, "transformers_version": "4.16.0.dev0", "use_cache": true, "vocab_size": 50257 }
Vocab size is 50257, and eos_token_id is set to 50256. I think setting eos_token_id to 50256 is wrong, because of the following example:
```
In [10]: tokenizer.convert_ids_to_tokens([1797, 705, 225, 50256])
Out[10]: ['hal', 'lo', 'Ġ', 'Ġgeestigheid']
```
Id 50256 originally maps to 'Ġgeestigheid'. I'm not 100% sure, but eos_token_id should probably be set to 50257 (and thus outside the trained vocabulary), because of:
```
In [7]: tokenizer.encode("hallo <|endoftext|>")
Out[7]: [1797, 705, 225, 50257]
```
This shows that eos_token is set to <|endoftext|>, and judging from the tokenizer side, eos_token_id should then be 50257?!
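A quick way to see the mismatch between what the tokenizer assigns to the EOS token and what the config declares (a sketch; the local path and the printed values reflect the observations above):

```python
# Sketch: compare the tokenizer's id for <|endoftext|> with config.eos_token_id.
# "./" is assumed to be the trained model directory from the example.
from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./")
config = AutoConfig.from_pretrained("./")

print(tokenizer.eos_token)                               # '<|endoftext|>'
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))  # 50257 for the trained tokenizer
print(config.eos_token_id)                               # 50256 in config.json
```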
Now I'm using the official GPT-2 model as a reference: it uses "eos_token_id": 50256 in its config.json file. Here are some tokenizer tests:
"eos_token_id": 50256
```
In [6]: tokenizer.eos_token
Out[6]: '<|endoftext|>'

In [7]: tokenizer.eos_token_id
Out[7]: 50256

In [8]: tokenizer.encode("Hello <|endoftext|>")
Out[8]: [15496, 220, 50256]
```
Which is correct.
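In other words, for the official checkpoint the tokenizer and the config agree, and the EOS id is the last id inside the vocabulary. A small sanity check (a sketch using the public gpt2 checkpoint):

```python
# Sketch: tokenizer and config agree on the EOS id for the official checkpoint,
# and that id is the last one inside the vocabulary (vocab_size - 1).
from transformers import GPT2Config, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
cfg = GPT2Config.from_pretrained("gpt2")

assert tok.eos_token_id == cfg.eos_token_id == 50256
assert cfg.eos_token_id == cfg.vocab_size - 1  # 50256 == 50257 - 1
```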
And there's another issue: looking at the tokenizer.json file for the official GPT-2, the following vocab entry exists:
"<|endoftext|>":50256}
which is perfect. But for my own trained vocab this entry does not exist! I'm not sure if this is a bug in the Tokenizers library or intended 🤔
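One possible explanation: if <|endoftext|> is not passed to the tokenizer trainer as a special token, it never gets a reserved entry in tokenizer.json and is only attached afterwards as an added token (hence id 50257, one past the trained vocab). A minimal sketch of training with the special token reserved up front, using the Tokenizers library; the corpus file name and vocab size are assumptions, not the example's exact setup:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["oscar_nl.txt"],            # hypothetical training corpus
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],  # reserve a vocab entry for the EOS token
)
tokenizer.save("tokenizer.json")       # saved file should then contain "<|endoftext|>"
```

Note that trained this way the special token gets its id during training (typically a low id, since it is added first), so config.eos_token_id would have to be set to whatever id ends up in tokenizer.json rather than hard-coded to 50256.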