Description
Hi,
after looking at the current README of the CLM tokenizer training example, I noticed something strange in the model configuration. The config.json file looks like this:
GPT2Config {
"_name_or_path": "./",
"activation_function": "gelu_new",
"architectures": [
"GPT2LMHeadModel"
],
"attn_pdrop": 0.0,
"bos_token_id": 50256,
"embd_pdrop": 0.0,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt2",
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_inner": null,
"n_layer": 12,
"n_positions": 1024,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.0,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"task_specific_params": {
"text-generation": {
"do_sample": true,
"max_length": 50
}
},
"transformers_version": "4.16.0.dev0",
"use_cache": true,
"vocab_size": 50257
}
The vocab size is 50257 and eos_token_id is set to 50256. I think this eos_token_id value is wrong, because of the following example:
In [10]: tokenizer.convert_ids_to_tokens([1797, 705, 225, 50256])
Out[10]: ['hal', 'lo', 'Ġ', 'Ġgeestigheid']
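A quick way to cross-check this (just a minimal sketch, assuming the trained tokenizer and config were both saved to ./, as the config's _name_or_path suggests):

from transformers import AutoConfig, AutoTokenizer

# Hypothetical check, assuming the trained model directory is "./"
config = AutoConfig.from_pretrained("./")
tokenizer = AutoTokenizer.from_pretrained("./")

print(config.vocab_size, config.eos_token_id)                # 50257, 50256
print(tokenizer.convert_ids_to_tokens(config.eos_token_id))  # 'Ġgeestigheid', not '<|endoftext|>'
print(tokenizer.convert_tokens_to_ids("<|endoftext|>"))      # 50257 for this tokenizer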
In the trained tokenizer, id 50256 maps to 'Ġgeestigheid'. I'm not 100% sure, but eos_token_id should instead be 50257 (and thus outside the vocabulary), because of:
In [7]: tokenizer.encode("hallo <|endoftext|>")
Out[7]: [1797, 705, 225, 50257]
This shows that eos_token is set to <|endoftext|>, so judging from the tokenizer side, eos_token_id should then be 50257?!
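If the config is supposed to follow the trained tokenizer, one way to avoid such a mismatch would be to derive these values from the tokenizer instead of hard-coding them. This is only a sketch (assuming the tokenizer was saved to ./ with its eos_token registered), not necessarily what the example intends:

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./")

# Take vocab size and special token ids from the trained tokenizer
# instead of keeping the values of the original GPT-2 checkpoint.
config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
config.save_pretrained("./")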
Now I'm using the official GPT-2 model as a reference: it uses "eos_token_id": 50256 in its config.json file. Some tokenizer tests:
In [6]: tokenizer.eos_token
Out[6]: '<|endoftext|>'
In [7]: tokenizer.eos_token_id
Out[7]: 50256
In [8]: tokenizer.encode("Hello <|endoftext|>")
Out[8]: [15496, 220, 50256]
Which is correct.
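For comparison, the same consistency check passes for the official checkpoint (quick sketch):

from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Config and tokenizer agree: <|endoftext|> is id 50256,
# inside the 50257-token vocabulary.
assert config.eos_token_id == tokenizer.eos_token_id == 50256
assert tokenizer.convert_ids_to_tokens(50256) == "<|endoftext|>"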
And there's another issue: looking at the tokenizer.json file for GPT-2, the following entry exists:
"<|endoftext|>":50256}
which is perfect. But for my own trained vocab this entry does not exist! I'm not sure if this is a bug in the Tokenizers library or intended 🤔
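In case it helps: if the trained tokenizer is supposed to contain <|endoftext|>, something along these lines would register it during training. This is only a sketch with the Tokenizers library (the corpus file name is a placeholder), and note that the special token would then get a low id rather than 50256:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["my_corpus.txt"],           # placeholder corpus file
    vocab_size=50257,
    special_tokens=["<|endoftext|>"],  # make sure the token ends up in the vocab
)
tokenizer.save("tokenizer.json")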