
[JAX/FLAX]: CLM Tokenizer Training confusion #15072


Description

@stefan-it

Hi,

after looking at the current README of the CLM tokenizer training example, I noticed something strange in the resulting model configuration:

The config.json file looks like this:

GPT2Config {
  "_name_or_path": "./",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.16.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}
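
For context, if I read the README correctly, this config is created by starting from the pretrained gpt2 config and only overriding the dropout values and the vocab size, roughly like this (a paraphrased sketch, the save path is illustrative):

from transformers import GPT2Config

# start from the pretrained gpt2 config and override a few values;
# bos_token_id / eos_token_id = 50256 are then simply inherited from the base gpt2 config
config = GPT2Config.from_pretrained(
    "gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257
)
config.save_pretrained("./")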

The vocab size is 50257, and eos_token_id is set to 50256. I think this eos_token_id value is wrong, because of the following example:

In [10]: tokenizer.convert_ids_to_tokens([1797, 705, 225, 50256])
Out[10]: ['hal', 'lo', 'Ġ', 'Ġgeestigheid']

ID 50256 actually maps to 'Ġgeestigheid'. I'm not 100% sure, but eos_token_id should probably be set to 50257 (and thus outside the trained vocabulary), because of:

In [7]: tokenizer.encode("hallo <|endoftext|>")
Out[7]: [1797, 705, 225, 50257]

This shows that eos_token is set to <|endoftext|>, and judging from the tokenizer, eos_token_id should then be set to 50257?!
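
A minimal sketch of the mismatch, assuming the newly trained tokenizer and the config.json above are both saved in the current directory:

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./")
config = AutoConfig.from_pretrained("./")

print(tokenizer.eos_token, tokenizer.eos_token_id)  # '<|endoftext|>' 50257
print(config.eos_token_id, config.vocab_size)       # 50256 50257

# the two disagree: the config's eos_token_id points at 'Ġgeestigheid',
# and the tokenizer's actual eos id (50257) is not even covered by config.vocab_size

A possible workaround would be to sync them explicitly before training, e.g. config.eos_token_id = tokenizer.eos_token_id and config.vocab_size = len(tokenizer), but maybe the example should do that itself.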

Now I'm using the official GPT-2 model as a reference:

It uses "eos_token_id": 50256 in its config.json file; here are some tokenizer tests:

In [6]: tokenizer.eos_token
Out[6]: '<|endoftext|>'

In [7]: tokenizer.eos_token_id
Out[7]: 50256

In [8]: tokenizer.encode("Hello <|endoftext|>")
Out[8]: [15496, 220, 50256]

Which is correct.
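
For comparison, the same kind of consistency check against the official model passes (a sketch, loading gpt2 from the Hub):

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
config = AutoConfig.from_pretrained("gpt2")

assert tokenizer.eos_token_id == config.eos_token_id == 50256
assert len(tokenizer) == config.vocab_size == 50257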

And there's another issue: looking at the tokenizer.json file for GPT-2, the following vocabulary entry exists:

"<|endoftext|>":50256}

which is perfect, but for my own trained vocab this entry does not exist! I'm not sure if this is a bug in the Tokenizers library or intended 🤔
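
One way to see where the special token ends up is to inspect the raw JSON directly (a rough sketch; the field names follow the tokenizers JSON layout, the file path is illustrative):

import json

with open("tokenizer.json") as f:   # the newly trained tokenizer file
    tok = json.load(f)

# is <|endoftext|> part of the BPE vocab itself? (True for the official GPT-2 file)
print("<|endoftext|>" in tok["model"]["vocab"])

# for my trained tokenizer it only shows up in the added special tokens
print([(t["id"], t["content"]) for t in tok["added_tokens"]])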
