
[JAX/FLAX]: CLM Tokenizer Training confusion #15072

Open
stefan-it opened this issue Jan 7, 2022 · 0 comments

stefan-it commented Jan 7, 2022

Hi,

after looking at the current README of the CLM tokenizer training example, I noticed something strange in the model configuration:

The config.json file looks like this:

GPT2Config {
  "_name_or_path": "./",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.16.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

Vocab size is 50257, and eos_token_id is set to 50256. I think this eos_token_id value is wrong, as the following example shows:

In [10]: tokenizer.convert_ids_to_tokens([1797, 705, 225, 50256])
Out[10]: ['hal', 'lo', 'Ġ', 'Ġgeestigheid']

Id 50256 actually maps to 'Ġgeestigheid' in the trained vocab. I'm not 100% sure, but eos_token_id should rather be 50257 (and thus lie outside the configured vocab size), because of:

In [7]: tokenizer.encode("hallo <|endoftext|>")
Out[7]: [1797, 705, 225, 50257]

This shows that eos_token is set to <|endoftext|>, and from the tokenizer's point of view eos_token_id should then be 50257?!
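
For illustration, here is a minimal consistency check as a sketch (it assumes the config and tokenizer were saved to ./ as in the example; loading them via AutoConfig/AutoTokenizer is just one way to do it):

# Sketch: compare the EOS id in the config with the one the tokenizer actually uses.
# The model directory is an assumption (the example saves everything to "./").
from transformers import AutoConfig, AutoTokenizer

model_dir = "./"
config = AutoConfig.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

print("config.eos_token_id   :", config.eos_token_id)
print("tokenizer.eos_token   :", tokenizer.eos_token)
print("tokenizer.eos_token_id:", tokenizer.eos_token_id)
print("config.vocab_size     :", config.vocab_size, "| len(tokenizer):", len(tokenizer))

if config.eos_token_id != tokenizer.eos_token_id:
    print("Mismatch: config and tokenizer disagree on the EOS id.")

For the official gpt2 checkpoint these values all line up; for the freshly trained tokenizer described above they don't.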

Now I'm using the official GPT-2 model as a reference:

It uses "eos_token_id": 50256 in its config.json file. Some tokenizer tests:

In [6]: tokenizer.eos_token
Out[6]: '<|endoftext|>'

In [7]: tokenizer.eos_token_id
Out[7]: 50256

In [8]: tokenizer.encode("Hello <|endoftext|>")
Out[8]: [15496, 220, 50256]

Which is correct.

And there's another issue: looking at the tokenizer.json file for GPT-2, the following vocabulary entry exists:

"<|endoftext|>":50256}

which is perfect, but for the self-trained vocab this entry does not exist! I'm not sure whether this is a bug in the Tokenizers library or intended behaviour 🤔
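
For illustration, a minimal sketch of how the <|endoftext|> entry could be guaranteed when training the tokenizer (I haven't verified how the README trains it; the corpus path and hyperparameters below are placeholders, not taken from the example):

# Sketch only: ensure "<|endoftext|>" gets an explicit vocabulary entry.
# Corpus file name and hyperparameters are assumptions.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=50257,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # reserves an id for the EOS token inside vocab_size
)
tokenizer.save("./tokenizer.json")

# The reserved id can then be mirrored into the model config:
print("eos id:", tokenizer.token_to_id("<|endoftext|>"))

With the special token reserved up front, tokenizer.json contains the <|endoftext|> entry and config.eos_token_id can simply be set to tokenizer.token_to_id("<|endoftext|>").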

@stefan-it changed the title from [JAX/FLAX] CLM Tokenizer confusion to [JAX/FLAX]: CLM Tokenizer Training confusion on Jan 7, 2022
@huggingface deleted a comment from the github-actions bot on Feb 7, 2022
@patil-suraj added the WIP label on Feb 7, 2022