
[JAX/FLAX]: CLM Tokenizer Training confusion #15072


Description

@stefan-it

Hi,

after looking at the current README of the CLM tokenizer training example, I noticed something strange in the resulting model configuration:

The config.json file looks like this:

GPT2Config {
  "_name_or_path": "./",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.16.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}
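
For context, if I read the README correctly, this config is created by starting from the pretrained gpt2 config and only overriding the dropout values and the vocab size, roughly like this (a paraphrased sketch, the save path is illustrative):

from transformers import GPT2Config

# start from the pretrained gpt2 config and override a few values;
# bos_token_id / eos_token_id = 50256 are then simply inherited from the base gpt2 config
config = GPT2Config.from_pretrained(
    "gpt2", resid_pdrop=0.0, embd_pdrop=0.0, attn_pdrop=0.0, vocab_size=50257
)
config.save_pretrained("./")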

The vocab size is 50257, and eos_token_id is set to 50256. I think this eos_token_id value is wrong, because of the following example:

In [10]: tokenizer.convert_ids_to_tokens([1797, 705, 225, 50256])
Out[10]: ['hal', 'lo', 'Ġ', 'Ġgeestigheid']

ID 50256 actually maps to 'Ġgeestigheid'. I'm not 100% sure, but eos_token_id should probably be set to 50257 (and thus outside the trained vocabulary), because of:

In [7]: tokenizer.encode("hallo <|endoftext|>")
Out[7]: [1797, 705, 225, 50257]

This shows that eos_token is set to <|endoftext|>, and judging from the tokenizer, eos_token_id should then be set to 50257?!
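
A minimal sketch of the mismatch, assuming the newly trained tokenizer and the config.json above are both saved in the current directory:

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./")
config = AutoConfig.from_pretrained("./")

print(tokenizer.eos_token, tokenizer.eos_token_id)  # '<|endoftext|>' 50257
print(config.eos_token_id, config.vocab_size)       # 50256 50257

# the two disagree: the config's eos_token_id points at 'Ġgeestigheid',
# and the tokenizer's actual eos id (50257) is not even covered by config.vocab_size

A possible workaround would be to sync them explicitly before training, e.g. config.eos_token_id = tokenizer.eos_token_id and config.vocab_size = len(tokenizer), but maybe the example should do that itself.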

Now I'm using the official GPT-2 model as a reference:

It uses "eos_token_id": 50256 in its config.json file; here are some tokenizer tests:

In [6]: tokenizer.eos_token
Out[6]: '<|endoftext|>'

In [7]: tokenizer.eos_token_id
Out[7]: 50256

In [8]: tokenizer.encode("Hello <|endoftext|>")
Out[8]: [15496, 220, 50256]

Which is correct.
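
For comparison, the same kind of consistency check against the official model passes (a sketch, loading gpt2 from the Hub):

from transformers import AutoConfig, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
config = AutoConfig.from_pretrained("gpt2")

assert tokenizer.eos_token_id == config.eos_token_id == 50256
assert len(tokenizer) == config.vocab_size == 50257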

And there's another issue: looking at the tokenizer.json file for GPT-2, the following vocabulary entry exists:

"<|endoftext|>":50256}

which is perfect, but for my own trained vocab this entry does not exist! I'm not sure if this is a bug in the Tokenizers library or intended 🤔
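
One way to see where the special token ends up is to inspect the raw JSON directly (a rough sketch; the field names follow the tokenizers JSON layout, the file path is illustrative):

import json

with open("tokenizer.json") as f:   # the newly trained tokenizer file
    tok = json.load(f)

# is <|endoftext|> part of the BPE vocab itself? (True for the official GPT-2 file)
print("<|endoftext|>" in tok["model"]["vocab"])

# for my trained tokenizer it only shows up in the added special tokens
print([(t["id"], t["content"]) for t in tok["added_tokens"]])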
