[NLP] Tokenizers should not be recreated on subsequent train #156

Open

LexABzH opened this issue May 17, 2023 · 0 comments
Labels: 1 - Critic (Critic issue), bug (Something isn't working), NLP (Issue related to the NLP template)

LexABzH (Collaborator) commented May 17, 2023

Describe the bug

All Keras models create and fit a Tokenizer in the _prepare_x_train method.

However, the tokenizer should not be recreated and refitted on a subsequent training run.
If a new tokenizer is created, its word index no longer matches the model's embedding matrix.
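
The mismatch can be reproduced with a bare Keras Tokenizer. The snippet below is a minimal standalone sketch (not template code) showing how fitting a new tokenizer reshuffles the word index while the embedding matrix stays aligned to the first vocabulary:

from tensorflow.keras.preprocessing.text import Tokenizer

first = Tokenizer(num_words=100)
first.fit_on_texts(["la relation client est importante"])
print(first.word_index)   # {'la': 1, 'relation': 2, 'client': 3, 'est': 4, 'importante': 5}

second = Tokenizer(num_words=100)
second.fit_on_texts(["merci de vous adresser au service client"])
print(second.word_index)  # {'merci': 1, 'de': 2, 'vous': 3, ...}

# Index 2 now refers to a different word, but row 2 of the embedding matrix
# was trained for 'relation': every lookup after the second fit is wrong.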

Concerned template

  • NLP template
  • NUM template
  • VISION template
  • API template
  • How templates are generated - Jinja

To Reproduce

Steps to reproduce the behavior:

  1. Create and fit a Keras model on a first dataset
  2. Reload the model, and continue training on a second dataset
model.fit(x_train_1, y_train_1)
print(model.tokenizer.index_word[50])
>>> 'relation'
print(model.model.layers[1].call(tf.convert_to_tensor([50])))
>>> returns the 'relation' embedding

model.fit(x_train_2, y_train_2)
print(model.tokenizer.index_word[50])
>>> 'vous'
print(model.model.layers[1].call(tf.convert_to_tensor([50])))
>>> returns the 'relation' embedding instead of the 'vous' embedding!

Expected behavior

The model's Tokenizer stays the same, and a given word keeps the same index.

Actual behavior

A new Tokenizer is created, and a given word gets a new index that no longer matches the model's embedding matrix.

Is this a regression?

That is, did this use to work the way you expected in the past?

  • Yes
  • No

Debug info

  • Gabarit version: 1.3.1
  • Python version: any
  • OS version: any

Additional context

A solution could be to add a check before creating a new Tokenizer.
For example:

# Only create and fit a new tokenizer on the first training run
if not self.trained and self.tokenizer is None:
    self.tokenizer = Tokenizer(num_words=self.max_words, filters=self.tokenizer_filters)
    self.logger.info('Fitting the tokenizer')
    self.tokenizer.fit_on_texts(x_train)
return self._get_sequence(x_train, self.tokenizer, self.max_sequence_length, padding=self.padding, truncating=self.truncating)
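
As a self-contained illustration of that guard (DummyModel and prepare are hypothetical names, not the template's real classes), the sketch below shows that the word index survives a second training run once the tokenizer is reused:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


class DummyModel:
    def __init__(self, max_words=1000, max_sequence_length=10):
        self.max_words = max_words
        self.max_sequence_length = max_sequence_length
        self.tokenizer = None
        self.trained = False

    def prepare(self, x_train):
        # Guard: fit the tokenizer only on the first training run
        if not self.trained and self.tokenizer is None:
            self.tokenizer = Tokenizer(num_words=self.max_words)
            self.tokenizer.fit_on_texts(x_train)
        return pad_sequences(self.tokenizer.texts_to_sequences(x_train),
                             maxlen=self.max_sequence_length)


model = DummyModel()
model.prepare(["la relation client est importante"])
index_before = dict(model.tokenizer.word_index)
model.trained = True  # a real model would set this after fit()
model.prepare(["merci de vous adresser au service client"])
assert model.tokenizer.word_index == index_before  # word index is stable across trainings

With the check in place, the second training reuses the fitted tokenizer, so a given index keeps pointing at the same word and the same embedding row.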