[NLP] Tokenizers should not be recreated on subsequent train #156

Open

LexABzH opened this issue May 17, 2023 · 0 comments
Labels: 1 - Critic (Critic issue), bug (Something isn't working), NLP (Issue related to the NLP template)

LexABzH (Collaborator) commented May 17, 2023

Describe the bug

All Keras models create and fit a Tokenizer in the _prepare_x_train method.

However, the tokenizer should not be recreated and refitted on a subsequent training run.
If a new tokenizer is created, its word index no longer matches the model's embedding matrix.
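
The mismatch can be reproduced with a bare Keras Tokenizer. The snippet below is a minimal standalone sketch (not template code) showing how fitting a new tokenizer reshuffles the word index while the embedding matrix stays aligned to the first vocabulary:

from tensorflow.keras.preprocessing.text import Tokenizer

first = Tokenizer(num_words=100)
first.fit_on_texts(["la relation client est importante"])
print(first.word_index)   # {'la': 1, 'relation': 2, 'client': 3, 'est': 4, 'importante': 5}

second = Tokenizer(num_words=100)
second.fit_on_texts(["merci de vous adresser au service client"])
print(second.word_index)  # {'merci': 1, 'de': 2, 'vous': 3, ...}

# Index 2 now refers to a different word, but row 2 of the embedding matrix
# was trained for 'relation': every lookup after the second fit is wrong.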

Concerned template

  • NLP template
  • NUM template
  • VISION template
  • API template
  • How templates are generated - Jinja

To Reproduce

Steps to reproduce the behavior:

  1. Create and fit a Keras model on a first dataset
  2. Reload the model, and continue training on a second dataset
model.fit(x_train_1, y_train_1)
print(model.tokenizer.index_word[50])
>>> 'relation'
print(model.model.layers[1].call(tf.convert_to_tensor([50])))
>>> returns the 'relation' embedding

model.fit(x_train_2, y_train_2)
print(model.tokenizer.index_word[50])
>>> 'vous'
print(model.model.layers[1].call(tf.convert_to_tensor([50])))
>>> returns the 'relation' embedding instead of the 'vous' embedding!

Expected behavior

The model's Tokenizer stays the same, and a given word keeps the same index.

Actual behavior

A new Tokenizer is created, and a given word gets a new index that no longer matches the model's embedding matrix.

Is this a regression?

That is, did this use to work the way you expected in the past?

  • Yes
  • No

Debug info

  • Gabarit version: 1.3.1
  • Python version: any
  • OS version: any

Additional context

A solution could be to add a check before creating a new Tokenizer.
For example:

# Only create and fit a new tokenizer on the first training run
if not self.trained and self.tokenizer is None:
    self.tokenizer = Tokenizer(num_words=self.max_words, filters=self.tokenizer_filters)
    self.logger.info('Fitting the tokenizer')
    self.tokenizer.fit_on_texts(x_train)
return self._get_sequence(x_train, self.tokenizer, self.max_sequence_length, padding=self.padding, truncating=self.truncating)
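
As a self-contained illustration of that guard (DummyModel and prepare are hypothetical names, not the template's real classes), the sketch below shows that the word index survives a second training run once the tokenizer is reused:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


class DummyModel:
    def __init__(self, max_words=1000, max_sequence_length=10):
        self.max_words = max_words
        self.max_sequence_length = max_sequence_length
        self.tokenizer = None
        self.trained = False

    def prepare(self, x_train):
        # Guard: fit the tokenizer only on the first training run
        if not self.trained and self.tokenizer is None:
            self.tokenizer = Tokenizer(num_words=self.max_words)
            self.tokenizer.fit_on_texts(x_train)
        return pad_sequences(self.tokenizer.texts_to_sequences(x_train),
                             maxlen=self.max_sequence_length)


model = DummyModel()
model.prepare(["la relation client est importante"])
index_before = dict(model.tokenizer.word_index)
model.trained = True  # a real model would set this after fit()
model.prepare(["merci de vous adresser au service client"])
assert model.tokenizer.word_index == index_before  # word index is stable across trainings

With the check in place, the second training reuses the fitted tokenizer, so a given index keeps pointing at the same word and the same embedding row.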