Describe the bug

All Keras models create and fit a `Tokenizer` in the `_prepare_x_train` method.
However, the tokenizer should not be recreated and refitted on a subsequent call to `fit`: if a new tokenizer is created, the word indices no longer match the embedding matrix.
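The mismatch can be illustrated with Keras' `Tokenizer` alone, outside of this project's model classes. Below is a minimal sketch (the corpora and variable names are made up for illustration): fitting a brand-new tokenizer on a different corpus assigns the same integer index to a different word, so that index no longer points to the right row of an embedding matrix built from the first tokenizer.

```python
# Minimal sketch (not this project's code): shows why a refit on a fresh
# Tokenizer breaks the mapping between word indices and embedding rows.
from tensorflow.keras.preprocessing.text import Tokenizer

corpus_1 = ["the relation between words", "words have a relation"]
corpus_2 = ["vous avez des mots", "des mots pour vous"]

tok_1 = Tokenizer(num_words=1000)
tok_1.fit_on_texts(corpus_1)

tok_2 = Tokenizer(num_words=1000)   # brand-new tokenizer, refit on a new corpus
tok_2.fit_on_texts(corpus_2)

idx = tok_1.word_index["relation"]
print(tok_1.index_word[idx])        # 'relation'
print(tok_2.index_word.get(idx))    # some other word (or None): the same index
                                    # now refers to a different embedding row
```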
To Reproduce

Steps to reproduce the behavior:

```python
model.fit(x_train_1, y_train_1)
print(model.tokenizer.index_word[50])
>>> 'relation'
print(model.model.layers[1].call(tf.convert_to_tensor([50])))
>>> returns the 'relation' embedding

model.fit(x_train_2, y_train_2)
print(model.tokenizer.index_word[50])
>>> 'vous'
print(model.model.layers[1].call(tf.convert_to_tensor([50])))
>>> returns the 'relation' embedding instead of the 'vous' embedding!
```
Expected behavior

The model's Tokenizer stays the same, and a given word keeps the same index.
Actual behavior

A new Tokenizer is created, and a given word gets a new index that no longer matches the model's embedding matrix.
Is this a regression?

That is, did this use to work the way you expected in the past?
Additional context

A solution could be to add a check before creating a new Tokenizer, for example:

```python
if not self.trained and self.tokenizer is None:
    self.tokenizer = Tokenizer(num_words=self.max_words, filters=self.tokenizer_filters)
    self.logger.info('Fitting the tokenizer')
    self.tokenizer.fit_on_texts(x_train)
return self._get_sequence(x_train, self.tokenizer, self.max_sequence_length,
                          padding=self.padding, truncating=self.truncating)
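```

With such a guard in place, a second call to `fit` would reuse the tokenizer fitted on the first call, so the check from the reproduction steps above would be expected to pass. A rough sketch of that expectation, reusing the `model`, `x_train_2` and `y_train_2` names from the reproduction (this is not a test from the repository):

```python
# Sketch of the expected behaviour after the fix.
tokenizer_before = model.tokenizer
word_before = model.tokenizer.index_word[50]

model.fit(x_train_2, y_train_2)

assert model.tokenizer is tokenizer_before             # same Tokenizer instance reused
assert model.tokenizer.index_word[50] == word_before   # index 50 still maps to the same word
```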