LSTM layer doesn't learn with TextVectorization output (padding?) #20898
So I had a look at the way the TextVectorization object does the padding ( https://github.com/keras-team/keras/blob/v3.8.0/keras/src/layers/preprocessing/text_vectorization.py#L568 ). It transforms the input into a `RaggedTensor` and then converts it to a dense tensor, which pads at the end. It relates to this tensorflow issue ( tensorflow/tensorflow#34793 (comment) ) that asked for a pre-padding option in that ragged-to-dense conversion. However, he proposed a function to pre-pad a 2D RaggedTensor that might be general enough for the TextVectorization case:
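The proposed function itself isn't quoted here, but the idea can be sketched as a reverse-pad-reverse trick; the name `pre_pad_ragged` is made up for illustration and this is not necessarily the exact code from the TensorFlow issue:

```python
import tensorflow as tf

def pre_pad_ragged(rt, pad_value=0):
    """Densify a 2D RaggedTensor so the padding ends up at the front of each row."""
    # Reverse each (ragged) row, let to_tensor() pad at the end as usual,
    # then reverse the dense result so the pad values sit at the beginning.
    reversed_rows = tf.reverse(rt, axis=[1])
    dense = reversed_rows.to_tensor(default_value=pad_value)
    return tf.reverse(dense, axis=[1])

rt = tf.ragged.constant([[1, 2, 3], [4, 5]])
print(pre_pad_ragged(rt).numpy())
# [[1 2 3]
#  [0 4 5]]
```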
That could be an inspiration to add something like this to support pre-padding in TextVectorization. Buuuuuut, maybe that's not the right way, and maybe it would be a better idea to look at the LSTM layer and make it handle post-padded input by default, I don't know.
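In the meantime, a possible user-side workaround (just a sketch, not a proposal for the TextVectorization API) would be to request ragged output from the layer (`ragged=True`, TensorFlow backend only) and pre-pad it in the `tf.data` pipeline with a helper like the one above; `raw_train_ds` stands for the raw (text, label) dataset from the tutorial:

```python
import keras

max_features = 20000

# Return a RaggedTensor of token ids instead of a post-padded dense tensor.
vectorize_layer = keras.layers.TextVectorization(
    max_tokens=max_features, output_mode="int", ragged=True
)
vectorize_layer.adapt(raw_train_ds.map(lambda text, label: text))

def vectorize_pre_padded(text, label):
    tokens = vectorize_layer(text)          # 2D RaggedTensor (batch, None)
    return pre_pad_ragged(tokens), label    # dense tensor, zeros at the front

train_ds = raw_train_ds.map(vectorize_pre_padded)
```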
This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.
Hi @mehtamansi29. Moreover, if I reverse the order of the input sequences by passing `go_backwards=True` to the LSTM layer, then it learns correctly. So to me this is still a regression compared to Keras 2 and should be labeled as a bug, which seems to be related to the padding produced by `TextVectorization`.
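For what it's worth, the post-padding is easy to observe directly on a toy example (the token indices below are only illustrative):

```python
import tensorflow as tf
import keras

vectorize_layer = keras.layers.TextVectorization(
    max_tokens=1000, output_mode="int", output_sequence_length=10
)
vectorize_layer.adapt(tf.constant(["the cat sat on the mat", "a short text"]))

print(vectorize_layer(tf.constant(["a short text"])))
# e.g. [[7 6 5 0 0 0 0 0 0 0]]  -> zeros at the end, i.e. "post"-padding
```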
I just realized I couldn't edit your gist, so here is a copy that only adds that change.
I have an issue on my own textual data when training an LSTM for sentiment analysis. I used to encode the textual data with the old `Tokenizer` + `pad_sequences` way (Keras 2), and switched to the new `TextVectorization` object, but it doesn't learn anymore (losses don't change, accuracies around .50). So I tried with an example from the documentation: I reran the example "Text classification from scratch" from the doc ( https://keras.io/examples/nlp/text_classification_from_scratch/ ), and it works fine as is (validation accuracy is going up).

Then I replaced the Conv1D and GlobalMaxPooling1D layers by an LSTM layer, and the model doesn't learn: the train and validation accuracies stay around .50. However, if I pass `go_backwards=True` to the LSTM layer, then it learns correctly (but consequently also reads each text backwards).

It might be due to the fact that the `TextVectorization` layer "post"-pads the input (the x input vectors are filled with zeros at the end), whereas the LSTM layer expects "pre"-padded inputs (the x input vectors are filled with zeros at the beginning), and thus doesn't iterate at all on the input tokens.

Indeed, the "Bidirectional LSTM on IMDB" example ( https://keras.io/examples/nlp/bidirectional_lstm_imdb/ ) works well, as it loads an already tokenized (but not padded) version of IMDB, and thus doesn't use the `TextVectorization` layer, but the `keras.utils.pad_sequences` function, which "pre"-pads by default. What is weird though is that it still does learn when setting `padding='post'`, but the validation accuracy goes up much more slowly at each epoch than with `padding='pre'`. So it might be more complicated than it seems, but it still looks like a padding issue.

Still, it might be easily solved by allowing the user to choose whether to pre- or post-pad in the `TextVectorization` layer, similarly to the `padding` parameter of the `keras.utils.pad_sequences` function.

I had the same results on two different machines, on CPU and GPU (GeForce GTX 1650, 6 GB), with the TensorFlow and JAX backends, on Keras 3.8.0, with Python 3.10 and 3.12.

Here is the modified example from "Text classification from scratch" with an LSTM layer instead of the Conv1D and GlobalMaxPooling1D layers:
Standalone code to reproduce the issue
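A minimal sketch of the modified setup, assuming `raw_train_ds` and `raw_val_ds` are built as in the linked tutorial; the LSTM size and other hyperparameters are illustrative, not necessarily the exact values used:

```python
import keras
from keras import layers

max_features = 20000
sequence_length = 500
embedding_dim = 128

# TextVectorization post-pads each sequence with zeros up to sequence_length.
vectorize_layer = layers.TextVectorization(
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)
vectorize_layer.adapt(raw_train_ds.map(lambda text, label: text))

def vectorize_text(text, label):
    return vectorize_layer(text), label

train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)

# Same model as the tutorial, but with the Conv1D + GlobalMaxPooling1D block
# replaced by a single LSTM layer.
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)
x = layers.LSTM(64)(x)  # accuracy stays around .50; with go_backwards=True it learns
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs, outputs)
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=3)
```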
Relevant log output