Replies: 1 comment
This was happening because there were NaNs in some of my text, which I found when I did a deep dive. However, I am still interested in knowing whether there is a DataGenerator setup for text data in KerasNLP.
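For reference, a minimal sketch of checking for and dropping such NaNs before building a text pipeline. This assumes the data lives in a pandas column named `text` (both the frame and the column name here are hypothetical stand-ins):

```python
import pandas as pd

# Hypothetical frame; "text" stands in for whatever column feeds the model.
df = pd.DataFrame({"text": ["good sample", None, "another sample", float("nan")]})

# Missing entries surface as NaN/None and break string ops downstream.
bad_rows = df[df["text"].isna()]
print(len(bad_rows))  # 2

# Drop them before building the tf.data pipeline.
df_clean = df.dropna(subset=["text"]).reset_index(drop=True)
```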
I have modified a Google Colab tutorial to fine-tune GPT-2. I have a large text dataset and only 22.3 GB of GPU VRAM (an L4 GPU on a high-RAM instance).
My package versions are:
I can load my data to a list on the instance:
However, if I try to read the whole dataset as below, my instance crashes.
```python
tf_train_ds = tf.data.Dataset.from_tensor_slices(training_list).batch(28)
```
I believe I need to create a DataGenerator that can load my data in batches of 28, which is the most I can fit into memory with 22.3 GB of VRAM. Am I correct? If not, how else can I manage training with this "large" dataset?
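One common way to avoid materializing the whole list as a single tensor is to stream it with `tf.data.Dataset.from_generator`. A minimal sketch, assuming the data is a Python list of strings called `training_list` (the small list here is just a stand-in):

```python
import tensorflow as tf

# Stand-in for the real (much larger) list of training strings.
training_list = [f"training sample {i}" for i in range(100)]

def sample_generator():
    # Yields one string at a time, so the full dataset is never
    # converted to a single in-memory tensor up front.
    for text in training_list:
        yield text

tf_train_ds = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string),
    )
    .batch(28)                   # same batch size as before
    .prefetch(tf.data.AUTOTUNE)  # overlap data loading with training
)

first_batch = next(iter(tf_train_ds))
print(int(first_batch.shape[0]))  # 28
```

If the raw text does not even fit in host RAM, writing it to text files and reading them with `tf.data.TextLineDataset` follows the same pattern. Note that the batch size mainly governs per-step memory; streaming specifically avoids the `from_tensor_slices` crash, since that call copies the entire list into one tensor.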
For more context, the below is how I do the fitting, which runs fine for datasets that fit in memory:
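(The actual fitting code did not survive the page scrape.) As a stand-in illustration only, a tiny Keras model, not GPT-2, showing that `model.fit` consumes a batched `tf.data` dataset directly; the KerasNLP GPT-2 tutorial's model accepts a dataset the same way:

```python
import tensorflow as tf

# Toy data and model purely for illustration; the real setup fine-tunes GPT-2.
xs = tf.random.uniform((100, 4))
ys = tf.random.uniform((100, 1))
ds = tf.data.Dataset.from_tensor_slices((xs, ys)).batch(28)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# fit() iterates the dataset batch by batch; one epoch here.
history = model.fit(ds, epochs=1, verbose=0)
print(len(history.history["loss"]))  # 1
```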