Memory leak in ParquetDataset #998

welsonzhang · 2024-06-19T10:37:51Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
DeepRec version or commit id: 1.15.5+deeprec2208
Python version: python3.6
Bazel version (if compiling from source): bazel 0.26.1
GCC/Compiler version (if compiling from source): 5.4.0
CUDA/cuDNN version: none

Describe the current behavior
Memory leak in ParquetDataset has occured, after run python code, the memory has increase to 3Gb

Describe the expected behavior
Memory stable in ParquetDataset

Code to reproduce the issue

import os

import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops

fields = get_parquet_fields_type(conf)

def make_initializable_iterator(ds):
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()

filename = './output.parquet'

def build_input_fn():
    def parse_parquet(record):
      label = record.pop("clk")
      return record, label
    return parse_parquet


def build_dataset():
    dataset = ParquetDataset(
        filename,
        batch_size=256,
        fields=fields,
        drop_remainder=True,
        )
    dataset = dataset.map(build_input_fn())
    dataset = dataset.prefetch(2)
    return dataset

for i in range(1000):
    ds = build_dataset()
    iterator = make_initializable_iterator(ds)
    with tf.Session() as sess:
      sess.run(iterator.initializer)

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

The text was updated successfully, but these errors were encountered:

JackMoriarty · 2024-06-19T10:48:17Z

Why create 1000 Datasets? This will create 1000 Dataset objects in memory.

welsonzhang · 2024-06-20T02:21:46Z

oh sorry, I make a misstake for the iterator. when the iterator is 180, only cost 0.5M. but when the iterator increase to 1000, it will cost 3Gb, why and any adivce to release the dataset?

in our scenario, data source will store in hadoop directory in each hour or each day, we will train 3 month or online learing every 15 min。

day train, for example:
├── data
├── 20240301 # training day1
...
├── 20240530 # training day90

for day in (day1, day90):
     for steps in [eval, train]:
         dataset = build_dataset()
         eval or train

JackMoriarty · 2024-06-20T09:40:29Z

The ParquetDataset supports accepting a list of files.

filenames = [file1, file2] # all parquet file for training
dataset = ParquetDataset(filenames, ...)

Only create one 'ParquetDataset' for training, and another one 'ParquetDataset' for eval.

welsonzhang · 2024-06-20T10:04:07Z

it does create only one 'ParquetDataset' in one day, we will eval_train 90 days dataset, so it will create 180 'ParquetDataset'

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Memory leak in ParquetDataset #998

Memory leak in ParquetDataset #998

welsonzhang commented Jun 19, 2024

JackMoriarty commented Jun 19, 2024

welsonzhang commented Jun 20, 2024 •

edited

Loading

JackMoriarty commented Jun 20, 2024

welsonzhang commented Jun 20, 2024

Memory leak in ParquetDataset #998

Memory leak in ParquetDataset #998

Comments

welsonzhang commented Jun 19, 2024

JackMoriarty commented Jun 19, 2024

welsonzhang commented Jun 20, 2024 • edited Loading

JackMoriarty commented Jun 20, 2024

welsonzhang commented Jun 20, 2024

welsonzhang commented Jun 20, 2024 •

edited

Loading