Description
Describe the bug, including details regarding any error messages, version, and platform.
Since 21.0.0, memory usage grows substantially when repeatedly reading a Parquet dataset from local disk. With version 20.0.0 the memory usage grew much less.
Script to reproduce
import tempfile

import numpy
import pyarrow
import pyarrow.dataset
from memory_profiler import profile


def test_memory_leak():
    # Build a table of 10 float64 columns x 5 million rows (~400 MB).
    num_columns = 10
    num_rows = 5_000_000
    data = {f"col_{i}": numpy.random.rand(num_rows) for i in range(num_columns)}
    table = pyarrow.Table.from_pydict(data)

    with tempfile.TemporaryDirectory() as temp_dir:
        pyarrow.dataset.write_dataset(table, temp_dir, format="parquet")

        @profile
        def read():
            return pyarrow.dataset.dataset(temp_dir).to_table()

        # Read the same dataset 50 times; memory usage should stay roughly flat.
        for _ in range(50):
            read()


if __name__ == "__main__":
    test_memory_leak()
Environment
Ubuntu 24.04.2 LTS
Tested with Python 3.10.15 and Python 3.12.3
Python packages:
$ pip freeze
memory-profiler==0.61.0
numpy==2.3.2
psutil==7.0.0
pyarrow==21.0.0
When using pyarrow==21.0.0 the memory usage increases with each iteration: after the first read it is at about 1.5 GiB, and after the 50th read at about 20 GiB. If I run the same test with pyarrow==20.0.0, the memory usage still increases slightly over the iterations, but it stays below 2 GiB after the 50th read.
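One way to separate allocator retention from a genuine leak is to compare the process RSS with the bytes Arrow itself reports as live: if pyarrow.total_allocated_bytes() drops back near zero between reads while the RSS keeps climbing, the memory is being held by the allocator rather than by live Arrow buffers. A minimal sketch using psutil (already in the environment above); the read_and_report helper is illustrative, and path is the dataset directory from the reproducer:

import psutil
import pyarrow
import pyarrow.dataset

def read_and_report(path, iteration):
    # Read the dataset and immediately drop the resulting table.
    table = pyarrow.dataset.dataset(path).to_table()
    del table
    # RSS = memory the OS has given the process;
    # total_allocated_bytes() = bytes currently held in live Arrow buffers.
    rss_gib = psutil.Process().memory_info().rss / 2**30
    arrow_gib = pyarrow.total_allocated_bytes() / 2**30
    print(f"iteration {iteration}: rss={rss_gib:.2f} GiB, arrow={arrow_gib:.2f} GiB")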
Debugging
I ran a git bisect and identified #45979 as the change point. Building the 21.0.0 release commit with ARROW_MIMALLOC=OFF also makes the problem go away.
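The mimalloc connection can also be checked without rebuilding: the default memory pool can be switched at runtime, or by setting ARROW_DEFAULT_MEMORY_POOL=system in the environment before pyarrow is imported. A minimal sketch; whether this exactly mirrors a build with ARROW_MIMALLOC=OFF is an assumption, since that flag removes the backend entirely rather than just deselecting it:

import pyarrow

# Route Arrow allocations through the plain system allocator
# (malloc/free) instead of the default mimalloc pool.
pyarrow.set_memory_pool(pyarrow.system_memory_pool())
print(pyarrow.default_memory_pool().backend_name)  # expected: "system"

With this in place, rerunning the reproducer above should show whether the growth tracks the mimalloc pool specifically.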
Component(s)
C++, Python