[Python] Too much RAM consumption when using take on a memory-mapped table #37766

Open
blackblitz opened this issue Sep 18, 2023 · 3 comments

@blackblitz

Describe the bug, including details regarding any error messages, version, and platform.

I created a random array and wrote it repeatedly to an Arrow IPC file so that the resulting data was too large to fit in RAM. Then I read it back with memory mapping. Slicing it worked without any problem, but when I tried to access rows at an arbitrary list of indices using take, RAM usage kept climbing until the computer hung. The code is as follows (the array length and the number of writes may need to be adjusted for your disk space and RAM size):

import numpy as np
import pyarrow as pa
from pyarrow import feather

rng = np.random.default_rng(1337)
data = rng.normal(size=(1000000,))
table = pa.table({'data': data})
sink = pa.output_stream('data.feather')
schema = pa.schema([('data', pa.float64())])
# write the same 1M-row table repeatedly so the file outgrows RAM
with pa.ipc.new_file(sink, schema) as writer:
    for i in range(1000):
        writer.write_table(table)

# read back with memory mapping and fetch a single row with take()
table = feather.read_table('data.feather', memory_map=True)
print(table.take([0]))
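
For contrast, the slicing path mentioned above can be exercised like this (a small sketch against the file produced by the snippet; the offset is arbitrary):

# slice() is zero-copy on the memory-mapped table, so this is reported
# to work without the RAM blow-up seen with take()
print(table.slice(500_000_000, 5))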

Component(s)

Python

@blackblitz
Author

Has anyone looked into this issue?

@p-a-a-a-trick
Contributor

p-a-a-a-trick commented Oct 18, 2023

Mem: 16 GB

Got it to pass at 1850 write iterations (14.8 GB), and fail (IPython, Killed) at 1900 (15.2 GB). I can look a bit more into this tomorrow.

Pass:

In [1]: import numpy as np
   ...: import pyarrow as pa
   ...: from pyarrow import feather
   ...: 
   ...: rng = np.random.default_rng(1337)
   ...: data = rng.normal(size=(1000000,))
   ...: table = pa.table({'data': data})
   ...: sink = pa.output_stream('data.feather')
   ...: schema = pa.schema([('data', pa.float64())])
   ...: with pa.ipc.new_file(sink, schema) as writer:
   ...:     for i in range(1850):
   ...:         writer.write_table(table)
   ...: 
   ...: table = feather.read_table('data.feather', memory_map=True)
   ...: print(table.take([0]))
pyarrow.Table
data: double
----
data: [[0.03826822283041585]]

Fail:

In [5]: import numpy as np
   ...: import pyarrow as pa
   ...: from pyarrow import feather
   ...: 
   ...: rng = np.random.default_rng(1337)
   ...: data = rng.normal(size=(1000000,))
   ...: table = pa.table({'data': data})
   ...: sink = pa.output_stream('data.feather')
   ...: schema = pa.schema([('data', pa.float64())])
   ...: with pa.ipc.new_file(sink, schema) as writer:
   ...:     for i in range(1900):
   ...:         writer.write_table(table)
   ...: 
   ...: table = feather.read_table('data.feather', memory_map=True)
   ...: print(table.take([0]))
Killed
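
One way to narrow down where the allocation happens (a hedged sketch, not part of the original comment; it reuses the data.feather file from the reproduction above and pyarrow's default memory pool statistics):

import pyarrow as pa
from pyarrow import feather

pool = pa.default_memory_pool()
table = feather.read_table('data.feather', memory_map=True)
print(table.take([0]))
# peak bytes allocated by the pool during take(); a value close to the
# file size would suggest the whole table is materialized rather than
# only the requested rows
print(pool.max_memory())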

@kou kou changed the title Too much RAM consumption when using take on a memory-mapped table [Python] Too much RAM consumption when using take on a memory-mapped table Oct 19, 2023
@felipecrv
Contributor

This might be caused by take not dealing well with chunked arrays

#25822
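
If that is the cause, a possible workaround is to resolve take chunk by chunk so the chunks are never concatenated into one contiguous buffer. A hedged sketch (take_per_chunk is a hypothetical helper, not a pyarrow API):

import pyarrow as pa

def take_per_chunk(chunked, indices):
    # Map each global row index to the chunk that contains it and call
    # take() on that chunk alone, instead of on the whole ChunkedArray.
    starts = []
    total = 0
    for chunk in chunked.chunks:
        starts.append(total)
        total += len(chunk)
    pieces = []
    for i in indices:
        # linear scan for clarity; a binary search would also work
        for chunk, start in zip(chunked.chunks, starts):
            if start <= i < start + len(chunk):
                pieces.append(chunk.take(pa.array([i - start])))
                break
    return pa.concat_arrays(pieces)

# e.g. take_per_chunk(table.column('data'), [0])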
