[Python] Too much RAM consumption when using take on a memory-mapped table #37766

Open
blackblitz opened this issue Sep 18, 2023 · 3 comments

@blackblitz

Describe the bug, including details regarding any error messages, version, and platform.

I created a random array and wrote it repeatedly to an Arrow IPC file so that the resulting data was too large to fit in RAM. Then I read it back with memory mapping. Slicing it worked without any problem, but when I tried to access rows at an arbitrary list of indices using take, RAM usage kept climbing until the computer hung. The code is as follows (the array length and the number of writes may need to be adjusted for your disk space and RAM size):

import numpy as np
import pyarrow as pa
from pyarrow import feather

rng = np.random.default_rng(1337)
data = rng.normal(size=(1000000,))
table = pa.table({'data': data})
sink = pa.output_stream('data.feather')
schema = pa.schema([('data', pa.float64())])
# write the same 1M-row table repeatedly so the file outgrows RAM
with pa.ipc.new_file(sink, schema) as writer:
    for i in range(1000):
        writer.write_table(table)

# read back with memory mapping and fetch a single row with take()
table = feather.read_table('data.feather', memory_map=True)
print(table.take([0]))
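
For contrast, the slicing path mentioned above can be exercised like this (a small sketch against the file produced by the snippet; the offset is arbitrary):

# slice() is zero-copy on the memory-mapped table, so this is reported
# to work without the RAM blow-up seen with take()
print(table.slice(500_000_000, 5))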

Component(s)

Python

@blackblitz
Author

Has anyone looked into this issue?

@p-a-a-a-trick
Contributor

p-a-a-a-trick commented Oct 18, 2023

Mem: 16 GB

Got it to pass at 1850 write iterations (14.8 GB), and fail (IPython, Killed) at 1900 (15.2 GB). I can look a bit more into this tomorrow.

Pass:

In [1]: import numpy as np
   ...: import pyarrow as pa
   ...: from pyarrow import feather
   ...: 
   ...: rng = np.random.default_rng(1337)
   ...: data = rng.normal(size=(1000000,))
   ...: table = pa.table({'data': data})
   ...: sink = pa.output_stream('data.feather')
   ...: schema = pa.schema([('data', pa.float64())])
   ...: with pa.ipc.new_file(sink, schema) as writer:
   ...:     for i in range(1850):
   ...:         writer.write_table(table)
   ...: 
   ...: table = feather.read_table('data.feather', memory_map=True)
   ...: print(table.take([0]))
pyarrow.Table
data: double
----
data: [[0.03826822283041585]]

Fail:

In [5]: import numpy as np
   ...: import pyarrow as pa
   ...: from pyarrow import feather
   ...: 
   ...: rng = np.random.default_rng(1337)
   ...: data = rng.normal(size=(1000000,))
   ...: table = pa.table({'data': data})
   ...: sink = pa.output_stream('data.feather')
   ...: schema = pa.schema([('data', pa.float64())])
   ...: with pa.ipc.new_file(sink, schema) as writer:
   ...:     for i in range(1900):
   ...:         writer.write_table(table)
   ...: 
   ...: table = feather.read_table('data.feather', memory_map=True)
   ...: print(table.take([0]))
Killed
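
One way to narrow down where the allocation happens (a hedged sketch, not part of the original comment; it reuses the data.feather file from the reproduction above and pyarrow's default memory pool statistics):

import pyarrow as pa
from pyarrow import feather

pool = pa.default_memory_pool()
table = feather.read_table('data.feather', memory_map=True)
print(table.take([0]))
# peak bytes allocated by the pool during take(); a value close to the
# file size would suggest the whole table is materialized rather than
# only the requested rows
print(pool.max_memory())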

@kou kou changed the title Too much RAM consumption when using take on a memory-mapped table [Python] Too much RAM consumption when using take on a memory-mapped table Oct 19, 2023
@felipecrv
Contributor

This might be caused by take not dealing well with chunked arrays

#25822
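
If that is the cause, a possible workaround is to resolve take chunk by chunk so the chunks are never concatenated into one contiguous buffer. A hedged sketch (take_per_chunk is a hypothetical helper, not a pyarrow API):

import pyarrow as pa

def take_per_chunk(chunked, indices):
    # Map each global row index to the chunk that contains it and call
    # take() on that chunk alone, instead of on the whole ChunkedArray.
    starts = []
    total = 0
    for chunk in chunked.chunks:
        starts.append(total)
        total += len(chunk)
    pieces = []
    for i in indices:
        # linear scan for clarity; a binary search would also work
        for chunk, start in zip(chunked.chunks, starts):
            if start <= i < start + len(chunk):
                pieces.append(chunk.take(pa.array([i - start])))
                break
    return pa.concat_arrays(pieces)

# e.g. take_per_chunk(table.column('data'), [0])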
