Add chunking logic in read method #56

sunank200 · 2023-04-06T06:48:37Z

Describe the bug
I tried transferring an 11 GB file (a zip file) from S3 to GCS on a worker with 500 MB of memory, and the task was killed because it ran out of memory:

[2023-04-05, 21:03:34 UTC] {local_task_job.py:212} INFO - Task exited with return code Negsignal.SIGKILL

Expected behavior
The read method should load only chunks into memory. Currently, if there are multiple files in a folder, each file is loaded into memory in full. For scenarios where a single file is very large, we should have logic that loads only one chunk at a time.
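
To illustrate the expected behavior, here is a minimal sketch of a chunked read, assuming the source can be opened as a binary file-like object. The names `iter_chunks`, `copy_in_chunks`, and `DEFAULT_CHUNK_SIZE` are hypothetical and are not part of the operator's current API.

```python
import io
from typing import BinaryIO, Iterator

# Assumed default; the right value depends on worker memory and the backend.
DEFAULT_CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB


def iter_chunks(stream: BinaryIO, chunk_size: int = DEFAULT_CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks of at most `chunk_size` bytes from `stream`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = DEFAULT_CHUNK_SIZE) -> int:
    """Copy `src` to `dst` one chunk at a time and return the bytes written."""
    total = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    source = io.BytesIO(b"abcdefghij")
    target = io.BytesIO()
    copied = copy_in_chunks(source, target, chunk_size=4)
    print(copied, target.getvalue())
```

With this shape, peak memory is bounded by roughly one chunk rather than by the size of the file, which is what would let a small worker handle a file much larger than its memory.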


kaxil · 2023-04-06T12:07:49Z

Currently, if there are multiple files in a folder each file is loaded into memory

Yeah, the entire file shouldn't be loaded into memory. That can be one of the options, but not the only option.

Flow (from fastest path to slowest):

1. Native path
2. Stream lines/bytes from source to destination via the worker
3. "Naive" path, where we download all files from source to worker and then upload from worker to destination

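The fastest-to-slowest ordering above could be sketched as a simple fallback loop. This is only an illustration: `PathNotSupported`, `transfer`, and the strategy callables are hypothetical names, not existing code in the operator.

```python
from typing import Callable, Sequence, Tuple


class PathNotSupported(Exception):
    """Hypothetical signal that a transfer path cannot handle this source/destination pair."""


Strategy = Callable[[str, str], None]


def transfer(source: str, destination: str, strategies: Sequence[Tuple[str, Strategy]]) -> str:
    """Try each strategy in order (fastest first) and return the name of the one that ran."""
    for name, strategy in strategies:
        try:
            strategy(source, destination)
            return name
        except PathNotSupported:
            # This path does not apply here; fall through to the next, slower one.
            continue
    raise RuntimeError(f"no transfer path available for {source} -> {destination}")
```

A caller would pass the paths in the order listed, for example `[("native", native_transfer), ("stream", stream_transfer), ("naive", naive_transfer)]`, where each callable is an assumed implementation of the corresponding path.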
sunank200 added this to the Universal transfer operator - phase 2 milestone Apr 6, 2023

pankajkoti assigned sunank200 Apr 24, 2023
