This repository has been archived by the owner on Apr 11, 2024. It is now read-only.

Add chunking logic in read method #56

Open
sunank200 opened this issue Apr 6, 2023 · 1 comment

@sunank200
Collaborator

sunank200 commented Apr 6, 2023

Describe the bug
I tried transferring an 11 GB zip file from S3 to GCS on a worker with 500 MB of memory, and the task was killed because it ran out of memory:

[2023-04-05, 21:03:34 UTC] {local_task_job.py:212} INFO - Task exited with return code Negsignal.SIGKILL

Expected behavior
The read method should load only chunks into memory. Currently, if a folder contains multiple files, each file is loaded into memory in full. For cases where a single file is very large, we need logic that loads only one chunk at a time.
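
To illustrate the expected behavior, here is a minimal sketch of a chunked read. It is not the project's actual read method. It assumes both the source and the destination can be opened as binary file-like objects, and `read_in_chunks`, `copy_stream`, and the 8 MiB `CHUNK_SIZE` are hypothetical names and values picked for the example.

```python
import io
from typing import BinaryIO, Iterator

# Hypothetical default: 8 MiB per chunk, small enough for a 500 MB worker.
CHUNK_SIZE = 8 * 1024 * 1024


def read_in_chunks(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks so that only one chunk is held in memory at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        yield chunk


def copy_stream(source: BinaryIO, destination: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy source to destination chunk by chunk and return the bytes written."""
    total = 0
    for chunk in read_in_chunks(source, chunk_size):
        destination.write(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory stand-ins for the real source and destination streams.
    src = io.BytesIO(b"0123456789")
    dst = io.BytesIO()
    copied = copy_stream(src, dst, chunk_size=4)
    print(copied, dst.getvalue())
```

With this shape, peak memory is bounded by roughly `chunk_size` rather than by the size of the file, which is the property the 11 GB case needs.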

@kaxil
Collaborator

kaxil commented Apr 6, 2023

> Currently, if there are multiple files in a folder each file is loaded into memory

Yeah, the entire file shouldn't be loaded into memory. That can be one of the options, but not the only one.

Flow (from fastest path to slowest):

  1. Native path
  2. Stream lines/bytes from source to destination via the worker
  3. "naive" path - where we download all files from source to worker and then upload from worker to destination
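
One way to read the flow above is as an ordered fallback: try the fastest path first and move to the next one only when a path cannot handle the given source and destination. The sketch below shows that idea only. `PathNotSupported`, `transfer`, and the path callables are hypothetical and not part of the existing codebase.

```python
from typing import Callable, Sequence, Tuple


class PathNotSupported(Exception):
    """Raised by a transfer path that cannot handle the source/destination pair."""


# A transfer path takes (source, destination) and raises PathNotSupported
# when it does not apply. Both the type and the convention are assumptions.
TransferPath = Callable[[str, str], None]


def transfer(
    source: str,
    destination: str,
    paths: Sequence[Tuple[str, TransferPath]],
) -> str:
    """Try each path in the given order; return the name of the first that works."""
    for name, path in paths:
        try:
            path(source, destination)
        except PathNotSupported:
            continue  # fall through to the next, slower path
        return name
    raise PathNotSupported(f"no transfer path available for {source} -> {destination}")
```

The caller would pass the paths in the order listed above (native, streaming, naive), so the naive download-then-upload path is used only as a last resort.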
