Do not load full source into RAM on write_to_deltalake #2255

Open
aersam opened this issue Mar 6, 2024 · 5 comments · May be fixed by #2289
Labels
binding/python Issues for the Python package enhancement New feature or request

Comments

@aersam
Contributor

aersam commented Mar 6, 2024

Description

In python/lib.rs, the first thing that happens in write_to_deltalake is that the batches are collected into a Vec. This loads all RecordBatches into RAM, no? That doesn't seem like a good thing to me. I think the main reason is that write.rs tries to get the schema from the batches, but the schema is already known in Python anyway, so why not pass it directly?
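
To illustrate the concern, here is a minimal sketch in plain Python (not the actual delta-rs code). `Schema`, `Batch`, `make_batches`, `write_eager`, and `write_streaming` are all hypothetical names made up for this example; batches are modelled as lists of dicts rather than Arrow RecordBatches. It contrasts collecting every batch first in order to infer the schema with passing the schema in and consuming the batches one at a time.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Schema = Tuple[str, ...]      # hypothetical stand-in for an Arrow schema
Batch = List[Dict[str, int]]  # hypothetical stand-in for a RecordBatch


def make_batches(n_batches: int, rows_per_batch: int) -> Iterator[Batch]:
    """Lazily yield batches; nothing is materialized until it is requested."""
    for b in range(n_batches):
        start = b * rows_per_batch
        yield [{"id": i, "value": i * 10} for i in range(start, start + rows_per_batch)]


def write_eager(batches: Iterable[Batch]) -> Tuple[Schema, int, int]:
    """Collect all batches, then infer the schema from the first row."""
    collected = list(batches)  # every batch is resident in memory at once
    schema = tuple(collected[0][0].keys()) if collected and collected[0] else ()
    rows = sum(len(batch) for batch in collected)
    return schema, rows, len(collected)  # last value: batches held at once


def write_streaming(schema: Schema, batches: Iterable[Batch]) -> Tuple[Schema, int, int]:
    """Take the schema as an argument and consume batches one by one."""
    rows = 0
    peak = 0
    for batch in batches:  # only the current batch is held here
        peak = max(peak, 1)
        rows += len(batch)
    return schema, rows, peak
```

Both functions see the same rows, but the eager variant holds all batches simultaneously while the streaming variant never holds more than one, because it does not need to look at the data to learn the schema.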

Use Case
I don't want to waste resources ;)

Related Issue(s)

@aersam aersam added the enhancement New feature or request label Mar 6, 2024
@ion-elgreco
Collaborator

ion-elgreco commented Mar 6, 2024

@aersam correct, it's not an efficient way to do that :) Will already mentioned an improvement for this, which I've logged in #1984. No one is working on it yet, so if you want to pick it up, feel free :D

@aersam
Contributor Author

aersam commented Mar 6, 2024

I can pick it up, but I'd rather do it in the write.rs operation.

@ion-elgreco
Collaborator

@aersam that's fine!

@aersam
Contributor Author

aersam commented Mar 6, 2024

Ok, I see partitioning makes this quite complicated 🙂 And DataFusion's MemoryExec is not helpful here, so this might take some time.

@aersam
Contributor Author

aersam commented Mar 7, 2024

I'll just implement it using chunks. This is not perfect, but it should work and is not as invasive as rewriting the whole partitioning logic.
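
As a rough sketch of what a chunked approach could look like (an assumption for illustration, not the code from the linked PR), the plain-Python example below reads a fixed number of batches at a time, partitions only that chunk, and writes one file per partition per chunk. `chunked`, `write_in_chunks`, `partition_key`, and `chunk_size` are hypothetical names, and a "file" is represented only as a `(partition_value, row_count)` tuple.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Batch = List[Dict[str, int]]  # hypothetical stand-in for a RecordBatch


def chunked(batches: Iterable[Batch], chunk_size: int) -> Iterator[List[Batch]]:
    """Yield lists of at most chunk_size batches from a lazy source."""
    it = iter(batches)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def write_in_chunks(
    batches: Iterable[Batch], partition_key: str, chunk_size: int
) -> List[Tuple[int, int]]:
    """Partition and 'write' one chunk at a time.

    Returns one (partition_value, row_count) entry per file written.
    """
    files: List[Tuple[int, int]] = []
    for chunk in chunked(batches, chunk_size):
        # Only the rows of the current chunk are grouped in memory.
        by_partition: Dict[int, List[Dict[str, int]]] = defaultdict(list)
        for batch in chunk:
            for row in batch:
                by_partition[row[partition_key]].append(row)
        for value in sorted(by_partition):
            files.append((value, len(by_partition[value])))
    return files
```

The trade-off is visible in the result: memory use is bounded by the chunk size, but a partition whose rows are spread over several chunks ends up with several smaller files instead of one, which is presumably part of why this is "not perfect".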

@rtyler rtyler added this to the Rust v1.0.0 milestone Nov 3, 2024
@rtyler rtyler self-assigned this Nov 3, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Nov 3, 2024
@rtyler rtyler modified the milestones: Rust v1.0.0, v0.23 Dec 1, 2024
3 participants