
Z-Order with larger dataset resulting in memory error #2284

Closed

pyjads opened this issue Mar 13, 2024 · 4 comments
Labels
bug Something isn't working

Comments

pyjads commented Mar 13, 2024

Environment

Windows (8 GB RAM)

Delta-rs version: 0.16.0

Bug

What happened:

from datetime import timedelta

from deltalake import DeltaTable

# commit at most once every 60 seconds while optimizing
delta = timedelta(seconds=60)

dt = DeltaTable("path/to/table")  # placeholder path; not given in the report

dt.optimize.z_order(
    ["user_id", "product"],
    max_spill_size=4194304000,
    min_commit_interval=delta,
    max_concurrent_tasks=1,
)

I am trying to execute z-order on partitioned data. There are 65 partitions, and each partition contains approx. 900 MB of data split across approx. 16 Parquet files of approx. 55 MB each. It results in the following error:

DeltaError: Failed to parse parquet: Parquet error: Z-order failed while scanning data: ResourcesExhausted("Failed to allocate additional 403718240 bytes for ExternalSorter[2] with 0 bytes already allocated - maximum available is 381425355").

I am new to deltalake and don't have much knowledge of how z_order works. Is it due to the large amount of data? I am trying to run it on my local laptop with limited resources.

pyjads added the bug label Mar 13, 2024
ion-elgreco (Collaborator) commented Mar 24, 2024

@pyjads Since you have a partitioned table, you can run optimize.z_order on each partition separately. You can use the partition_filters parameter for that, as in the sketch below.
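A minimal sketch of that approach. The partition column name ("date") and value ("2024-01-01") are placeholders, since the actual partition columns are not shown in this issue:

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # placeholder path

# Z-order one partition at a time so only that partition's data
# has to be scanned and sorted, bounding memory usage.
dt.optimize.z_order(
    ["user_id", "product"],
    partition_filters=[("date", "=", "2024-01-01")],
)

Repeating this call once per partition value covers the whole table while keeping each run's working set small.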

ion-elgreco closed this as not planned Mar 24, 2024
adriangb (Contributor) commented May 12, 2024

Shouldn't delta-rs automatically be doing the z-order within partitions anyway, since you can't z-order across partitions? And if a partition is too big to fit in memory, shouldn't it spill to disk?

Anecdotally, spilling to disk does not seem to work; unless I set it to a very large value and spill to swap, even a medium-sized table can't be z-ordered.
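Assuming "it" refers to the max_spill_size parameter from the original snippet, the workaround described above would look something like this sketch (the 100 GiB figure is a deliberately oversized placeholder):

from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # placeholder path

# Raise the spill budget far above the table size so the sort is
# allowed to spill rather than fail with ResourcesExhausted.
dt.optimize.z_order(
    ["user_id", "product"],
    max_spill_size=100 * 1024**3,
)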

pyjads (Author) commented Sep 25, 2024

@ion-elgreco Each partition is too big to be loaded completely into memory. How can this be configured to prevent memory errors for large tables?

ion-elgreco (Collaborator) replied:

> @ion-elgreco Each partition is too big to be loaded completely into memory. How can this be configured to prevent memory errors for large tables?

There is a bug in datafusion that prevents this at the moment; I will have to find the related issue for this, though.
