The current implementation of snapshot creation in the Delta Lake kernel produces an incorrect snapshot when the requested version is earlier than the latest checkpoint. The try_new method appears to succeed and returns a snapshot with the correct version number, but that snapshot does not represent the table state at the requested point in time.
Why it happens:
The root cause of this issue lies in how checkpoint and log file selection are handled during snapshot creation. The try_new method in Snapshot always uses the latest checkpoint metadata to list log files, regardless of the requested version. This approach leads to the following problems:
Incorrect checkpoint usage: The method uses a checkpoint newer than the requested version, leading to an incorrect starting point for the snapshot.
Improper file selection: It may include files from versions newer than the requested one or exclude files that should be part of the earlier version.
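The effect can be shown with a toy model of log replay. This is only a sketch, not the kernel's code: the Checkpoint and Commit types and the replay function are hypothetical, and each commit is assumed to add exactly one file.

```rust
use std::collections::BTreeSet;

/// Toy checkpoint (hypothetical type): the full set of live files at `version`.
struct Checkpoint {
    version: u64,
    files: BTreeSet<&'static str>,
}

/// Toy commit (hypothetical type): adds a single file at `version`.
struct Commit {
    version: u64,
    added_file: &'static str,
}

/// Rebuilds the file set for `requested` by starting from the given checkpoint
/// (or from an empty table) and applying the commits that come after the
/// checkpoint and are not newer than `requested`.
fn replay(
    checkpoint: Option<&Checkpoint>,
    commits: &[Commit],
    requested: u64,
) -> BTreeSet<&'static str> {
    let (start, mut files) = match checkpoint {
        Some(cp) => (Some(cp.version), cp.files.clone()),
        None => (None, BTreeSet::new()),
    };
    for commit in commits {
        let after_checkpoint = start.map_or(true, |s| commit.version > s);
        if after_checkpoint && commit.version <= requested {
            files.insert(commit.added_file);
        }
    }
    files
}
```

In this model, replaying version 1 from a checkpoint taken at version 2 returns the checkpoint's file set unchanged, including the file added at version 2, even though the result is labelled as version 1.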
Impact:
This issue renders time travel queries and historical data analysis non-functional. Users attempting to view or analyze the table state at a specific point in the past will receive incorrect results, effectively breaking a key feature of Delta Lake.
Reproduction:
PR #322 provides a minimal test case demonstrating this issue. The test creates a snapshot for version 10, which is earlier than the latest checkpoint; the test failure can be seen in that PR.
Potential Solution:
To address this issue, the snapshot creation logic needs to be modified. Here's a high-level approach:
In the try_new method of Snapshot:
a. First, read the latest checkpoint metadata.
b. If a specific version is requested and it's less than the latest checkpoint version:
Read all checkpoint metadata files.
Find the most recent checkpoint whose version is less than or equal to the requested version.
c. Use this appropriate checkpoint to list log files.
Modify list_log_files_with_checkpoint to only include files up to the requested version.
Ensure only changes up to the requested version are applied when reconstructing the snapshot.
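A minimal sketch of the selection logic described in these steps, working on bare version numbers rather than real log paths. The function names select_checkpoint and select_commits are hypothetical and are not part of the kernel's API; they only illustrate the intended bounds.

```rust
/// Hypothetical helper: picks the newest checkpoint version that is not newer
/// than the requested version. With no requested version, the newest
/// checkpoint overall is used.
fn select_checkpoint(checkpoint_versions: &[u64], requested: Option<u64>) -> Option<u64> {
    checkpoint_versions
        .iter()
        .copied()
        .filter(|&v| requested.map_or(true, |r| v <= r))
        .max()
}

/// Hypothetical helper: keeps only the commit versions that come after the
/// chosen checkpoint and are not newer than the requested version, in
/// ascending order.
fn select_commits(
    commit_versions: &[u64],
    checkpoint: Option<u64>,
    requested: Option<u64>,
) -> Vec<u64> {
    let mut selected: Vec<u64> = commit_versions
        .iter()
        .copied()
        .filter(|&v| checkpoint.map_or(true, |c| v > c))
        .filter(|&v| requested.map_or(true, |r| v <= r))
        .collect();
    selected.sort_unstable();
    selected
}
```

For example, with checkpoints at versions 10 and 20 and a requested version of 15, this picks the checkpoint at 10 and the commits 11 through 15. A requested version below the first checkpoint yields no checkpoint, so replay would start from commit 0.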
I'm waiting to hear the thoughts from the maintainers before submitting the fix.