Snapshot creation fails for versions earlier than the latest checkpoint #323

Open
hackintoshrao opened this issue Sep 5, 2024 · 0 comments

hackintoshrao commented Sep 5, 2024

The current implementation of snapshot creation in the Delta Lake kernel produces an incorrect snapshot when the requested version is earlier than the latest checkpoint. The try_new method appears to succeed and returns a snapshot with the correct version number, but that snapshot does not represent the table state at the requested version.

Why it happens:
The root cause of this issue lies in how checkpoint and log file selection are handled during snapshot creation. The try_new method in Snapshot always uses the latest checkpoint metadata to list log files, regardless of the requested version. This approach leads to the following problems:

  1. Incorrect checkpoint usage: The method uses a checkpoint newer than the requested version, leading to an incorrect starting point for the snapshot.
  2. Improper file selection: It may include files from versions newer than the requested one or exclude files that should be part of the earlier version.

Impact:
This issue renders time travel queries and historical data analysis non-functional. Users attempting to view or analyze the table state at a specific point in the past will receive incorrect results, effectively breaking a key feature of Delta Lake.

Reproduction:
PR #322 provides a minimal test case demonstrating this issue. The test creates a snapshot for version 10, which is earlier than the latest checkpoint, and fails; the test failure can be seen in that PR.

Potential Solution:
To address this issue, the snapshot creation logic needs to be modified. Here's a high-level approach:

  1. In the try_new method of Snapshot:
    a. First, read the latest checkpoint metadata.
    b. If a specific version is requested and it is less than the latest checkpoint version:
      • Read all checkpoint metadata files.
      • Find the most recent checkpoint whose version is less than or equal to the requested version.
    c. Use this checkpoint to list log files.

  2. Modify list_log_files_with_checkpoint to only include files up to the requested version.

  3. Ensure only changes up to the requested version are applied when reconstructing the snapshot.
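
The selection logic in the steps above can be sketched roughly as follows. This is a minimal illustration, not the kernel's actual code: select_checkpoint and commits_to_replay are hypothetical helper names, and versions are modelled as plain u64 values instead of real log file metadata. It assumes that when no checkpoint exists at or before the requested version, replay starts from the first commit.

```rust
/// Hypothetical helper: pick the newest checkpoint version that does not
/// exceed the requested version. With no requested version, the latest
/// checkpoint is used. Returns None if no suitable checkpoint exists.
fn select_checkpoint(checkpoints: &[u64], requested: Option<u64>) -> Option<u64> {
    checkpoints
        .iter()
        .copied()
        .filter(|&v| requested.map_or(true, |r| v <= r))
        .max()
}

/// Hypothetical helper: commit versions to replay on top of the chosen
/// checkpoint, i.e. strictly after the checkpoint and no later than the
/// requested version. Returned in ascending order.
fn commits_to_replay(
    commits: &[u64],
    checkpoint: Option<u64>,
    requested: Option<u64>,
) -> Vec<u64> {
    let mut selected: Vec<u64> = commits
        .iter()
        .copied()
        .filter(|&v| checkpoint.map_or(true, |c| v > c))
        .filter(|&v| requested.map_or(true, |r| v <= r))
        .collect();
    selected.sort_unstable();
    selected
}

fn main() {
    // Assumed example table: checkpoints at versions 10 and 20, commits 0..=25.
    let checkpoints = [10, 20];
    let commits: Vec<u64> = (0..=25).collect();

    // Request a version that is earlier than the latest checkpoint.
    let requested = Some(15);
    let checkpoint = select_checkpoint(&checkpoints, requested);
    let replay = commits_to_replay(&commits, checkpoint, requested);

    println!("checkpoint: {:?}, commits to replay: {:?}", checkpoint, replay);
}
```

Under these assumptions, a request for a version below the oldest checkpoint falls back to replaying commits from the start of the log, which a real fix would also need to handle.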

I'm waiting to hear the thoughts from the maintainers before submitting the fix.
