Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TransactionError::VersionAlreadyExists needs exponential backoff with commit peeking #3066

Open
rtyler opened this issue Dec 17, 2024 · 0 comments · May be fixed by #3067
Open

TransactionError::VersionAlreadyExists needs exponential backoff with commit peeking #3066

rtyler opened this issue Dec 17, 2024 · 0 comments · May be fixed by #3067
Assignees
Labels
binding/rust Issues for the Rust crate bug Something isn't working storage/aws AWS S3 storage related
Milestone

Comments

@rtyler
Copy link
Member

rtyler commented Dec 17, 2024

Environment

Delta-rs version: irrelephant 🐘
Binding: both


Bug

What happened:

For systems with high write throughput, and therefore version contention, conflicting writes can result in VersionAlreadyExists being raised. The retry logic in the PreparedCommit operation immediately and blindly retries the operation with state_version +1 rather than peeking at the next commit version from the table.

This can result in logs such as:

LockClientError::VersionAlreadyExists(47804)
LockClientError::VersionAlreadyExists(47805)
LockClientError::VersionAlreadyExists(47806)
LockClientError::VersionAlreadyExists(47807)
LockClientError::VersionAlreadyExists(47808)
LockClientError::VersionAlreadyExists(47809)
LockClientError::VersionAlreadyExists(47810)
LockClientError::VersionAlreadyExists(47811)
LockClientError::VersionAlreadyExists(47812)
LockClientError::VersionAlreadyExists(47813)
LockClientError::VersionAlreadyExists(47814)
LockClientError::VersionAlreadyExists(47815)
Successfully flushed v47816 to Delta table 

Basically there are two problems:

  • The retry logic is a while true {} without any pause/sleep in between retries, so it's very easy for the retry to hit the max retry limit (hardcoded to 15 for AWS)
  • The retry logic foolishly assumes that the version of the currently loaded state is close enough to the latest version and attempts to create a version number of loaded + 1 which is almost 100% guaranteed to be wrong if there is more than one concurrent writing process. This causes additional unnecessary retries, since the aforementioned while true loop will move the attempted version number up by 1 from the loaded state rather than peeking at the latest commit in the table and attempting a commit with peeked_version + 1

What you expected to happen:

I would expect the retry to take peeked_version + 1 at a bare minimum!

How to reproduce it:

Use more than one concurrent writer 😆

More details:

@rtyler rtyler added bug Something isn't working binding/rust Issues for the Rust crate storage/aws AWS S3 storage related labels Dec 17, 2024
@rtyler rtyler added this to the v0.23 milestone Dec 17, 2024
@rtyler rtyler self-assigned this Dec 17, 2024
rtyler added a commit to rtyler/delta-rs that referenced this issue Dec 18, 2024
There are high-concurrency scenarios where multiple writers may have
committed versions to the log since the current process' state was
loaded. Because of the `snapshot.version() + 1` and retry logic in the
transaction commit operation, this would result in failures if more than
15 versions had been committed since the table state had been loaded.

This approach fetches the latest known version on the table prior to
committing the transaction, and if the latest version has shifted, it
will perform a conflict check with the versions that have since been
committed.

Fixes delta-io#3066

Signed-off-by: R. Tyler Croy <[email protected]>
Sponsored-by: Scribd Inc
rtyler added a commit to rtyler/delta-rs that referenced this issue Dec 19, 2024
There are high-concurrency scenarios where multiple writers may have
committed versions to the log since the current process' state was
loaded. Because of the `snapshot.version() + 1` and retry logic in the
transaction commit operation, this would result in failures if more than
15 versions had been committed since the table state had been loaded.

This approach fetches the latest known version on the table prior to
committing the transaction, and if the latest version has shifted, it
will perform a conflict check with the versions that have since been
committed.

Fixes delta-io#3066

Signed-off-by: R. Tyler Croy <[email protected]>
Sponsored-by: Scribd Inc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working storage/aws AWS S3 storage related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant