-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec: Add implementation note on current-snapshot-id
#12334
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: Russell Spitzer <[email protected]>
* Add spec notes for snapshot id assignment * Add launage for recommended generation of snapshot id --------- Co-authored-by: Fokko Driesprong <[email protected]>
|
||
The reference implementation uses a type 4 uuid and XORs the 4 most significant bytes with the 4 least significant bytes then ANDs with the maximum long value to arrive at a pseudo-random snapshot id with a low probability of collision. | ||
|
||
The Java implementation writes `-1` for "no current snapshot" with V1 and V2 tables and considers this equivalent to omitted or `null`. This has never been formalized in the spec but for compatibility, other implementations can accept `-1` as `null`. The Java implementation will no longer write `-1` and will use `null` for "no current snapshot" for all tables with a version greater than or equal to V3. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a little bit confusing to see The reference implementation
and The Java implementation
at the same time. What about consolidating them with The Java reference implementation
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, maybe we should just replace reference implamentation with java implementation?
@@ -1754,6 +1754,14 @@ Snapshot summary can include metrics fields to track numeric stats of the snapsh | |||
| **`engine-name`** | "spark" | Name of the engine that created the snapshot | | |||
| **`engine-version`** | "3.5.4" | Version of the engine that created the snapshot | | |||
|
|||
### Assignment of Snapshot IDs and `current-snapshot-id` | |||
|
|||
Writers should produce positive values for snapshot ids in a manner that minimizes the probability of id collisions and should verify the id does not conflict with existing snapshots. Producing snapshot ids based on timestamps alone is not recommended as it increases the potential for collisions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is generating a positive value a requirement here? It seems that it might be more of than an implementation note that "-1" should be reserved because it has special meaning prior to V3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's nothing in the spec that that explicitly prohibits negative values, but the approach by the reference implementation intentionally avoids negative snapshot ids. There is commentary on this in #57 and #2506, so I think there is an expectation even if not explicitly called out.
It's not entirely clear why other than negative snapshots is somewhat awkward. Originally, it was just System.currentTimeMillis()
(which resulted in collisions), so I think just historically positive.
See for context: https://lists.apache.org/thread/gqqsnww6nqc50pddwn29blzghmb0m0h3