-
Notifications
You must be signed in to change notification settings - Fork 414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: arrow backed log replay and table state #2037
Conversation
7c0c7e0
to
25a2ca0
Compare
@roeap This looks great! I initial had concerns about on how to update from one snapshot to another and the potential costs. Assuming checkpoints are written every 10 commits, the total of the overall Seems like the primary trade off with this approach is a significant redaction in overall memory footprint but it comes with a potential of additional reads to the underlying store. |
|
||
/// A snapshot of a Delta table | ||
pub struct Snapshot { | ||
log_segment: LogSegment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be any benefit from wrapping log_segment
in an Arc? If this will replace the table state struct then it be cloned a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shall see 😆.
IIRC, we tried to avoid cloning the table whenever possible b/c can be quite large, but then eventually caved and made it Clone
. A goal is to end up in a place where we can do guilt-free clones and (de-)serialization of the DeltaTable
and / or Snapshot
. Whether we are already getting there in this PR is another question 😆.
Exactly @Blajda. Especially for the non-query cases (vacuum etc) we may see more reads, and if not using the As you mentioned, if users take care of their tables (e.g. checkpoints...) then reading from parquet is the most interesting thing. Thus, the main thing I want to do as a immediate follow up, is to introduce some optimizations for parquet reads, which I hope will come in handy for the work in #2006 - this can make reads from checkpoint files very efficient. Recently I also stumbled across the fact, that datafusion now has build in capabilities for caching as well, which I hope we can explore down the line. The best strategy for us here may just be to introduce a thin caching layer in the log / object store to make those additional reads potentially irrelevant. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall a very high quality PR and the changes made to operations makes sense to me.
where | ||
V: SeqAccess<'de>, | ||
{ | ||
println!("eager: {:?}", "start"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to remove a print :)
@@ -46,7 +45,7 @@ pub(crate) struct DataArrowWriter { | |||
writer_properties: WriterProperties, | |||
buffer: ShareableBuffer, | |||
arrow_writer: ArrowWriter<ShareableBuffer>, | |||
partition_values: HashMap<String, Option<String>>, | |||
partition_values: BTreeMap<String, Scalar>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the change from a HashMap to BTree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the partition values, ordering matters in several places. Previously we were using a combination of the partition columns on metadata and the partition values HashMap. With this change we can rely on the insertion order and no longer need to track the values vector.
I am going to merge this because we need it merged, main is already broken 😒 So the repair test will need to get fixed shortly |
# Description This is still very much a work in progress, opening it up for visibility and discussion. Finally I do hope that we can make the switch to arrow based log handling. Aside from hopefully advantages in the memory footprint, I also believe it opens us up to many future optimizations as well. To make the transition we introduce two new structs - `Snapshot` - a half lazy version of the Snapshot, which only tries to get `Protocol` & `Metadata` actions ASAP. Of course these drive all our planning activities and without them there is not much we can do. - `EagerSnapshot` - An intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility laver for the current `DeltaTable` APIs. One conceptually larger change is related to how we view the availability of information. Up until now `DeltaTableState` could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the `DeltaTable`, which now only carries the table configuration and the `LogStore`. the state / snapshot is now optional. Consequently all code that works against a snapshot no longer needs to handle that matadata / schema etc may not be available. This also has implications for the datafusion integration. We already are working against snapshots mostly, but should abolish most traits implemented for `DeltaTable` as this does not provide the information (and never has) that is al least required to execute a query. Some larger notable changes include: * remove `DeltaTableMetadata` and always use `Metadata` action. * arrow and parquet are now required, as such the features got removed. Personalyl I would also argue, that if you cannot read checkpoints, you cannot read delta tables :). - so hopefully users weren't using arrow-free versions. ### Major follow-ups: * (pre-0.17) review integration with `log_store` and `object_store`. Currently we make use mostly of `ObjectStore` inside the state handling. What we really use is `head` / `list_from` / `get` - my hope would be that we end up with a single abstraction... * test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. SInce we have the `deltalake-test` crate now, this can be reconciled. * ... * do more processing on borrowed data ... * perform file-heavy operations on arrow data * update checkpoint writing to leverage new state handling and arrow ... * switch to exposing URL in public APIs ## Questions * should paths be percent-encoded when written to checkpoint? # Related Issue(s) supersedes: delta-io#454 supersedes: delta-io#1837 closes: delta-io#1776 closes: delta-io#425 (should also be addressed in the current implementation) closes: delta-io#288 (multi-part checkpoints are deprecated) related: delta-io#435 # Documentation <!--- Share links to useful documentation ---> --------- Co-authored-by: R. Tyler Croy <[email protected]>
I actually used this functionality 😄 It was removed in delta-io#2037 but I think it's handy and worth keeping around :shrug:
I actually used this functionality 😄 It was removed in #2037 but I think it's handy and worth keeping around :shrug:
Description
This is still very much a work in progress, opening it up for visibility and discussion.
Finally I do hope that we can make the switch to arrow based log handling. Aside from hopefully advantages in the memory footprint, I also believe it opens us up to many future optimizations as well.
To make the transition we introduce two new structs
Snapshot
- a half lazy version of the Snapshot, which only tries to getProtocol
&Metadata
actions ASAP. Of course these drive all our planning activities and without them there is not much we can do.EagerSnapshot
- An intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility laver for the currentDeltaTable
APIs.One conceptually larger change is related to how we view the availability of information. Up until now
DeltaTableState
could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is theDeltaTable
, which now only carries the table configuration and theLogStore
. the state / snapshot is now optional. Consequently all code that works against a snapshot no longer needs to handle that matadata / schema etc may not be available.This also has implications for the datafusion integration. We already are working against snapshots mostly, but should abolish most traits implemented for
DeltaTable
as this does not provide the information (and never has) that is al least required to execute a query.Some larger notable changes include:
DeltaTableMetadata
and always useMetadata
action.Major follow-ups:
log_store
andobject_store
. Currently we make use mostly ofObjectStore
inside the state handling. What we really use ishead
/list_from
/get
- my hope would be that we end up with a single abstraction...deltalake-test
crate now, this can be reconciled.Questions
Related Issue(s)
supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435
Documentation