Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: arrow backed log replay and table state #2037

Merged
merged 32 commits into from
Jan 23, 2024
Merged

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Jan 5, 2024

Description

This is still very much a work in progress, opening it up for visibility and discussion.

Finally I do hope that we can make the switch to arrow based log handling. Aside from hopefully advantages in the memory footprint, I also believe it opens us up to many future optimizations as well.

To make the transition we introduce two new structs

  • Snapshot - a half lazy version of the Snapshot, which only tries to get Protocol & Metadata actions ASAP. Of course these drive all our planning activities and without them there is not much we can do.
  • EagerSnapshot - An intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility laver for the current DeltaTable APIs.

One conceptually larger change is related to how we view the availability of information. Up until now DeltaTableState could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the DeltaTable, which now only carries the table configuration and the LogStore. the state / snapshot is now optional. Consequently all code that works against a snapshot no longer needs to handle that matadata / schema etc may not be available.

This also has implications for the datafusion integration. We already are working against snapshots mostly, but should abolish most traits implemented for DeltaTable as this does not provide the information (and never has) that is al least required to execute a query.

Some larger notable changes include:

  • remove DeltaTableMetadata and always use Metadata action.
  • arrow and parquet are now required, as such the features got removed. Personalyl I would also argue, that if you cannot read checkpoints, you cannot read delta tables :). - so hopefully users weren't using arrow-free versions.

Major follow-ups:

  • (pre-0.17) review integration with log_store and object_store. Currently we make use mostly of ObjectStore inside the state handling. What we really use is head / list_from / get - my hope would be that we end up with a single abstraction...
  • test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. SInce we have the deltalake-test crate now, this can be reconciled.
  • ...
  • do more processing on borrowed data ...
  • perform file-heavy operations on arrow data
  • update checkpoint writing to leverage new state handling and arrow ...
  • switch to exposing URL in public APIs

Questions

  • should paths be percent-encoded when written to checkpoint?

Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

Documentation

@github-actions github-actions bot added binding/rust Issues for the Rust crate crate/core labels Jan 5, 2024
@roeap roeap force-pushed the log-replay branch 3 times, most recently from 7c0c7e0 to 25a2ca0 Compare January 5, 2024 20:43
@Blajda
Copy link
Collaborator

Blajda commented Jan 6, 2024

@roeap This looks great! I initial had concerns about on how to update from one snapshot to another and the potential costs. Assuming checkpoints are written every 10 commits, the total of the overall LogSegment structure will be fairly small so the cost of cloning a segment will be relatively cheap.

Seems like the primary trade off with this approach is a significant redaction in overall memory footprint but it comes with a potential of additional reads to the underlying store.


/// A snapshot of a Delta table
pub struct Snapshot {
log_segment: LogSegment,
Copy link
Collaborator

@Blajda Blajda Jan 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will there be any benefit from wrapping log_segment in an Arc? If this will replace the table state struct then it be cloned a bit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shall see 😆.

IIRC, we tried to avoid cloning the table whenever possible b/c can be quite large, but then eventually caved and made it Clone. A goal is to end up in a place where we can do guilt-free clones and (de-)serialization of the DeltaTable and / or Snapshot. Whether we are already getting there in this PR is another question 😆.

@roeap
Copy link
Collaborator Author

roeap commented Jan 6, 2024

Seems like the primary trade off with this approach is a significant redaction in overall memory footprint but it comes with a potential of additional reads to the underlying store.

Exactly @Blajda. Especially for the non-query cases (vacuum etc) we may see more reads, and if not using the EagerSnapshot we may also do some duplicate reading. In this PR we are not yet leveraging some more optimization that can be done with regards to that.

As you mentioned, if users take care of their tables (e.g. checkpoints...) then reading from parquet is the most interesting thing. Thus, the main thing I want to do as a immediate follow up, is to introduce some optimizations for parquet reads, which I hope will come in handy for the work in #2006 - this can make reads from checkpoint files very efficient.

Recently I also stumbled across the fact, that datafusion now has build in capabilities for caching as well, which I hope we can explore down the line. The best strategy for us here may just be to introduce a thin caching layer in the log / object store to make those additional reads potentially irrelevant.

@github-actions github-actions bot added binding/python Issues for the Python package delta-inspect labels Jan 6, 2024
@roeap roeap marked this pull request as ready for review January 16, 2024 22:55
Copy link
Collaborator

@Blajda Blajda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a very high quality PR and the changes made to operations makes sense to me.

where
V: SeqAccess<'de>,
{
println!("eager: {:?}", "start");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to remove a print :)

@@ -46,7 +45,7 @@ pub(crate) struct DataArrowWriter {
writer_properties: WriterProperties,
buffer: ShareableBuffer,
arrow_writer: ArrowWriter<ShareableBuffer>,
partition_values: HashMap<String, Option<String>>,
partition_values: BTreeMap<String, Scalar>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the change from a HashMap to BTree?

Copy link
Collaborator Author

@roeap roeap Jan 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the partition values, ordering matters in several places. Previously we were using a combination of the partition columns on metadata and the partition values HashMap. With this change we can rely on the insertion order and no longer need to track the values vector.

@rtyler rtyler enabled auto-merge (rebase) January 23, 2024 02:10
@rtyler rtyler disabled auto-merge January 23, 2024 04:39
@rtyler
Copy link
Member

rtyler commented Jan 23, 2024

I am going to merge this because we need it merged, main is already broken 😒 So the repair test will need to get fixed shortly

@rtyler rtyler merged commit 1a984ce into delta-io:main Jan 23, 2024
21 of 23 checks passed
@roeap roeap deleted the log-replay branch January 23, 2024 21:10
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this pull request Feb 2, 2024
# Description

This is still very much a work in progress, opening it up for visibility
and discussion.

Finally I do hope that we can make the switch to arrow based log
handling. Aside from hopefully advantages in the memory footprint, I
also believe it opens us up to many future optimizations as well.

To make the transition we introduce two new structs 

- `Snapshot` - a half lazy version of the Snapshot, which only tries to
get `Protocol` & `Metadata` actions ASAP. Of course these drive all our
planning activities and without them there is not much we can do.
- `EagerSnapshot` - An intermediary structure, which eagerly loads file
actions and does log replay to serve as a compatibility laver for the
current `DeltaTable` APIs.

One conceptually larger change is related to how we view the
availability of information. Up until now `DeltaTableState` could be
initialized empty, containing no useful information for any code to work
with. State (snapshots) now always needs to be created valid. The thing
that may not yet be initialized is the `DeltaTable`, which now only
carries the table configuration and the `LogStore`. the state / snapshot
is now optional. Consequently all code that works against a snapshot no
longer needs to handle that matadata / schema etc may not be available.

This also has implications for the datafusion integration. We already
are working against snapshots mostly, but should abolish most traits
implemented for `DeltaTable` as this does not provide the information
(and never has) that is al least required to execute a query.

Some larger notable changes include:

* remove `DeltaTableMetadata` and always use `Metadata` action.
* arrow and parquet are now required, as such the features got removed.
Personalyl I would also argue, that if you cannot read checkpoints, you
cannot read delta tables :). - so hopefully users weren't using
arrow-free versions.

### Major follow-ups:

* (pre-0.17) review integration with `log_store` and `object_store`.
Currently we make use mostly of `ObjectStore` inside the state handling.
What we really use is `head` / `list_from` / `get` - my hope would be
that we end up with a single abstraction...
* test cleanup - we are currently dealing with test flakiness and have
several approaches to scaffolding tests. SInce we have the
`deltalake-test` crate now, this can be reconciled.
* ...
* do more processing on borrowed data ...
* perform file-heavy operations on arrow data
* update checkpoint writing to leverage new state handling and arrow ...
* switch to exposing URL in public APIs

## Questions

* should paths be percent-encoded when written to checkpoint?

# Related Issue(s)

supersedes: delta-io#454
supersedes: delta-io#1837
closes: delta-io#1776
closes: delta-io#425 (should also be addressed in the current implementation)
closes: delta-io#288 (multi-part checkpoints are deprecated)
related: delta-io#435

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: R. Tyler Croy <[email protected]>
rtyler added a commit to rtyler/delta-rs that referenced this pull request Aug 13, 2024
I actually used this functionality 😄

It was removed in delta-io#2037 but I think it's handy and worth keeping around
:shrug:
github-merge-queue bot pushed a commit that referenced this pull request Aug 13, 2024
I actually used this functionality 😄

It was removed in #2037 but I think it's handy and worth keeping around
:shrug:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core delta-inspect
Projects
None yet
4 participants