feat: arrow backed log replay and table state #2037

roeap · 2024-01-05T18:35:06Z

Description

This is still very much a work in progress, opening it up for visibility and discussion.

I do hope that we can finally make the switch to arrow-based log handling. Aside from the hoped-for advantages in memory footprint, I believe it also opens us up to many future optimizations.

To make the transition we introduce two new structs:

Snapshot - a half-lazy version of the snapshot, which only tries to get the Protocol & Metadata actions ASAP. These drive all our planning activities, and without them there is not much we can do.
EagerSnapshot - an intermediary structure, which eagerly loads file actions and does log replay to serve as a compatibility layer for the current DeltaTable APIs.
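
To make "log replay" concrete, here is a minimal sketch of the reconciliation step on plain Rust values. It is not the arrow-based implementation from this PR: the `Action` enum and `replay_log` function are hypothetical stand-ins, and the sketch assumes a commit never adds and removes the same path at once. Commits are visited newest first, and the first action seen for a path decides whether that file is still live.

```rust
use std::collections::HashSet;

/// Simplified stand-ins for Delta file actions (hypothetical, not the delta-rs types).
#[derive(Debug, Clone, PartialEq)]
enum Action {
    Add { path: String },
    Remove { path: String },
}

/// Replay commits from newest to oldest and return the paths of live files.
/// The newest action for a path wins; older actions for that path are ignored.
fn replay_log(commits_newest_first: &[Vec<Action>]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut live = Vec::new();
    for commit in commits_newest_first {
        for action in commit {
            match action {
                Action::Add { path } => {
                    // `insert` returns true only the first time a path is seen.
                    if seen.insert(path.as_str()) {
                        live.push(path.clone());
                    }
                }
                Action::Remove { path } => {
                    // A newer remove shadows any older add of the same path.
                    seen.insert(path.as_str());
                }
            }
        }
    }
    live
}

fn main() {
    let commits = vec![
        // version 2 (newest): replaces a.parquet with c.parquet
        vec![
            Action::Remove { path: "a.parquet".to_string() },
            Action::Add { path: "c.parquet".to_string() },
        ],
        // version 1
        vec![Action::Add { path: "b.parquet".to_string() }],
        // version 0 (oldest)
        vec![Action::Add { path: "a.parquet".to_string() }],
    ];
    println!("{:?}", replay_log(&commits));
}
```

In this toy log only `c.parquet` and `b.parquet` survive, because the newer remove of `a.parquet` is seen before its original add.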

One conceptually larger change is related to how we view the availability of information. Up until now DeltaTableState could be initialized empty, containing no useful information for any code to work with. State (snapshots) now always needs to be created valid. The thing that may not yet be initialized is the DeltaTable, which now only carries the table configuration and the LogStore. The state / snapshot is now optional. Consequently, all code that works against a snapshot no longer needs to handle the case that metadata / schema etc. may not be available.

This also has implications for the datafusion integration. We are already mostly working against snapshots, but should abolish most traits implemented for DeltaTable, as it does not provide (and never has) the information that is at least required to execute a query.
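
As a rough illustration of that split, the sketch below keeps the "may not be loaded" case on the table and lets everything downstream take a snapshot that is valid by construction. All names, fields, and signatures here are hypothetical simplifications, not the actual delta-rs definitions.

```rust
/// Hypothetical, simplified stand-ins; names and fields are illustrative only.
#[derive(Debug, Clone)]
struct Metadata {
    schema_string: String,
}

#[derive(Debug, Clone)]
struct Snapshot {
    version: i64,
    metadata: Metadata,
}

impl Snapshot {
    /// A snapshot can only be built with its metadata present.
    fn new(version: i64, metadata: Metadata) -> Self {
        Snapshot { version, metadata }
    }

    /// No Option / Result needed: the schema is always available.
    fn schema(&self) -> &str {
        &self.metadata.schema_string
    }
}

struct DeltaTable {
    table_uri: String,
    /// The only place where "not yet loaded" has to be handled.
    state: Option<Snapshot>,
}

impl DeltaTable {
    fn new(table_uri: &str) -> Self {
        DeltaTable { table_uri: table_uri.to_string(), state: None }
    }

    fn load(&mut self, snapshot: Snapshot) {
        self.state = Some(snapshot);
    }

    fn snapshot(&self) -> Result<&Snapshot, String> {
        self.state
            .as_ref()
            .ok_or_else(|| format!("table {} is not initialized", self.table_uri))
    }
}

/// Code working against a snapshot does not deal with missing metadata.
fn plan_scan(snapshot: &Snapshot) -> String {
    format!("scan v{} with schema {}", snapshot.version, snapshot.schema())
}

fn main() {
    let mut table = DeltaTable::new("memory://example");
    println!("{:?}", table.snapshot().map(plan_scan));
    table.load(Snapshot::new(3, Metadata { schema_string: "id: long".to_string() }));
    println!("{:?}", table.snapshot().map(plan_scan));
}
```

The check for an uninitialized table happens once, at `DeltaTable::snapshot`, instead of in every function that needs metadata or a schema.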

Some larger notable changes include:

remove DeltaTableMetadata and always use the Metadata action.
arrow and parquet are now required, so the corresponding features got removed. Personally I would also argue that if you cannot read checkpoints, you cannot read delta tables :) - so hopefully users weren't using arrow-free versions.

Major follow-ups:

(pre-0.17) review integration with log_store and object_store. Currently we mostly make use of ObjectStore inside the state handling. What we really use is head / list_from / get - my hope would be that we end up with a single abstraction...
test cleanup - we are currently dealing with test flakiness and have several approaches to scaffolding tests. Since we have the deltalake-test crate now, this can be reconciled.
...
do more processing on borrowed data ...
perform file-heavy operations on arrow data
update checkpoint writing to leverage new state handling and arrow ...
switch to exposing URL in public APIs

Questions

should paths be percent-encoded when written to checkpoint?

Related Issue(s)

supersedes: #454
supersedes: #1837
closes: #1776
closes: #425 (should also be addressed in the current implementation)
closes: #288 (multi-part checkpoints are deprecated)
related: #435

Documentation

Blajda · 2024-01-06T03:11:38Z

@roeap This looks great! I initially had concerns about how to update from one snapshot to another and the potential costs. Assuming checkpoints are written every 10 commits, the total size of the overall LogSegment structure will be fairly small, so cloning a segment will be relatively cheap.

Seems like the primary trade-off with this approach is a significant reduction in overall memory footprint, but it comes with the potential for additional reads to the underlying store.

Blajda · 2024-01-06T03:14:48Z

crates/deltalake-core/src/kernel/snapshot/mod.rs

+
+/// A snapshot of a Delta table
+pub struct Snapshot {
+    log_segment: LogSegment,


Will there be any benefit from wrapping log_segment in an Arc? If this will replace the table state struct then it will be cloned a bit.

We shall see 😆.

IIRC, we tried to avoid cloning the table whenever possible b/c it can be quite large, but then eventually caved and made it Clone. A goal is to end up in a place where we can do guilt-free clones and (de-)serialization of the DeltaTable and / or Snapshot. Whether we are already getting there in this PR is another question 😆.
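
For reference, a minimal sketch of what the Arc suggestion would look like. The field layout of `LogSegment` is a hypothetical simplification; only the `log_segment` field name comes from the diff above. With an `Arc`, cloning the snapshot bumps a reference count instead of copying the segment's contents.

```rust
use std::sync::Arc;

/// Hypothetical, simplified log segment; the real struct has different fields.
#[derive(Debug)]
struct LogSegment {
    version: i64,
    commit_files: Vec<String>,
}

/// Snapshot sharing its log segment behind an Arc.
#[derive(Debug, Clone)]
struct Snapshot {
    log_segment: Arc<LogSegment>,
}

fn main() {
    let original = Snapshot {
        log_segment: Arc::new(LogSegment {
            version: 12,
            commit_files: vec![
                "00000000000000000011.json".to_string(),
                "00000000000000000012.json".to_string(),
            ],
        }),
    };

    // Cloning only increments the reference count; the Vec is not copied.
    let copy = original.clone();

    println!(
        "version {} with {} commit files, shared: {}, owners: {}",
        copy.log_segment.version,
        copy.log_segment.commit_files.len(),
        Arc::ptr_eq(&original.log_segment, &copy.log_segment),
        Arc::strong_count(&original.log_segment),
    );
}
```

The trade-off is that the shared segment becomes immutable; updating a snapshot then means building a new `LogSegment` (or using `Arc::make_mut`, which copies when the segment is shared).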

roeap · 2024-01-06T07:02:59Z

Seems like the primary trade-off with this approach is a significant reduction in overall memory footprint, but it comes with the potential for additional reads to the underlying store.

Exactly @Blajda. Especially for the non-query cases (vacuum etc.) we may see more reads, and if not using the EagerSnapshot we may also do some duplicate reading. In this PR we are not yet leveraging some further optimizations that can be done in that regard.

As you mentioned, if users take care of their tables (e.g. checkpoints...) then reading from parquet is the most interesting thing. Thus, the main thing I want to do as an immediate follow-up is to introduce some optimizations for parquet reads, which I hope will come in handy for the work in #2006 - this can make reads from checkpoint files very efficient.

Recently I also stumbled across the fact that datafusion now has built-in capabilities for caching as well, which I hope we can explore down the line. The best strategy for us here may just be to introduce a thin caching layer in the log / object store to make those additional reads potentially irrelevant.
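
A thin caching layer of that kind could look roughly like the sketch below. It is purely illustrative: the `Store` trait, `CountingStore`, and `CachingStore` are hypothetical and unrelated to the real `ObjectStore` or `LogStore` traits, and the sketch assumes cached objects are immutable once written, as commit files are.

```rust
use std::collections::HashMap;

/// Hypothetical minimal store interface (not the real ObjectStore trait).
trait Store {
    fn get(&mut self, path: &str) -> Option<Vec<u8>>;
}

/// In-memory backend that counts how often it is actually read.
struct CountingStore {
    objects: HashMap<String, Vec<u8>>,
    reads: usize,
}

impl Store for CountingStore {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        self.reads += 1;
        self.objects.get(path).cloned()
    }
}

/// Wrapper that serves repeated reads of the same path from memory.
struct CachingStore<S: Store> {
    inner: S,
    cache: HashMap<String, Vec<u8>>,
}

impl<S: Store> CachingStore<S> {
    fn new(inner: S) -> Self {
        CachingStore { inner, cache: HashMap::new() }
    }
}

impl<S: Store> Store for CachingStore<S> {
    fn get(&mut self, path: &str) -> Option<Vec<u8>> {
        if let Some(hit) = self.cache.get(path) {
            return Some(hit.clone());
        }
        // Cache miss: read through to the backend and remember the bytes.
        let bytes = self.inner.get(path)?;
        self.cache.insert(path.to_string(), bytes.clone());
        Some(bytes)
    }
}

fn main() {
    let mut objects = HashMap::new();
    objects.insert(
        "_delta_log/00000000000000000000.json".to_string(),
        b"{}".to_vec(),
    );
    let mut store = CachingStore::new(CountingStore { objects, reads: 0 });

    // Two logical reads of the same commit file, one backend read.
    store.get("_delta_log/00000000000000000000.json");
    store.get("_delta_log/00000000000000000000.json");
    println!("backend reads: {}", store.inner.reads);
}
```

A real layer would also need a size bound and eviction, which this sketch leaves out.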

Blajda

Overall a very high quality PR, and the changes made to operations make sense to me.

Blajda · 2024-01-20T22:21:02Z

crates/deltalake-core/src/kernel/snapshot/serde.rs

+    where
+        V: SeqAccess<'de>,
+    {
+        println!("eager: {:?}", "start");


Forgot to remove a print :)

Blajda · 2024-01-20T22:24:23Z

crates/deltalake-core/src/writer/json.rs

@@ -46,7 +45,7 @@ pub(crate) struct DataArrowWriter {
    writer_properties: WriterProperties,
    buffer: ShareableBuffer,
    arrow_writer: ArrowWriter<ShareableBuffer>,
-    partition_values: HashMap<String, Option<String>>,
+    partition_values: BTreeMap<String, Scalar>,


Why the change from a HashMap to BTree?

For the partition values, ordering matters in several places. Previously we were using a combination of the partition columns on metadata and the partition values HashMap. With this change we can rely on the insertion order and no longer need to track the values vector.
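
A small sketch of why the map type matters here. It uses plain `String` values instead of the `Scalar` type from the diff, and `partition_path` is a hypothetical helper. Note that a `BTreeMap` iterates in sorted key order rather than insertion order; the relevant property is that the order is deterministic, whereas a `HashMap` gives no ordering guarantee.

```rust
use std::collections::BTreeMap;

/// Build a Hive-style partition path from partition values.
/// Hypothetical helper; the values are plain strings for simplicity.
fn partition_path(values: &BTreeMap<String, String>) -> String {
    values
        .iter()
        .map(|(key, value)| format!("{}={}", key, value))
        .collect::<Vec<_>>()
        .join("/")
}

fn main() {
    let mut values = BTreeMap::new();
    // Inserted as year, then month; iteration is by sorted key instead.
    values.insert("year".to_string(), "2024".to_string());
    values.insert("month".to_string(), "01".to_string());

    // Every run and every writer yields the same ordering for the same keys.
    println!("{}", partition_path(&values));
}
```

If the order of the partition columns as declared in the table metadata is what matters, the map would have to be keyed or built accordingly; this sketch only shows the determinism.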

rtyler · 2024-01-23T04:39:55Z

I am going to merge this because we need it merged, main is already broken 😒 So the repair test will need to get fixed shortly


github-actions bot added binding/rust Issues for the Rust crate crate/core labels Jan 5, 2024

roeap force-pushed the log-replay branch 3 times, most recently from 7c0c7e0 to 25a2ca0 Compare January 5, 2024 20:43

Blajda reviewed Jan 6, 2024

github-actions bot added binding/python Issues for the Python package delta-inspect labels Jan 6, 2024

roeap added 18 commits January 7, 2024 22:43

feat: read LogSegment's

6e5a898

chore: update kernel types

8218634

feat: read log commit files

49c7ff2

feat: basic log replay

978c468

feat: basic checkpoint file reading

2cdbe54

feat: protocol & metadata query

e01ec53

feat: add eager snapshot

14dfc48

test: reconcile snapshot tests

51ca3ac

feat: update log segement and eager snapshot

e721531

chore: clippy

56272fe

feat: read commit info actions

75ea658

refactor: update state.merge call sites

2d14a63

refactor: always use metadata action

54f49ba

feat: read tombstones from log

7cc1f04

refactor: consume async tombstones

a71e047

feat: make snapshots serializable

cfa647a

refactor: always use Metadata action

5e86161

feat: use EagerSnapshot

a2fc20f

roeap force-pushed the log-replay branch from 22b6e4d to a2fc20f Compare January 7, 2024 21:50

chore: remove parquet and arrow feature

34153f0

roeap force-pushed the log-replay branch from 826e0cc to 5415073 Compare January 13, 2024 22:51

roeap added 3 commits January 15, 2024 17:15

feat: stats and partition value handling

f072a00

feat: improve logical file handling

2d9a5ea

fix: test expectations

9061f4f

roeap force-pushed the log-replay branch from 27d9712 to 9061f4f Compare January 15, 2024 23:10

Merge branch 'main' into log-replay

6755707

roeap mentioned this pull request Jan 16, 2024

fix: properly deserialize percent-encoded file paths of Remove actions, to make sure tombstone and file paths match #2035

Merged

roeap added 3 commits January 16, 2024 21:53

fix: minor s3 test fixes

729a0d1

refactor: simplify kernel module structure

69db52d

Merge branch 'main' into log-replay

2a296c7

roeap marked this pull request as ready for review January 16, 2024 22:55

roeap requested review from wjones127 and fvaleye as code owners January 16, 2024 22:55

Blajda reviewed Jan 20, 2024

rtyler added 2 commits January 22, 2024 16:19

Merge branch 'main' into log-replay

3cba9f4

Correct an error from conflict resolution

72ea9b1

rtyler approved these changes Jan 23, 2024

rtyler enabled auto-merge (rebase) January 23, 2024 02:10

rtyler disabled auto-merge January 23, 2024 04:39

rtyler merged commit 1a984ce into delta-io:main Jan 23, 2024
21 of 23 checks passed

roeap deleted the log-replay branch January 23, 2024 21:10

This was referenced Jan 26, 2024

feat: arrow based table state and checkpoint handling #1837

Closed

WIP: Record Batch support for delta log stats_parsed (#435) #454

Closed

qinix mentioned this pull request Mar 25, 2024

get_app_transaction_version() returns wrong result #2340

Closed

Tom-Newton mentioned this pull request Aug 12, 2024

Transaction log parsing performance regression #2760

Closed

rtyler added a commit to rtyler/delta-rs that referenced this pull request Aug 13, 2024

feat: restore the TryFrom for DeltaTablePartition

fa3d808

I actually used this functionality 😄 It was removed in delta-io#2037 but I think it's handy and worth keeping around :shrug:

rtyler mentioned this pull request Aug 13, 2024

feat: restore the TryFrom for DeltaTablePartition #2767

Merged

github-merge-queue bot pushed a commit that referenced this pull request Aug 13, 2024

feat: restore the TryFrom for DeltaTablePartition

a0f9c18

I actually used this functionality 😄 It was removed in #2037 but I think it's handy and worth keeping around :shrug:
