
Support attaching delta tables as catalogs #110

Merged: 9 commits into duckdb:feature on Nov 8, 2024

Conversation

@samansmink (Collaborator) commented Oct 24, 2024

This draft PR adds support for attaching delta tables as catalogs.

This feature relies on several recent changes to DuckDB; therefore, this PR targets the feature branch of the delta extension.

How to use

With this PR, you can now do:

ATTACH './some/delta/table' AS my_delta_table (TYPE delta);
SELECT * FROM my_delta_table;

The ATTACH statement allows setting the PIN_SNAPSHOT option, which enforces caching of the snapshot. With a cached snapshot, DuckDB reads the snapshot only once, which can offer a small performance boost at the cost of potentially querying a stale snapshot:

ATTACH './some/delta/table' AS my_delta_table (TYPE delta, PIN_SNAPSHOT);
SELECT * FROM my_delta_table;

Performance implications

With this PR, using ATTACH should generally be as fast as, or faster than, using the delta_scan function. Note that even without the PIN_SNAPSHOT option, using ATTACH can result in a significant performance increase when a table is scanned more than once in the same query. For example:

ATTACH './some/delta/table' AS my_delta_table (TYPE delta);
FROM my_delta_table UNION ALL FROM my_delta_table;

will only scan the delta metadata once, whereas

FROM delta_scan('./some/delta/table') UNION ALL FROM delta_scan('./some/delta/table');

will scan the metadata twice.

Optimizations in this PR

The initial naive implementation resulted in a lot of snapshot metadata being read unnecessarily. The following optimizations address this:

Caching the Snapshot in the Transaction

Snapshots are cached in the transaction. This ensures both that every scan of a delta table within the same transaction sees the same version of the table, and that the delta metadata is reused instead of re-read on every scan.
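
A minimal sketch of the idea, with hypothetical names (DeltaTransaction, DeltaSnapshot, and LoadSnapshot below are illustrative, not the extension's actual API):

```c++
// Illustrative sketch only: a transaction-scoped snapshot cache, so that all
// scans of the same table within one transaction share a single snapshot.
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct DeltaSnapshot {
	// resolved table version, file list, etc. (illustrative)
};

// Stand-in for the expensive kernel metadata read
std::shared_ptr<DeltaSnapshot> LoadSnapshot(const std::string &path) {
	return std::make_shared<DeltaSnapshot>();
}

class DeltaTransaction {
public:
	std::shared_ptr<DeltaSnapshot> GetSnapshot(const std::string &table_path) {
		std::lock_guard<std::mutex> guard(lock);
		auto entry = snapshots.find(table_path);
		if (entry != snapshots.end()) {
			// Cache hit: same table version, no repeated metadata scan
			return entry->second;
		}
		auto snapshot = LoadSnapshot(table_path);
		snapshots.emplace(table_path, snapshot);
		return snapshot;
	}

private:
	std::mutex lock;
	std::unordered_map<std::string, std::shared_ptr<DeltaSnapshot>> snapshots;
};
```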

Sharing kernel snapshots

The raw ffi::SharedSnapshot that the delta kernel returns is wrapped in a thread-safe wrapper. This means that when a DuckDB DeltaSnapshot object is copied, we simply copy a reference to the thread-safe wrapper instead of reconstructing the snapshot.
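
Conceptually this is the usual shared-ownership pattern around a raw FFI handle; a sketch under assumed names (the ffi_* functions below are illustrative stand-ins, not the real kernel FFI):

```c++
#include <memory>

namespace ffi {
struct SharedSnapshot { /* opaque kernel state (illustrative) */ };
} // namespace ffi

// Illustrative stand-ins for acquiring/releasing the kernel handle
ffi::SharedSnapshot *ffi_acquire_snapshot() {
	return new ffi::SharedSnapshot();
}
void ffi_free_snapshot(ffi::SharedSnapshot *snapshot) {
	delete snapshot;
}

class DeltaSnapshot {
public:
	explicit DeltaSnapshot(ffi::SharedSnapshot *raw)
	    : kernel_snapshot(raw, ffi_free_snapshot) {
	}
	// Copying a DeltaSnapshot only bumps the (atomic) reference count of the
	// shared_ptr; the kernel snapshot itself is never reconstructed.

private:
	std::shared_ptr<ffi::SharedSnapshot> kernel_snapshot;
};
```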

Not emitting a FileExpandResult

DeltaSnapshot::GetExpandResult no longer materializes the file list in order to return FileExpandResult::NO_FILES, FileExpandResult::SINGLE_FILE, or FileExpandResult::MULTIPLE_FILES.
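
One way to achieve this, as a sketch (only the FileExpandResult values come from this PR; the file source below is an illustrative stand-in, not the real implementation): check just the first two entries instead of expanding the whole list.

```c++
#include <string>
#include <vector>

enum class FileExpandResult { NO_FILES, SINGLE_FILE, MULTIPLE_FILES };

struct DeltaSnapshot {
	std::vector<std::string> files; // stand-in for a lazily-read delta log

	// Returns the i-th file path, or "" once the log is exhausted
	std::string GetFile(size_t i) {
		return i < files.size() ? files[i] : "";
	}

	// Peek at the first two entries instead of materializing the whole list
	FileExpandResult GetExpandResult() {
		if (GetFile(0).empty()) {
			return FileExpandResult::NO_FILES;
		}
		if (GetFile(1).empty()) {
			return FileExpandResult::SINGLE_FILE;
		}
		return FileExpandResult::MULTIPLE_FILES;
	}
};
```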

TODOs

This PR is still a draft because of some remaining TODOs:

  • Fix the current regression in the parquet reader in duckdb/duckdb that is causing the CI to fail
  • Fix an issue in DuckDB's clients where rebinding is causing the bind of a delta snapshot to be called too many times

@djouallah

What's the type of this item: is it a table or a view? It seems like an external table to me:
ATTACH './some/delta/table' AS my_delta_table (TYPE delta)

@samansmink (Collaborator, Author)

@djouallah great question! It's sort of a new thing: a catalog with a default schema and table.

See duckdb/duckdb#14118, where the feature is added to DuckDB, and
https://github.com/duckdb/duckdb_delta/blob/1261ea985c0b9707b3924bec2c64ddecd35efbb6/test/sql/dat/attach.test#L72 for how you can actually switch the default catalog to this delta catalog.

Note that this is the first step towards write support: through the newly added DeltaCatalog we can start adding appends, updates etc.

@samansmink marked this pull request as ready for review November 7, 2024 15:57
@samansmink (Collaborator, Author)

CI failure is expected: it's due to a regression test comparing against a previous version that doesn't support the ATTACH statement.

@samansmink merged commit 49c902b into duckdb:feature Nov 8, 2024
18 of 19 checks passed
Mytherin added a commit to duckdb/duckdb that referenced this pull request Nov 14, 2024
Before this PR, various codepaths would internally use a separate call to
`Connection::Prepare` followed by `PreparedStatement::Execute`, even though
the two would be executed right after each other. This could lead to
expensive rebinding in queries on tables in catalogs like the new Delta
Catalog.

## Background

The Prepare-Execute model as DuckDB uses it is sensible when driven by a
user, but not always the best option when the two steps immediately follow
each other. The reason is that in most cases DuckDB runs in so-called
auto-commit mode, in which DuckDB automatically wraps each query in a
transaction. For the Prepare-Execute model this means that the prepare and
the execute each run in their own transaction. To understand the
consequences, let's consider the following:

```c++
DuckDB db(nullptr);
Connection con(db);
Connection con2(db);

// Create some test table
con.Query("CREATE TABLE a(i TINYINT)");
con.Query("INSERT INTO a VALUES (11), (12), (13)");

// We prepare our statement
auto prepared = con.Prepare("SELECT * FROM a WHERE i=$1");

// The table is altered in the meantime by a different connection
con2.Query("ALTER TABLE a ADD COLUMN b VARCHAR");

// Now during execution, our binding from the prepare is invalid! However, DuckDB will rebind to fix this.
auto result = prepared->Execute(12);
```
As the example shows, rebinding is used to recover from situations where the
binding has become invalidated between preparing and executing.

## The problem
Rebinding is in general a good thing: it allows recovery from situations
like the one above. However, it can get expensive. For the (draft)
implementation of the new Delta Catalog feature (see
duckdb/duckdb-delta#110), I would like to ensure that we don't bind delta
tables more often than absolutely necessary, because we currently can't
efficiently verify during the execution phase whether the table version we
read in the bind phase is still the latest. This means every delta table
would need to be bound twice (unless we use aggressive and user-unfriendly
caching mechanisms). I verified that the sqlite extension also suffers from
the same problem.

## Solution
The fix is to ensure we use a single transaction across the Prepare and
Execute steps where possible. This means we need to be careful about
implementing `Query` functions in clients with separate `Prepare` and
`Execute` calls, and should instead go through `Connection::SendQuery`,
`Connection::Query`, and `Connection::PendingQuery` as much as possible.

This allows catalogs to cache state in their transactions, avoiding
expensive rebinds. In the case of Delta, this allows me to cache the
Delta Snapshot in the DeltaTransaction: a safe and unobtrusive way to
ensure the snapshot stays pinned for the duration of the transaction.
Even if a rebind is triggered, the delta catalog can simply use the
cached snapshot and avoid redoing IO.
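
To make the difference concrete, a sketch against DuckDB's C++ API (the
exact signatures of the new parameterized `PendingQuery` overloads may
differ from what is shown here):

```c++
#include "duckdb.hpp"

using namespace duckdb;

int main() {
	DuckDB db(nullptr);
	Connection con(db);
	con.Query("CREATE TABLE a(i TINYINT)");
	con.Query("INSERT INTO a VALUES (11), (12), (13)");

	// Prepare and Execute as separate calls: in auto-commit mode each step
	// runs in its own transaction, so execution may have to rebind.
	auto prepared = con.Prepare("SELECT * FROM a WHERE i=$1");
	auto rebound_result = prepared->Execute(12);

	// PendingQuery keeps bind and execution in a single transaction, so a
	// catalog can cache state (e.g. a delta snapshot) across both phases.
	auto pending = con.PendingQuery("SELECT * FROM a");
	auto result = pending->Execute();
	return 0;
}
```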

In the end I added the following in this PR:
- Added new `Connection::PendingQuery` methods that allow passing
parameters
- Fixed the Python client's `Connection::Execute` by using the new
`PendingQuery` method
- Fixed the CLI by switching to `PendingQuery` instead of prepare/execute
when there are no parameters in the statement
- Fixed the C++ API's variadic `Connection::QueryParamsRecursive` by using
the new `PendingQuery` method

## Testing
I wrote some minor extra tests for the C++ API. I've also built the
aforementioned draft delta catalog PR with this branch and can confirm
that this allows us to fetch the delta snapshot only once per table per
query.

## TODOs
- Double-check other clients