feat(duckdb): restore or add new method for consuming pyarrow.dataset
#10082
Comments
We're running into some issues with the intended replacement for the previous

import ibis
import ibis.backends.duckdb
import pyarrow as pa


def attach_tables(backend, table):
    return backend.create_view("test_table", ibis.memtable(table))


backend = ibis.duckdb.connect()
n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]
table = pa.Table.from_arrays([n_legs, animals], names=names)
ibis_view = attach_tables(backend, table)
ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0

The above works in 9.4.0 but breaks in 9.5.0 with the following error:
It makes sense why this breaks, given the changes in this (and related) PR: #10055. This does, however, effectively introduce a breaking change: in the past the lifecycle of data registered as views in DuckDB was tied to the lifecycle of the DuckDB con instance, and now it has to be managed manually, as the I can work around this by holding on to the references to memtable objects explicitly, but I wonder if there's perhaps something that could be done in the
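To illustrate the lifecycle point above, here is a minimal standard-library sketch. It is not ibis code: `ToyBackend` and `ToyMemtable` are hypothetical stand-ins, and the sketch assumes (as the comment suggests, but does not confirm in detail) that registered data is released once the memtable object that owns it is garbage collected.

```python
import gc
import weakref


class ToyBackend:
    """Hypothetical stand-in for a connection that tracks registered data."""

    def __init__(self):
        self.registered = {}

    def register(self, name, owner, data):
        self.registered[name] = data
        # Assumption: the registration is dropped once `owner` is collected.
        weakref.finalize(owner, self.registered.pop, name, None)


class ToyMemtable:
    """Hypothetical stand-in for an in-memory table expression."""

    def __init__(self, data):
        self.data = data


def attach_inline(backend, data):
    # The memtable is constructed inline, so nothing references it
    # once this function returns.
    backend.register("inline", ToyMemtable(data), data)


def attach_and_keep(backend, data, keep):
    memtable = ToyMemtable(data)
    keep.append(memtable)  # the caller now controls the lifetime
    backend.register("kept", memtable, data)


backend = ToyBackend()
held = []
attach_inline(backend, [2, 4, 5, 100])
attach_and_keep(backend, [2, 4, 5, 100], held)
gc.collect()  # make collection explicit on non-refcounting interpreters

inline_alive = "inline" in backend.registered
kept_alive = "kept" in backend.registered
print(inline_alive, kept_alive)
```

Under that assumption, only the registration whose owner is still referenced survives, which matches the workaround of holding on to memtable references explicitly.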
Thanks for that extra context @acuitymd-filip, and sorry about the inadvertent breaking change. We really try to avoid those (without major version bumps), but things do slip through. I can think of a relatively straightforward way to handle passing the in-memory objects directly to
Why wouldn't passing the name argument to memtable work? That's no different from creating a view of a memtable.
The One thing we could do is if something not an It would be tricky, but probably doable to handle
I did some poking at this. The
I'm not sure why naming the memtable and not using The issue seems to be creating a view from an inline-constructed memtable. If you don't do that and instead name the memtable with the name you'd give to the view, then I don't think there's an issue, because those memtables now must be in scope to use them.
That seems viable, just requires different work (which is fine). We don't support I think, especially, we need to ensure that DuckDB can still push down filters and projections
Hey @acuitymd-filip -- if you want to check out #10206, that should allow for creating views (via memtable) of
In #9666 we deprecated the read_in_memory method on the DuckDB backend in favor of creating a memtable or passing an object directly to create_table.
This does leave out pyarrow.dataset.Dataset objects, which we previously supported via read_in_memory. They have a to_table() method, but materializing them may not be ideal, especially since they can point to (lists of) remote resources and DuckDB can (I believe) push filters down into the dataset reads to do column pruning and filtering.
We could add a memtable wrapper for them, although the deferred aspect is a bit different from other memtable-able objects.
Whatever we decide on might also apply to pyarrow record-batch-readers.