Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(duckdb): restore or add new method for consuming pyarrow.dataset #10082

Closed
gforsyth opened this issue Sep 10, 2024 · 8 comments · Fixed by #10206
Closed

feat(duckdb): restore or add new method for consuming pyarrow.dataset #10082

gforsyth opened this issue Sep 10, 2024 · 8 comments · Fixed by #10206
Labels
duckdb The DuckDB backend
Milestone

Comments

@gforsyth
Copy link
Member

In #9666 we deprecated the read_in_memory method on the DuckDB backend in favor of creating a memtable or passing an object directly to create_table.

This does leave out pyarrow.dataset.Datasets, which we previously supported via read_in_memory. They have a to_table() method, but materializing them may not be ideal, especially since they can point to (lists of) remote resources and DuckDB can (I believe) push filters down into the dataset reads to do column pruning and filtering.

We could add a memtable wrapper for them, although the deferred aspect is a bit different from other memtable-able objects.
Whatever we decide on might also apply to pyarrow record-batch-readers.

@acuitymd-filip
Copy link

acuitymd-filip commented Sep 12, 2024

We're running into some issues with the intended replacement for the previous read_in_memory functionality in 9.5.0. Here's a quick reproduction of the issue:

import ibis
import ibis.backends.duckdb
import pyarrow as pa


def attach_tables(backend, table):
    return backend.create_view("test_table", ibis.memtable(table))


backend = ibis.duckdb.connect()
n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]
table = pa.Table.from_arrays([n_legs, animals], names=names)

ibis_view = attach_tables(backend, table)
ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0

The above works in 9.4.0 but breaks in 9.5.0 with the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/filip/code/ibis-memtable-reproduction/src/ibis_memtable_reproduction/__main__.py", line 19, in <module>
    ibis_view.execute()  # This will raise an error in 9.5.0, but not in 9.4.0
    ^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/expr/types/core.py", line 396, in execute
    return self._find_backend(use_default=True).execute(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py", line 1444, in execute
    table = self._to_duckdb_relation(expr, params=params, limit=limit).arrow()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/filip/code/ibis-memtable-reproduction/.venv/lib/python3.12/site-packages/ibis/backends/duckdb/__init__.py", line 1377, in _to_duckdb_relation
    return self.con.sql(sql)
           ^^^^^^^^^^^^^^^^^
duckdb.duckdb.CatalogException: Catalog Error: Table with name ibis_pyarrow_memtable_r64pwlzyvfhflg2rexyz6hgutq does not exist!
Did you mean "sqlite_temp_master"?

It makes sense why this breaks, given the changes in this (and related) PR: #10055

This does however introduce a breaking change effectively, as in the past the lifecycle of data registered as views in DuckDB was tied to the lifecycle of the DuckDB con instance, and now has to be managed manually, as the memtable references will get GCd after they go out of scope, regardless of whether they have been registered as views with DuckDB backend or not.

I can work around this by holding on to the references to memtable objects explicitly, but I wonder if there's perhaps someting that could be done in the create_view method that would allow Ibis to associate an instance of memtable with the instance of the backend that's using it?

@gforsyth
Copy link
Member Author

Thanks for that extra context @acuitymd-filip, and sorry about the inadvertent breaking change. We really try to avoid those (without major version bumps), but things do slip through.

I can think of a relatively straightforward way to handle passing the in-memory objects directly to create_view (without wrapping in memtable), might be a bit kludgy. I'll take a look.

@cpcloud
Copy link
Member

cpcloud commented Sep 13, 2024

Why wouldn't passing the name argument to memtable work? That's no different from creating a view of a memtable.

@gforsyth
Copy link
Member Author

Why wouldn't passing the name argument to memtable work? That's no different from creating a view of a memtable.

The memtable still gets GC'd (if defined as it is in the example above) -- this is a non-issue in create_table because the data are copied.

One thing we could do is if something not an ir.Expr is passed to create_view, we can register it via the (private) _read_in_memory dispatcher -- that would create the view of the object without the GC behavior cleaning up the underlying object (no longer a memtable).

It would be tricky, but probably doable to handle overwrite=True in that context -- I don't know how we can respect custom catalog.database locations for registration, though

@gforsyth
Copy link
Member Author

I did some poking at this.

The register method for creating a view from an existing in-memory Python object always creates a view in the temp catalog -- it's always temp and so it always goes in the temp catalog. So I think we just raise if someone passes in a non ir.Expr and tries to specify database=. If they need something persistent, that should go through create_table anyway, which does handle this -- and if they really need a view in a particular catalog, they can create a view of the view (although this won't survive longer than the existing session because the intermediate temporary view from register will get dropped)

@cpcloud
Copy link
Member

cpcloud commented Sep 18, 2024

I'm not sure why naming the memtable and not using create_view doesn't solve the scoping problem here.

The issue seems to be creating a view from an inline-constructed memtable. If you don't do that and instead name the memtable with the name you'd give to the view, then I don't think there's an issue, because those memtables now must be in scope to use them.

@gforsyth
Copy link
Member Author

gforsyth commented Sep 18, 2024

That seems viable, just requires different work (which is fine). We don't support dataset as inputs to memtable, and there are some semantics we'd have to sort out around consuming / materializing the (potentially remote) data inside the dataset.

I think, especially, we need to ensure that DuckDB can still push down filters and projections

@gforsyth
Copy link
Member Author

Hey @acuitymd-filip -- if you want to check out #10206, that should allow for creating views (via memtable) of pyarrow.dataset.Datasets and pushdown from DuckDB is working.

@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Sep 26, 2024
@github-actions github-actions bot added this to the 10.0 milestone Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
duckdb The DuckDB backend
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants