
data pond: expose readable datasets as dataframes and arrow tables #1507

Merged · 119 commits · Oct 8, 2024

Conversation

sh-rp (Collaborator) commented Jun 21, 2024

Description

As an alternative to the ibis integration, we are testing whether we can build our own data reader, without too much effort, that works across all destinations.

Ticket for followup work after this PR is here: https://github.com/orgs/dlt-hub/projects/9/views/1?pane=issue&itemId=80696433

TODO

  • Build dataset and relation interfaces (see @rudolfix comment below)
  • Extend DBApiCursorImpl to support arrow tables (some native cursors support arrow)
  • Ensure all native cursors that have native arrow and pandas support forward it to DBApiCursorImpl
  • Expose a prepopulated duckdb instance from the filesystem destination somehow? Possibly via the fs_client interface
  • Figure out default chunk sizes and a nice interface (some cursors/databases determine their own chunk size, such as snowflake; others only return chunks in vector sizes, such as duckdb)
  • Ensure up-to-date docstrings on all new interfaces and methods
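For context, the fallback path behind the second item can be sketched with plain DBAPI calls (ChunkedCursor is an illustrative name, not dlt's actual class): when a driver has no native arrow or pandas support, chunked reading reduces to fetchmany.

```python
import sqlite3
from typing import Any, Iterator, List, Tuple

class ChunkedCursor:
    """Illustrative wrapper: yield row chunks from any DBAPI cursor,
    mimicking the fallback when a driver lacks native arrow support."""

    def __init__(self, cursor: Any, chunk_size: int = 1000) -> None:
        self.cursor = cursor
        self.chunk_size = chunk_size

    def iter_chunks(self) -> Iterator[List[Tuple]]:
        while True:
            rows = self.cursor.fetchmany(self.chunk_size)
            if not rows:
                return
            yield rows

# demo on sqlite3, whose cursor implements the standard fetchmany
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(5)])
cur = conn.execute("SELECT x FROM t ORDER BY x")
chunks = list(ChunkedCursor(cur, chunk_size=2).iter_chunks())  # 3 chunks of at most 2 rows
```

A native-arrow cursor would override this with the driver's own batch API; the chunk_size question in the list above is exactly about reconciling the two paths.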


netlify bot commented Jun 21, 2024

Deploy Preview for dlt-hub-docs canceled.

Latest commit: fb9a445
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/6704fad0fe6dab0008567bbd

@sh-rp sh-rp linked an issue Jun 21, 2024 that may be closed by this pull request
sh-rp added 3 commits June 24, 2024 17:34
# Conflicts:
#	dlt/common/destination/reference.py
#	dlt/destinations/sql_client.py
#	dlt/pipeline/pipeline.py
@sh-rp sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 0e7b165 to 6dce626 Compare July 17, 2024 15:10
rudolfix (Collaborator) left a comment

@sh-rp pls see my comments

dlt/destinations/impl/filesystem/filesystem.py (outdated, resolved)
dlt/destinations/impl/filesystem/filesystem.py (outdated, resolved)
composable_pipeline_1.py (outdated, resolved)
"""Add support for accessing data as arrow tables or pandas dataframes"""

@abstractmethod
def iter_df(
jorritsandbrink (Collaborator) commented:

I wouldn't explicitly expose Pandas dataframes. I would only expose Arrow data structures , because the user can call to_pandas on those.

sh-rp (Collaborator, Author) replied:

@jorritsandbrink there are some destinations that have native pandas support and I think it would be cool to be able to expose those directly for the user

) -> Generator[DataFrame, None, None]: ...

@abstractmethod
def iter_arrow(
jorritsandbrink (Collaborator) commented Jul 29, 2024:

It's probably a good idea to support pyarrow.RecordBatchReader alongside pyarrow.Table for larger-than-memory data.

Or, if possible, expose a pyarrow.Dataset, from which the user can create either a pyarrow.RecordBatchReader or pyarrow.Table.

Collaborator commented:

Note: using Dataset possibly allows us to "mount" any external table as a cursor that reads data, so we do not need to download the full table locally. we may be able to e.g. mount snowflake or bigquery tables as duckdb tables

@sh-rp sh-rp changed the title [experiment] expose readable datasets as dataframes and arrow tables expose readable datasets as dataframes and arrow tables Aug 6, 2024
@sh-rp sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 9538229 to 13ec73b Compare August 6, 2024 14:18
dlt/destinations/impl/filesystem/sql_client.py (outdated, resolved)
dlt/destinations/impl/filesystem/sql_client.py (outdated, resolved)
# set up connection and dataset
self._existing_views: List[str] = []  # remember which views were already created

self.autocreate_required_views = False
Collaborator commented:

why do we need this flag, and why is it set to False here? is querying INFORMATION_SCHEMA a problem? IMO we should fix execute_query below

sh-rp (Collaborator, Author) replied:

there is some bug in sqlglot that throws an internal exception when parsing the information schema query, and I think for now this approach is better than catching all exceptions in the sqlglot parsing and ignoring them.

Collaborator replied:

sqlglot was complaining that it cannot parse parametrized queries, which is fair... this was also a good opportunity to fix things, so I'm skipping parsing in this case now

dlt/destinations/impl/filesystem/sql_client.py (outdated, resolved)
tests/load/test_read_interfaces.py (outdated, resolved)
tests/load/test_read_interfaces.py (outdated, resolved)

# now we can use the filesystemsql client to create the needed views
fs_client: Any = pipeline.destination_client()
Collaborator commented:

the interface is already there: it is called update_stored_schema on the job client. in this case filesystem is a staging destination for duckdb and you create all tables as views, pass the credentials etc. maybe we need some convenience method that creates a dataset instance out of such a structure (so we take not just the destination but also the staging destination as input to the dataset).

the nice thing is that all this duckdb-related code that creates views and does the permission handover could go to the duckdb client.

this is a good follow-up ticket: #1692

dlt/pipeline/pipeline.py (resolved)
rudolfix (Collaborator) left a comment

code is good now but we are missing a few tests specific to the sql_client for filesystem:

  1. make sure you can "open_connection" several times on sql_client
  2. pass an external connection, e.g. to a duckdb instance that is persisted, try to create a few views, then use it from the existing connection (or after reopening the persistent db)
  3. test that we skip creating views for tables from "foreign" schemas (not the dataset), e.g. by querying a known table but with a schema mismatch

ideally we'd move the checks in test_read_interfaces.py that are done only for filesystem to these specific tests. we can do that later as well

sh-rp (Collaborator, Author) commented Oct 7, 2024

@rudolfix I think all is addressed in the additional commit I just made. I'm not quite sure what you mean by testing open_connection several times. I am testing this by opening the sql_client context on the filesystem a few times in a row now, but there is no test for parallel access; if that is what you need, lmk.

@sh-rp sh-rp force-pushed the exp/1095-expose-readable-datasets branch from c41ecca to 631d50b Compare October 7, 2024 13:36
rudolfix previously approved these changes Oct 7, 2024

rudolfix (Collaborator) left a comment:

LGTM!

@sh-rp sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 9885c5a to 41926ae Compare October 7, 2024 18:32
rudolfix previously approved these changes Oct 8, 2024
@sh-rp sh-rp force-pushed the exp/1095-expose-readable-datasets branch from 8658fa8 to fb9a445 Compare October 8, 2024 09:26
@sh-rp sh-rp merged commit 4ee65a8 into devel Oct 8, 2024
61 checks passed
@sh-rp sh-rp deleted the exp/1095-expose-readable-datasets branch October 8, 2024 12:31
Project status: Done

Successfully merging this pull request may close these issues.

  • access data after load
  • load as dataframes with ibis
4 participants