data pond: expose readable datasets as dataframes and arrow tables #1507
Conversation
@sh-rp pls see my comments
dlt/common/destination/reference.py (outdated)
```py
"""Add support for accessing data as arrow tables or pandas dataframes"""

@abstractmethod
def iter_df(
```
I wouldn't explicitly expose pandas dataframes. I would only expose Arrow data structures, because the user can call `to_pandas` on those.
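For context on that suggestion, the conversion really is a one-liner in pyarrow; a minimal sketch with made-up data:

```py
import pyarrow as pa

# any Arrow table returned by the reader; the data here is illustrative
table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

df = table.to_pandas()  # no extra API surface needed on dlt's side
```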
@jorritsandbrink there are some destinations that have native pandas support and I think it would be cool to be able to expose those directly to the user
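To illustrate what such native support looks like (examples of the drivers' own APIs, not code from this PR):

```py
import duckdb

con = duckdb.connect()
# duckdb materializes a query result as a pandas DataFrame natively
df = con.execute("SELECT 42 AS answer").df()

# snowflake-connector-python offers cursor.fetch_pandas_all() in the same spirit
```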
dlt/common/destination/reference.py (outdated)
```py
) -> Generator[DataFrame, None, None]: ...

@abstractmethod
def iter_arrow(
```
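Read together, the two hunks above suggest an interface along these lines. This is a sketch pieced together from the excerpts, not the code under review; the class name, the `chunk_size` parameter, and the loose type aliases are assumptions:

```py
from abc import ABC, abstractmethod
from typing import Any, Generator, Optional

DataFrame = Any   # stand-in alias; the real module types this properly
ArrowTable = Any  # stand-in alias

class SupportsReadableData(ABC):
    """Add support for accessing data as arrow tables or pandas dataframes"""

    @abstractmethod
    def iter_df(
        self, chunk_size: Optional[int] = None
    ) -> Generator[DataFrame, None, None]:
        """Yield the data as pandas dataframes, optionally in chunks."""
        ...

    @abstractmethod
    def iter_arrow(
        self, chunk_size: Optional[int] = None
    ) -> Generator[ArrowTable, None, None]:
        """Yield the data as arrow tables, optionally in chunks."""
        ...
```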
It's probably a good idea to support `pyarrow.RecordBatchReader` alongside `pyarrow.Table` for larger-than-memory data. Or, if possible, expose a `pyarrow.Dataset`, from which the user can create either a `pyarrow.RecordBatchReader` or a `pyarrow.Table`.
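To make the distinction concrete, a generic pyarrow sketch (unrelated to dlt's internals): a `RecordBatchReader` streams batches and never holds the full result, while `read_all()` materializes a `Table`.

```py
import pyarrow as pa

schema = pa.schema([("id", pa.int64())])
batches = (pa.RecordBatch.from_pydict({"id": [i]}, schema=schema) for i in range(3))

# a reader streams batches without ever holding the full result in memory
reader = pa.RecordBatchReader.from_batches(schema, batches)
for batch in reader:
    print(batch.num_rows)

# reader.read_all() would instead materialize everything as one pa.Table
```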
Note: using `Dataset` possibly allows us to "mount" any external table as a cursor that reads data, so we do not need to download the full table locally. We may be able to, e.g., mount Snowflake or BigQuery tables as duckdb tables.
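A sketch of that idea with duckdb and a local parquet dataset (the path and table name are made up; mounting Snowflake or BigQuery would additionally need their arrow-producing drivers):

```py
import duckdb
import pyarrow.dataset as ds

# point an arrow dataset at files a destination wrote; nothing is read yet
dataset = ds.dataset("data/my_table/", format="parquet")

con = duckdb.connect()
con.register("my_table", dataset)  # duckdb scans the arrow dataset lazily
print(con.execute("SELECT count(*) FROM my_table").fetchone())
```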
```py
# set up connection and dataset
self._existing_views: List[str] = []  # remember which views were already created

self.autocreate_required_views = False
```
why do we need this flag and why is it set to False here? is querying INFORMATION_SCHEMA a problem? IMO we should fix execute_query below
there is some bug in sqlglot that throws an internal exception when parsing the information schema query, and I think for now this method is better than catching all exceptions and ignoring them in the sqlglot parsing.
sqlglot was complaining that it cannot parse parameterized queries, which is fair... this was also a good opportunity to fix things, so I'm skipping parsing in this case now
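A minimal sketch of that approach (the helper name and fallback behavior are assumptions, not the PR's code): parse only when there are no query parameters, and let callers treat `None` as "create no views".

```py
from typing import Any, List, Optional, Sequence

import sqlglot
import sqlglot.expressions as exp

def extract_table_names(query: str, args: Optional[Sequence[Any]]) -> Optional[List[str]]:
    # hypothetical helper: parameterized queries cannot be parsed by
    # sqlglot, so we skip parsing entirely instead of swallowing errors
    if args:
        return None
    try:
        expression = sqlglot.parse_one(query, read="duckdb")
    except sqlglot.errors.ParseError:
        return None
    return [table.name for table in expression.find_all(exp.Table)]
```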
tests/load/test_read_interfaces.py (outdated)
```py
# now we can use the filesystem sql client to create the needed views
fs_client: Any = pipeline.destination_client()
```
the interface is already there. it is called `update_stored_schema` on the job client. in this case `filesystem` is a staging destination for `duckdb` and you create all tables as views, pass the credentials etc. maybe we need some convenience method that creates a `dataset` instance out of such a structure (so we take not just destination but also staging as input to dataset). the nice thing is that all this duckdb related code that creates views and does the permission handover could go to the duckdb client. this is a good followup ticket
code is good now but we're missing a few tests specific to the filesystem sql_client:
- make sure you can "open_connection" several times on sql_client (a minimal sketch follows this list)
- pass an external connection, e.g. to a persisted duckdb, try to create a few views, then use it from the existing connection (or after reopening the persistent db)
- test that we skip creating views for tables from "foreign" schemas (not the dataset), e.g. by querying a known table name with a schema mismatch

ideally we'd move the checks in test_read_interfaces.py that are done only for filesystem into these specific tests. we can do that later as well.
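A minimal sketch of the first suggested test, assuming a `pipeline` fixture that has already run a load to the filesystem destination:

```py
def test_open_connection_several_times(pipeline) -> None:
    # reopening the sql_client context repeatedly should recreate or
    # reuse the dataset views without errors (sketch; fixture is assumed)
    for _ in range(3):
        with pipeline.sql_client() as client:
            assert client.execute_sql("SELECT 1")[0][0] == 1
```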
@rudolfix I think all is addressed in the additional commit I just made. I'm not quite sure what you mean by testing open_connection several times. I am testing this by opening the sql_client context on the filesystem a few times in a row now, but there is no test for parallel access; if that is what you need, lmk.
LGTM!
Description
As an alternative to the ibis integration, we are testing whether we can create our own data reader that works across all destinations without too much effort.
Ticket for followup work after this PR is here: https://github.com/orgs/dlt-hub/projects/9/views/1?pane=issue&itemId=80696433
TODO
- `DBApiCursorImpl` to support arrow tables (some native cursors support arrow)
- `DBApiCursorImpl` …