AirbyteLib: Add Lazy Datasets and iterator syntax support for datasets, caches, and read results #34429
Merged
17 commits
776ef1c  revise postgres dependencies (aaronsteers)
e958779  add tests for lazy, sql, and cached datasets (aaronsteers)
21d69ef  working and passing tests: sql, lazy, and cached datasets (aaronsteers)
d774c3b  updated docs (aaronsteers)
dd03c49  rename module from _cached to _sql (aaronsteers)
0b75ab6  update docs (aaronsteers)
746a7ad  lint fixes (aaronsteers)
210c0bf  add tests for sql filtering (now passing) (aaronsteers)
cc7f10f  add cache iterator checks (aaronsteers)
fa9730b  update docstrings (aaronsteers)
53d68ee  Merge branch 'master' into aj/lazy-datasets (aaronsteers)
77bf28a  Merge branch 'master' into aj/lazy-datasets (aaronsteers)
ede6e74  actually run tests
3498dfa  fix docs (aaronsteers)
cefff31  refactor look-alike `Source.streams` to `Source._selected_stream_names` (aaronsteers)
63912cd  update docs (aaronsteers)
8cab3ea  add tests (aaronsteers)
@@ -1,10 +1,13 @@
 from airbyte_lib.datasets._base import DatasetBase
-from airbyte_lib.datasets._cached import CachedDataset
+from airbyte_lib.datasets._lazy import LazyDataset
 from airbyte_lib.datasets._map import DatasetMap
+from airbyte_lib.datasets._sql import CachedDataset, SQLDataset


 __all__ = [
     "CachedDataset",
     "DatasetBase",
     "DatasetMap",
+    "LazyDataset",
+    "SQLDataset",
 ]
This file was deleted.
@@ -0,0 +1,119 @@
+# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
+from __future__ import annotations
+
+from collections.abc import Mapping
+from typing import TYPE_CHECKING, Any, cast
+
+from overrides import overrides
+from sqlalchemy import and_, text
+
+from airbyte_lib.datasets._base import DatasetBase
+
+
+if TYPE_CHECKING:
+    from collections.abc import Iterator
+
+    from pandas import DataFrame
+    from sqlalchemy import Selectable, Table
+    from sqlalchemy.sql import ClauseElement
+
+    from airbyte_lib.caches import SQLCacheBase
+
+
+class SQLDataset(DatasetBase):
+    """A dataset that is loaded incrementally from a SQL query.
+
+    The CachedDataset class is a subclass of this class, which simply passes a SELECT over the full
+    table as the query statement.
+    """
+
+    def __init__(
+        self,
+        cache: SQLCacheBase,
+        stream_name: str,
+        query_statement: Selectable,
+    ) -> None:
+        self._cache: SQLCacheBase = cache
+        self._stream_name: str = stream_name
+        self._query_statement: Selectable = query_statement
+
+    @property
+    def stream_name(self) -> str:
+        return self._stream_name
+
+    def __iter__(self) -> Iterator[Mapping[str, Any]]:
+        with self._cache.get_sql_connection() as conn:
+            for row in conn.execute(self._query_statement):
+                # Access to private member required because SQLAlchemy doesn't expose a public API.
+                # https://pydoc.dev/sqlalchemy/latest/sqlalchemy.engine.row.RowMapping.html
+                yield cast(Mapping[str, Any], row._mapping)  # noqa: SLF001
+
+    def to_pandas(self) -> DataFrame:
+        return self._cache.get_pandas_dataframe(self._stream_name)
+
+    def with_filter(self, *filter_expressions: ClauseElement | str) -> SQLDataset:
+        """Filter the dataset by a set of column values.
+
+        Filters can be specified as either a string or a SQLAlchemy expression.
+
+        Filters are lazily applied to the dataset, so they can be chained together. For example:
+
+            dataset.with_filter("id > 5").with_filter("id < 10")
+
+        is equivalent to:
+
+            dataset.with_filter("id > 5", "id < 10")
+        """
+        # Convert all strings to TextClause objects.
+        filters: list[ClauseElement] = [
+            text(expression) if isinstance(expression, str) else expression
+            for expression in filter_expressions
+        ]
+        filtered_select = self._query_statement.where(and_(*filters))
+        return SQLDataset(
+            cache=self._cache,
+            stream_name=self._stream_name,
+            query_statement=filtered_select,
+        )
+
+
+class CachedDataset(SQLDataset):
+    """A dataset backed by a SQL table cache.
+
+    Because this dataset includes all records from the underlying table, we also expose the
+    underlying table as a SQLAlchemy Table object.
+    """
+
+    def __init__(self, cache: SQLCacheBase, stream_name: str) -> None:
+        self._cache: SQLCacheBase = cache
+        self._stream_name: str = stream_name
+        self._query_statement: Selectable = self.to_sql_table().select()
+
+    @overrides
+    def to_pandas(self) -> DataFrame:
+        return self._cache.get_pandas_dataframe(self._stream_name)
+
+    def to_sql_table(self) -> Table:
+        return self._cache.get_sql_table(self._stream_name)
+
+    def __eq__(self, value: object) -> bool:
+        """Return True if the value is a CachedDataset with the same cache and stream name.
+
+        In the case of CachedDataset objects, we can simply compare the cache and stream name.
+
+        Note that this equality check is only supported on CachedDataset objects and not for
+        the base SQLDataset implementation. This is because of the complexity and computational
+        cost of comparing two arbitrary SQL queries that could be bound to different variables,
+        as well as the chance that two queries can be syntactically equivalent without being
+        text-wise equivalent.
+        """
+        if not isinstance(value, SQLDataset):
+            return False
+
+        if self._cache is not value._cache:
+            return False
+
+        if self._stream_name != value._stream_name:
+            return False
+
+        return True
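
The lazy filter chaining described in the `with_filter` docstring can be tried out without Airbyte or SQLAlchemy installed. What follows is a minimal sketch under stated assumptions, not the code from this PR: `LazyFilteredDataset`, its `to_sql` helper, and the `users` table are hypothetical names, and the standard-library `sqlite3` module stands in for the cache's SQLAlchemy connection. It shows that each `with_filter` call only records the filter and returns a new dataset, and that the query runs when the dataset is iterated.

```python
from __future__ import annotations

import sqlite3
from collections.abc import Iterator, Mapping
from typing import Any


class LazyFilteredDataset:
    """Hypothetical stand-in for SQLDataset: stores a query description, runs it on iteration."""

    def __init__(
        self,
        conn: sqlite3.Connection,
        table_name: str,
        filters: tuple[str, ...] = (),
    ) -> None:
        self._conn = conn
        self._table_name = table_name
        self._filters = filters

    def with_filter(self, *filter_expressions: str) -> LazyFilteredDataset:
        # Nothing is executed here; a new dataset with the extra filters is returned.
        return LazyFilteredDataset(
            self._conn, self._table_name, self._filters + filter_expressions
        )

    def to_sql(self) -> str:
        # All recorded filters are combined with AND, mirroring and_(*filters) in the diff.
        sql = f'SELECT * FROM "{self._table_name}"'
        if self._filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self._filters)
        return sql

    def __iter__(self) -> Iterator[Mapping[str, Any]]:
        # The query is executed only when iteration starts.
        cursor = self._conn.execute(self.to_sql())
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))


# Illustrative data: an in-memory table with ids 1 through 12.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user-{i}") for i in range(1, 13)],
)

dataset = LazyFilteredDataset(conn, "users")
chained = dataset.with_filter("id > 5").with_filter("id < 10")
combined = dataset.with_filter("id > 5", "id < 10")

chained_ids = sorted(record["id"] for record in chained)
combined_ids = sorted(record["id"] for record in combined)
all_ids = sorted(record["id"] for record in dataset)
print(chained_ids)  # [6, 7, 8, 9]
```

Both forms select the same records, and the unfiltered `dataset` still yields all twelve rows because `with_filter` never mutates the object it is called on. The sketch interpolates filter strings directly into the SQL, so it is only suitable for trusted input.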
This is new. Basically, this tells us all the streams we've seen so far.