AirbyteLib: Add Lazy Datasets and iterator syntax support for datasets, caches, and read results #34429
@property
def _streams_with_data(self) -> set[str]:
    """Return the set of streams seen so far."""
    return self._pending_batches.keys() | self._finalized_batches.keys()
This is new. Basically, this tells us all the streams we've seen so far.
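A minimal sketch of how this property behaves, assuming both batch attributes are plain dicts keyed by stream name. The `BatchTracker` class and the batch values are hypothetical stand-ins and are not taken from the PR.

```python
from __future__ import annotations


class BatchTracker:
    """Hypothetical stand-in for the class that owns the two batch dicts."""

    def __init__(self) -> None:
        # Assumed shape: stream name -> list of batch handles.
        self._pending_batches: dict[str, list[str]] = {}
        self._finalized_batches: dict[str, list[str]] = {}

    @property
    def _streams_with_data(self) -> set[str]:
        """Return the set of stream names seen so far."""
        # Dict key views support set operators; `|` yields a plain set.
        return self._pending_batches.keys() | self._finalized_batches.keys()


tracker = BatchTracker()
tracker._pending_batches["users"] = ["batch-1"]
tracker._finalized_batches["orders"] = ["batch-0"]
tracker._finalized_batches["users"] = ["batch-0"]

# A stream present in both dicts appears only once in the union.
seen = tracker._streams_with_data
```

Because the union of two key views is a `set`, the `set[str]` return annotation is accurate, and callers should not rely on any ordering.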
Per Joe, does this work?:
for stream_name, dataset in result.streams.items():
    ...
Update: @flash1293 - Confirmed this syntax works as expected, and I've added tests that confirm this. In the process I noticed a look-alike
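The iteration syntax confirmed above can be illustrated with a small mapping-style object. This is only a sketch: `DemoReadResult` is a hypothetical class that stores lists of records in memory, and it does not reproduce the actual read result or dataset classes in the PR.

```python
from __future__ import annotations

from collections.abc import Iterator, Mapping


class DemoReadResult(Mapping):
    """Hypothetical read result that maps stream names to datasets."""

    def __init__(self, data: dict[str, list[dict]]) -> None:
        self._data = data

    def __getitem__(self, stream_name: str) -> list[dict]:
        return self._data[stream_name]

    def __iter__(self) -> Iterator[str]:
        # Iterating the result directly yields stream names.
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    @property
    def streams(self) -> Mapping[str, list[dict]]:
        # More explicit alias for the same mapping behavior.
        return self


result = DemoReadResult(
    {"users": [{"id": 1}], "orders": [{"id": 10}, {"id": 11}]}
)

# Explicit form, as in the review comment.
counts = {name: len(dataset) for name, dataset in result.streams.items()}

# Direct form: iterate or index the result object itself.
names_direct = list(result)
```

Implementing `Mapping` supplies `.items()`, `.keys()`, `.values()`, and `in` checks from just the three dunder methods, which keeps the explicit and direct forms consistent.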
Given a source and cache declared something like this...
This adds support for all of these syntaxes:
Above, the .streams property exists as a more explicit reference than calling iterator or key operations directly against the object.
CachedDataset and SQLDataset now also have a with_filter() method that creates a new SQLDataset by filtering down the original dataset. This is done lazily, so nothing is executed on the connection until the dataset is invoked as an iterator or its data is otherwise read.
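The lazy behavior described here can be sketched with the standard-library `sqlite3` module. `LazySQLDataset`, its `to_sql()` helper, and the `my_stream` table are hypothetical names for illustration only; the PR's SQLDataset may be implemented quite differently.

```python
from __future__ import annotations

import sqlite3
from collections.abc import Iterator


class LazySQLDataset:
    """Hypothetical dataset that defers its query until iteration."""

    def __init__(self, conn: sqlite3.Connection, table: str,
                 filters: tuple[str, ...] = ()) -> None:
        self._conn = conn
        self._table = table
        self._filters = filters

    def with_filter(self, expression: str) -> LazySQLDataset:
        # Returns a new dataset; the original is left unchanged and
        # nothing is sent to the connection yet.
        return LazySQLDataset(
            self._conn, self._table, self._filters + (expression,)
        )

    def to_sql(self) -> str:
        # Filters are raw SQL fragments joined with AND. This is fine for a
        # sketch but must never receive untrusted input.
        sql = f"SELECT * FROM {self._table}"
        if self._filters:
            sql += " WHERE " + " AND ".join(f"({f})" for f in self._filters)
        return sql

    def __iter__(self) -> Iterator[dict]:
        # The query runs only once iteration actually starts.
        cursor = self._conn.execute(self.to_sql())
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_stream (column1 TEXT, column2 INTEGER)")
conn.executemany(
    "INSERT INTO my_stream VALUES (?, ?)",
    [("value1", 1), ("value1", 2), ("other", 1)],
)

# Record every statement SQLite executes from this point on.
executed: list[str] = []
conn.set_trace_callback(executed.append)

dataset = LazySQLDataset(conn, "my_stream")
filtered = dataset.with_filter("column2 == 1").with_filter("column1 == 'value1'")
selects_before = [s for s in executed if s.startswith("SELECT")]

rows = list(filtered)  # Iteration triggers the single SELECT.
selects_after = [s for s in executed if s.startswith("SELECT")]
```

Each `with_filter()` call only appends to an immutable tuple of filter expressions, so chained filters combine into one query instead of one query per call.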
These are all equivalent:
Other points:
to_sql_table() but CachedDataset objects do.
filtered_dataset: SQLDataset = cached_dataset.with_filter("column2 == 1").with_filter("column1 == 'value1'")
self._streams_with_data. This is a subset of the streams that are declared in the catalog, and this is also where a cache might contain different members than a read result, even if the underlying data is the same.