-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Adds daft-catalog, daft-table and daft-session for catalog integrations. #3805
Conversation
CodSpeed Performance ReportMerging #3805 will degrade performances by 36.82%Comparing Summary
Benchmarks breakdown
|
5d70408
to
b60e0b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, left some comments and questions about backward compatibility/how to move our users off of all our current Catalog interfaces to the new one
Also wrt examples, I'm most curious about e2e examples from the user perspective for these -- I understand not everything is implemented yet, but would love to maybe have them in a doc where we can just view the overall UX from a Daft user perspective.
- For a user that just imports Daft and "has their catalogs configured for them already" how do they:
- List available tables?
- Read a table into a dataframe?
- Write into a table (append vs overwrite vs overwrite-dynamic)?
- Upsert? (this one is complicated, open to just stubbing APIs if we can at least show the user flow)
- For a user that would like to configure their session/catalog programmatically:
- How do I register my AWS Glue/Unity/Iceberg REST/Hive Metastore catalogs as a catalog in Daft so that it can find tables?
sess = Session.empty() | ||
# | ||
# setup. | ||
cat1 = Catalog.empty() | ||
cat2 = Catalog.empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's kind of un-pythonic to create an empty session or catalog like this. It is more common in languages like rust or jvm languages, in python, I'd expect to use kwargs (or the lack of) to create an empty catalog/session
sess = daft.Session()
cat1 = daft.Catalog()
cat2 = daft.Catalog()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also the import statements are kinda ugly. It'd be nice to reexport them at the top level daft
.
from daft.session import Session
->from daft import Session
from daft.catalog import Catalog
->from daft import Catalog
and with the constructor calls, it really cleans up the api
before
from daft.session import Session
from daft.catalog import Catalog
sess = Session.empty()
cat = Catalog.empty()
after
import daft
sess = daft.Session()
cat = daft.Catalog()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are exported at the top-level.
I'm not sure how about adding a concrete constructor to an abstract (or if it's possible in python?) which would make Catalog()
not possible. I've removed .empty()
in favor of Catalog.from_pydict
which I think addresses part of your concern. I can make Session have a default constructor.
I am curious to know what is the pythonic way to do named constructors for abstract classes? I was looking into this and top-level functions seem common, but daft already uses named constructors in other classes so I stuck with those. Jay and I also tried out a top-level factory method but as-is it was ambiguous so we opted to move it into the Catalog abc itself.
Here is what is implemented going to main.
import daft
sess = daft.Session()
cat = daft.Catalog.from_pydict({})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you can add a default implementation to an abstract class, and it's a pattern I've seen in other python projects (lancedb, polars, ...).
class Catalog(ABC):
"""Catalog documentation..."""
def __new__(cls, *args, **kwargs):
if cls is Catalog:
from daft.catalog.__memory import MemoryCatalog
return MemoryCatalog()
else:
return super().__new__(cls)
then you can construct it with a sensible default.
cat = daft.Catalog() # MemoryCatalog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaychia could you please chime in on a python way to handle creating a default catalog implementation? Given my oo background, concrete constructors on abstract classes don't compute and I would like some more input.
@@ -12,6 +17,91 @@ impl PySession { | |||
pub fn empty() -> Self { | |||
Self(Session::empty()) | |||
} | |||
|
|||
pub fn exec(&self, input: &str) -> PyResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this equivalent to the daft.sql
method? if so, can we continue calling it sql
. This is much more common in other libraries and will feel more natural for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and yes it should be sql
on the python side – exec was my mental default given my background with lower-level connection APIs and JDBC. DataFusion uses exec on the cli, but doesn't have an equivalent connection. But I know Spark and Polars use .sql. We want the spark model of having .sql on the session .
|
||
_tables: dict[str, Table] | ||
|
||
def __init__(self, tables: dict[str, Table]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def __init__(self, tables: dict[str, Table]): | |
def __init__(self, tables: dict[str, Table] = {}): |
Closing this now since it has been split into smaller pieces. |
…3820) This is PR [1/3] for integrating #3805 to Daft/main – it's primary purpose is to establish the stable APIs and prepare the cutover to Session for attach/detach and creating temporary tables. **Changes** 1. Define minimal Catalog and Table ABCs. 2. Internalize Catalog/Table implementations. 3. Marks deprecated APIs along with their alternatives. 4. Existing public classes are aliased to the internalized ones for backwards compatibility. Part [2/3] will integrate the session actions (from #3805) to replace deprecated MetaCatalog APIs.
…ng APIs [2/3] (#3825) This is PR [2/3] for integrating #3805 to Daft/main. This PR adds several implementations with basic end-to-end sanity tests. Most notably, this PR includes the bi-directional implementations for Catalogs and Tables. These enable us to use python-backed implementations in rust (for daft-connect and daft-sql) while also being able to return Catalog/Table python objects for use with the python APIs. **Changes** 1. Adds several internal factories for attach/detaching supported objects to the session. 2. Adds `Catalog.from_pydict` to cover basic needs and some backwards compatibility. 3. Adds `Identifier.from_str` for pyiceberg style identifiers (these are entirely optional). 4. Flesh out the [minimal session actions](https://github.com/Eventual-Inc/Daft/compare/rchowell/catalog-1-of-3...rchowell/catalog-2-of-3?expand=1#diff-fc2305e1560f8f7d5a974f58325f4b231bc638d6bacebbc795b6f8fb924114ccR1884-R1895) for a release. 5. Adds Bindings for a single name resolution abstraction, will come in handy for case-normalization in the near future. 6. Updates daft-connect and daft-sql to use `attach_table` Part [3/3] will cutover all deprecated APIs to use these new APIs and remove the unused artifacts. These APIs are enumerated in #3820
…ng APIs [2/3] (#3825) This is PR [2/3] for integrating #3805 to Daft/main. This PR adds several implementations with basic end-to-end sanity tests. Most notably, this PR includes the bi-directional implementations for Catalogs and Tables. These enable us to use python-backed implementations in rust (for daft-connect and daft-sql) while also being able to return Catalog/Table python objects for use with the python APIs. **Changes** 1. Adds several internal factories for attach/detaching supported objects to the session. 2. Adds `Catalog.from_pydict` to cover basic needs and some backwards compatibility. 3. Adds `Identifier.from_str` for pyiceberg style identifiers (these are entirely optional). 4. Flesh out the [minimal session actions](https://github.com/Eventual-Inc/Daft/compare/rchowell/catalog-1-of-3...rchowell/catalog-2-of-3?expand=1#diff-fc2305e1560f8f7d5a974f58325f4b231bc638d6bacebbc795b6f8fb924114ccR1884-R1895) for a release. 5. Adds Bindings for a single name resolution abstraction, will come in handy for case-normalization in the near future. 6. Updates daft-connect and daft-sql to use `attach_table` Part [3/3] will cutover all deprecated APIs to use these new APIs and remove the unused artifacts. These APIs are enumerated in #3820
This is a work in progress and will continue to push changes here. Awaiting #3794