Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Adds daft-catalog, daft-table and daft-session for catalog integrations. #3805

Closed
wants to merge 10 commits into from

Conversation

rchowell
Copy link
Contributor

This is a work in progress and will continue to push changes here. Awaiting #3794

@github-actions github-actions bot added the feat label Feb 13, 2025
Copy link

codspeed-hq bot commented Feb 13, 2025

CodSpeed Performance Report

Merging #3805 will degrade performances by 36.82%

Comparing rchowell/catalog (64e5564) with main (87378cd)

Summary

⚡ 1 improvements
❌ 2 regressions
✅ 24 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark BASE HEAD Change
test_count[1 Small File] 3.7 ms 3.3 ms +12.74%
test_iter_rows_first_row[100 Small Files] 184.4 ms 291.9 ms -36.82%
test_show[100 Small Files] 15.6 ms 23.5 ms -33.63%

@rchowell rchowell requested a review from jaychia February 15, 2025 01:39
Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, left some comments and questions about backward compatibility/how to move our users off of all our current Catalog interfaces to the new one

Also wrt examples, I'm most curious about e2e examples from the user perspective for these -- I understand not everything is implemented yet, but would love to maybe have them in a doc where we can just view the overall UX from a Daft user perspective.

  1. For a user that just imports Daft and "has their catalogs configured for them already" how do they:
    • List available tables?
    • Read a table into a dataframe?
    • Write into a table (append vs overwrite vs overwrite-dynamic)?
    • Upsert? (this one is complicated, open to just stubbing APIs if we can at least show the user flow)
  2. For a user that would like to configure their session/catalog programmatically:
    • How do I register my AWS Glue/Unity/Iceberg REST/Hive Metastore catalogs as a catalog in Daft so that it can find tables?

Comment on lines +68 to +72
sess = Session.empty()
#
# setup.
cat1 = Catalog.empty()
cat2 = Catalog.empty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's kind of un-pythonic to create an empty session or catalog like this. It is more common in languages like rust or jvm languages, in python, I'd expect to use kwargs (or the lack of) to create an empty catalog/session

sess = daft.Session()

cat1 = daft.Catalog()
cat2 = daft.Catalog()

Copy link
Contributor

@universalmind303 universalmind303 Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also the import statements are kinda ugly. It'd be nice to reexport them at the top level daft.

  • from daft.session import Session -> from daft import Session
  • from daft.catalog import Catalog -> from daft import Catalog

and with the constructor calls, it really cleans up the api

before

from daft.session import Session
from daft.catalog import Catalog

sess = Session.empty()
cat = Catalog.empty()

after

import daft

sess = daft.Session()
cat = daft.Catalog()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are exported at the top-level.

I'm not sure how about adding a concrete constructor to an abstract (or if it's possible in python?) which would make Catalog() not possible. I've removed .empty() in favor of Catalog.from_pydict which I think addresses part of your concern. I can make Session have a default constructor.

I am curious to know what is the pythonic way to do named constructors for abstract classes? I was looking into this and top-level functions seem common, but daft already uses named constructors in other classes so I stuck with those. Jay and I also tried out a top-level factory method but as-is it was ambiguous so we opted to move it into the Catalog abc itself.

Here is what is implemented going to main.

import daft

sess = daft.Session()
cat = daft.Catalog.from_pydict({})

Copy link
Contributor

@universalmind303 universalmind303 Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you can add a default implementation to an abstract class, and it's a pattern I've seen in other python projects (lancedb, polars, ...).

class Catalog(ABC):
    """Catalog documentation..."""
    def __new__(cls, *args, **kwargs):
        if cls is Catalog:
            from daft.catalog.__memory import MemoryCatalog
            return MemoryCatalog()
        else:
            return super().__new__(cls)

then you can construct it with a sensible default.

cat = daft.Catalog() # MemoryCatalog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaychia could you please chime in on a python way to handle creating a default catalog implementation? Given my oo background, concrete constructors on abstract classes don't compute and I would like some more input.

@@ -12,6 +17,91 @@ impl PySession {
pub fn empty() -> Self {
Self(Session::empty())
}

pub fn exec(&self, input: &str) -> PyResult<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this equivalent to the daft.sql method? if so, can we continue calling it sql. This is much more common in other libraries and will feel more natural for users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and yes it should be sql on the python side – exec was my mental default given my background with lower-level connection APIs and JDBC. DataFusion uses exec on the cli, but doesn't have an equivalent connection. But I know Spark and Polars use .sql. We want the spark model of having .sql on the session .


_tables: dict[str, Table]

def __init__(self, tables: dict[str, Table]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def __init__(self, tables: dict[str, Table]):
def __init__(self, tables: dict[str, Table] = {}):

@rchowell rchowell closed this Feb 19, 2025
rchowell added a commit that referenced this pull request Feb 20, 2025
…3820)

This is PR [1/3] for integrating #3805 to Daft/main – it's primary
purpose is to establish the stable APIs and prepare the cutover to
Session for attach/detach and creating temporary tables.

**Changes**
1. Define minimal Catalog and Table ABCs.
2. Internalize Catalog/Table implementations.
3. Marks deprecated APIs along with their alternatives.
4. Existing public classes are aliased to the internalized ones for
backwards compatibility.

Part [2/3] will integrate the session actions (from #3805) to replace
deprecated MetaCatalog APIs.
rchowell added a commit that referenced this pull request Feb 21, 2025
…ng APIs [2/3] (#3825)

This is PR [2/3] for integrating #3805 to Daft/main. This PR adds
several implementations with basic end-to-end sanity tests. Most
notably, this PR includes the bi-directional implementations for
Catalogs and Tables. These enable us to use python-backed
implementations in rust (for daft-connect and daft-sql) while also being
able to return Catalog/Table python objects for use with the python
APIs.

**Changes**
1. Adds several internal factories for attach/detaching supported
objects to the session.
2. Adds `Catalog.from_pydict` to cover basic needs and some backwards
compatibility.
3. Adds `Identifier.from_str` for pyiceberg style identifiers (these are
entirely optional).
4. Flesh out the [minimal session
actions](https://github.com/Eventual-Inc/Daft/compare/rchowell/catalog-1-of-3...rchowell/catalog-2-of-3?expand=1#diff-fc2305e1560f8f7d5a974f58325f4b231bc638d6bacebbc795b6f8fb924114ccR1884-R1895)
for a release.
5. Adds Bindings for a single name resolution abstraction, will come in
handy for case-normalization in the near future.
6. Updates daft-connect and daft-sql to use `attach_table` 

Part [3/3] will cutover all deprecated APIs to use these new APIs and
remove the unused artifacts. These APIs are enumerated in #3820
kevinzwang pushed a commit that referenced this pull request Feb 21, 2025
…ng APIs [2/3] (#3825)

This is PR [2/3] for integrating #3805 to Daft/main. This PR adds
several implementations with basic end-to-end sanity tests. Most
notably, this PR includes the bi-directional implementations for
Catalogs and Tables. These enable us to use python-backed
implementations in rust (for daft-connect and daft-sql) while also being
able to return Catalog/Table python objects for use with the python
APIs.

**Changes**
1. Adds several internal factories for attach/detaching supported
objects to the session.
2. Adds `Catalog.from_pydict` to cover basic needs and some backwards
compatibility.
3. Adds `Identifier.from_str` for pyiceberg style identifiers (these are
entirely optional).
4. Flesh out the [minimal session
actions](https://github.com/Eventual-Inc/Daft/compare/rchowell/catalog-1-of-3...rchowell/catalog-2-of-3?expand=1#diff-fc2305e1560f8f7d5a974f58325f4b231bc638d6bacebbc795b6f8fb924114ccR1884-R1895)
for a release.
5. Adds Bindings for a single name resolution abstraction, will come in
handy for case-normalization in the near future.
6. Updates daft-connect and daft-sql to use `attach_table` 

Part [3/3] will cutover all deprecated APIs to use these new APIs and
remove the unused artifacts. These APIs are enumerated in #3820
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants