spark connect fails when performing a .show() #3498

Open
universalmind303 opened this issue Dec 5, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@universalmind303
Contributor

universalmind303 commented Dec 5, 2024

Describe the bug

# %%
import daft
from daft.daft import connect_start
from pyspark.sql import SparkSession


server = connect_start()
url = f"sc://localhost:{server.port()}"
session = SparkSession.builder.appName("DaftConfigTest").remote(url).getOrCreate()


session.createDataFrame([("cory", 100)], ["name", "age"]).show()

results in "Error in Daft server: Unsupported relation type: ShowString"

To Reproduce

No response

Expected behavior

No response

Component(s)

Other

Additional context

It appears that our show logic currently exists purely in Python. As a prerequisite, we'll need to refactor that logic into Rust so that it can be used from Spark Connect.

@andrewgazelka
Member

andrewgazelka commented Dec 6, 2024

Also note that createDataFrame with strings is currently bugged (I am working on fixing it).

Can we just use Display rust impl for now?

@universalmind303
Contributor Author

Can we just use Display rust impl for now?

No, the Display output of the logical plan is equivalent to df.explain(): it describes the plan rather than the data.

df.show() prints a small sample of the materialized dataset, similar to df.limit(10).collect().
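
To make the distinction concrete, here is a minimal sketch in plain Python. `ToyPlan` and `show` are hypothetical stand-ins invented for illustration, not Daft's actual classes or API; the point is only that a Display-style string describes the plan, while a show-style call has to materialize rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyPlan:
    """Hypothetical stand-in for a lazy logical plan (not Daft's real class)."""

    rows: List[Tuple]

    def __str__(self) -> str:
        # Display-style output: a description of the plan, with no row values.
        return f"InMemorySource(num_rows={len(self.rows)})"

    def collect(self) -> List[Tuple]:
        # Materialization: actually produces the rows.
        return list(self.rows)


def show(plan: ToyPlan, n: int = 10) -> str:
    """Materialize at most n rows and render them, one row per line."""
    sample = plan.collect()[:n]
    return "\n".join(" | ".join(str(value) for value in row) for row in sample)


plan = ToyPlan(rows=[("cory", 100), ("sam", 42), ("ana", 7)])
plan_text = str(plan)         # what a Display impl gives: the plan
sample_text = show(plan, 2)   # what show() should give: a sample of the data
```

Here `plan_text` contains no row values at all, which is why a Display implementation alone cannot back `.show()`.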

@universalmind303
Contributor Author

@andrewgazelka I can take on this one.

universalmind303 added a commit that referenced this issue Dec 9, 2024
Most of this is ported from the Python implementation inside
`daft/runners/partitioning.py`.


### Note for reviewer

For context on why this is needed: the `DataFrame` class uses
`PartitionSet` extensively for common operations such as `show`
and `collect`. To add this functionality to our Spark Connect
implementation, we need a similar construct in Rust.

Ideally, I'd like to port the Python implementation over to use this new
Rust one, but there are still a few things that I'm not entirely sure
how to implement (such as `RayPartitionSet`).

Not all of the methods inside `partitioning.rs` are used yet, but I
intend to follow up this PR with an implementation for
#3498, and this is a
prerequisite as `show` relies on `get_preview_micropartitions`.
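
As a rough illustration of why `show` needs a preview step over partitions, the sketch below gathers only the first few rows across an ordered list of partitions and renders them as a text table. It is plain Python with hypothetical names (`preview_partitions`, `render_table`); it does not reproduce the signature or behavior of the real `get_preview_micropartitions`, and partitions are modeled as simple lists of tuples.

```python
from typing import List, Sequence, Tuple

Row = Tuple
Partition = List[Row]


def preview_partitions(partitions: Sequence[Partition], num_rows: int) -> List[Partition]:
    """Walk partitions in order and keep just enough rows to reach num_rows.

    Hypothetical helper: stops as soon as the preview is full, so later
    partitions are never touched, and the last kept partition is trimmed.
    """
    preview: List[Partition] = []
    remaining = num_rows
    for partition in partitions:
        if remaining <= 0:
            break
        taken = partition[:remaining]
        if taken:
            preview.append(taken)
            remaining -= len(taken)
    return preview


def render_table(columns: Sequence[str], partitions: Sequence[Partition]) -> str:
    """Render the previewed rows as a simple bordered text table."""
    rows = [row for partition in partitions for row in partition]
    cells = [[str(c) for c in columns]] + [[str(v) for v in row] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(columns))]

    def fmt(line: List[str]) -> str:
        return "| " + " | ".join(v.ljust(w) for v, w in zip(line, widths)) + " |"

    sep = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    return "\n".join([sep, fmt(cells[0]), sep, *(fmt(line) for line in cells[1:]), sep])


partitions = [[("cory", 100), ("sam", 42)], [("ana", 7), ("lee", 9)]]
preview = preview_partitions(partitions, 3)
table = render_table(["name", "age"], preview)
print(table)
```

Keeping the preview as a list of trimmed partitions, rather than flattening everything first, mirrors the idea that a partition set can hand back a small sample without collecting the whole dataset.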