
improve conversion between arrow and daft types #3605

Open
andrewgazelka opened this issue Dec 18, 2024 · 2 comments
andrewgazelka commented Dec 18, 2024

See https://github.com/Eventual-Inc/Daft/pull/3602/files#r1890934526

  • We want to add conversion logic for all type combinations at some point,
    instead of special-casing just Utf8.

Right now this seems not to work for nested types. For instance:

def test_print_schema_nested(spark_session) -> None:
    nested_data = [(1, {"name": "John", "age": 30}), (2, {"name": "Jane", "age": 25})]
    # Create DataFrame with nested structures
    df = spark_session.createDataFrame(nested_data, ["id", "info"])
    # Print schema
    print("DataFrame Schema:")
    df.printSchema()
We still get:

thread 'tokio-runtime-worker' panicked at src/arrow2/src/array/utf8/mod.rs:518:5:
Cannot change array type from Utf8 to LargeUtf8
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
@universalmind303
Contributor

Note: this logic is incorrect. https://github.com/universalmind303/Daft/blob/07f6b2c82ad5f8a288e6c48db58b23e1d105c558/src/daft-core/src/array/ops/from_arrow.rs#L86

We can't just call `.convert_logical_type`; we need a physical cast to change the offsets. This seems to be the same issue as with Utf8/LargeUtf8: not all branches go through the DataArray path implemented in #3602.

@universalmind303
Contributor

So it turns out Spark Connect does not support large_lists at all: https://issues.apache.org/jira/browse/SPARK-50626.

I really don't want to have to physically cast both on spark -> daft and daft -> spark, but we may need to do that for now until this issue is resolved.
