
Inconsistent schema handling for int64 columns in Delta Table updated with pandas object type #3034

Closed
t1g0rz opened this issue Nov 27, 2024 · 3 comments · Fixed by #3050
Labels: bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@t1g0rz (Contributor)

t1g0rz commented Nov 27, 2024

Environment

Delta-rs version: 0.20.2 (also checked on 0.22.0)

Binding: python


Bug

What happened:
If an int64 column (I haven't checked other types) is declared in the Delta table schema and the table is updated from a pandas DataFrame where that column has object dtype, the underlying Parquet file stores the data as string. Querying the table, however, reports the schema as int64, and the returned data is also of int64 type.
This is an inconsistency: the declared schema and the physical Parquet data disagree. Please see the minimal reproducible example (MRE) below.

What you expected to happen:
I expect it to:

  • throw an exception, as it does when the types are completely incompatible (e.g., bool and string): DeltaError: Generic DeltaTable error: type_coercion; or
  • cast the data to the types specified in the schema

How to reproduce it:

from deltalake import DeltaTable
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


DeltaTable.create(
    "tmp",
    schema=pa.schema([pa.field("something", pa.int64())]),
)

dt = DeltaTable("tmp")
dt.merge(
    pd.DataFrame({"something": map(str, range(10))}),
    predicate="s.something = t.something",
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()

dt = DeltaTable("tmp")
print("pd:", dt.to_pandas().dtypes, "\n")
print("delta:", dt.schema(), "\n")
print("pa dataset:", dt.to_pyarrow_dataset().schema, "\n")
print("-----")
print("parquet:", pq.read_table("tmp/").schema, "\n")

Output:

pd: something    int64

delta: Schema([Field(something, PrimitiveType("long"), nullable=True)]) 

pa dataset: something: int64 

-----
parquet: something: string 

More details:
If one tries to scan such a delta table with polars>=1.13.0, they will see a SchemaError

pl.scan_delta("tmp").collect() # SchemaError: dtypes differ for column something: Utf8View != Int64
t1g0rz added the bug label Nov 27, 2024
@ion-elgreco (Collaborator)

It would probably be solved if we just added an explicit schema cast at the end of merge.

@ion-elgreco (Collaborator)

@t1g0rz would you mind adding a fix for this?

I think just adding a cast after the projection in the MERGE execution should do the trick

ion-elgreco added the help wanted and good first issue labels Dec 7, 2024
@t1g0rz (Contributor, Author)

t1g0rz commented Dec 10, 2024

take
I'll try to fix it, but I should warn you that my experience with Rust is limited :)
