
Inconsistent schema handling for int64 columns in Delta Table updated with pandas object type #3034

Closed
t1g0rz opened this issue Nov 27, 2024 · 3 comments · Fixed by #3050
Labels: bug (Something isn't working), good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@t1g0rz (Contributor)

t1g0rz commented Nov 27, 2024

Environment

Delta-rs version: 0.20.2 (also checked on 0.22.0)

Binding: python


Bug

What happened:
If an int64 column (I haven't checked other types) is declared in the Delta table schema and the table is updated from a pandas DataFrame where that column has object dtype, the underlying Parquet file stores the data as string. Querying the table, however, reports the schema as int64, and the returned data is also of int64 type.
This is an inconsistency: the declared schema and the physical Parquet data disagree. Please see the minimal reproducible example (MRE) below.

What you expected to happen:
I expect it to:

  • throw an exception, as it does when the types are completely incompatible (e.g., bool and string): DeltaError: Generic DeltaTable error: type_coercion; or
  • cast the data to the types specified in the schema

How to reproduce it:

from deltalake import DeltaTable
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


DeltaTable.create(
    "tmp",
    schema=pa.schema([pa.field("something", pa.int64())]),
)

dt = DeltaTable("tmp")
dt.merge(
    pd.DataFrame({"something": map(str, range(10))}),
    predicate="s.something = t.something",
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()

dt = DeltaTable("tmp")
print("pd:", dt.to_pandas().dtypes, "\n")
print("delta:", dt.schema(), "\n")
print("pa dataset:", dt.to_pyarrow_dataset().schema, "\n")
print("-----")
print("parquet:", pq.read_table("tmp/").schema, "\n")

Output:

pd: something    int64

delta: Schema([Field(something, PrimitiveType("long"), nullable=True)]) 

pa dataset: something: int64 

-----
parquet: something: string 

More details:
If one tries to scan such a delta table with polars>=1.13.0, they will see a SchemaError

pl.scan_delta("tmp").collect() # SchemaError: dtypes differ for column something: Utf8View != Int64
t1g0rz added the bug label Nov 27, 2024
@ion-elgreco (Collaborator)

It would probably be solved if we just added an explicit schema cast at the end of merge.

@ion-elgreco (Collaborator)

@t1g0rz would you mind adding a fix for this?

I think just adding a cast after the projection in the MERGE execution should do the trick

ion-elgreco added the help wanted and good first issue labels Dec 7, 2024
@t1g0rz (Contributor, Author)

t1g0rz commented Dec 10, 2024

take
I'll try to fix it, but I should warn you that my experience with Rust is limited :)
