
pyarrow DictionaryArray as partition column for write_deltalake fails #2969

Open · jorritsandbrink opened this issue Nov 1, 2024 · 9 comments
Labels: bug (Something isn't working), on-hold (Issues and Pull Requests that are on hold for some reason)
Milestone: Rust v1.0.0

@jorritsandbrink

Environment

Delta-rs version: 0.21.0

Binding: python

Environment: local, WSL2, Ubuntu 24.04.1 LTS


Bug

What happened:
write_deltalake raises _internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse when a pyarrow DictionaryArray is used as the partition column.

What you expected to happen:
Successful write.

How to reproduce it:

import pyarrow as pa
from deltalake import write_deltalake

# pyarrow.lib.DictionaryArray
array = pa.array(["a", "b", "c"], type=pa.dictionary(pa.int8(), pa.string()))

data = {
    "foo": [1, 2, 3],
    "bar": [1, 2, 3],
    "baz": array,
    # "baz": ["a", "b", "c"],  # using this instead works
}
table = pa.table(data)

# write to partitioned delta table
write_deltalake("my_delta_table", table, partition_by="baz")

# _internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse

More details:

Traceback (most recent call last):
  File "/home/j/repos/dlt/mre.py", line 16, in <module>
    write_deltalake("my_delta_table", table, partition_by="baz")
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/writer.py", line 323, in write_deltalake
    write_deltalake_rust(
_internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse
@leanadah

Hi there, has this been resolved?

@ion-elgreco
Collaborator

@leanadah @jorritsandbrink

This is caused by Scalar::from_array:

Scalar::from_array(&col.slice(range.start, range.end - range.start), 0).ok_or(
    DeltaWriterError::MissingPartitionColumn("failed to parse".into()),
)

which comes from delta-kernel-rs. Please create an upstream issue there: https://github.com/delta-io/delta-kernel-rs.

CC @nicklan @zachschuermann @hntd187

@ion-elgreco ion-elgreco added the on-hold Issues and Pull Requests that are on hold for some reason label Nov 24, 2024
@rtyler rtyler added this to the Rust v1.0.0 milestone Nov 24, 2024
@roeap
Collaborator

roeap commented Dec 11, 2024

I'm not entirely sure anymore, and I could not find an explicit mention in the protocol at a quick glance, but I do believe that complex types are not supported for partition values.

The only "hint" I could find, though, is that the documentation for partition value serialization omits these complex types.

@ion-elgreco
Collaborator

@roeap then we could add a simple check and raise if those columns are complex types
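
For illustration, a rough sketch of what such a pre-write check could look like on the Python side; the function name and the exact rejection criteria are illustrative, not existing delta-rs code:

import pyarrow as pa

def validate_partition_columns(schema: pa.Schema, partition_by: list) -> None:
    # Hypothetical check: reject partition columns whose type cannot be
    # serialized to a partition-value string (dictionary or nested types).
    for name in partition_by:
        field_type = schema.field(name).type
        if pa.types.is_dictionary(field_type) or pa.types.is_nested(field_type):
            raise ValueError(
                f"Column '{name}' has type {field_type}, which cannot be used "
                "as a Delta partition column."
            )

A check like this, run before the Rust writer is invoked, would turn the opaque "Missing partition column: failed to parse" into an actionable error message.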

@roeap
Collaborator

roeap commented Dec 12, 2024

Absolutely - I do believe we should try to understand what is happening today, though. From the report it seems this currently sometimes works, even though both approaches theoretically represent the same data.

Probably we do not match on dict-encoded arrays somewhere...

It would be unfortunate if many people already use that in the wild - i.e. it does work somehow. In that case maybe we emit a warning for now and break in 1.0?

@hntd187
Collaborator

hntd187 commented Dec 12, 2024

I believe complex types have never been supported, even in old Hive-style tables. Complex types don't have directly discernible equality and ordering. Is there a use case you are trying to solve here, @jorritsandbrink, where a complex-type partition column was necessary?

@jorritsandbrink
Author

@hntd187 We have a load identifier (string) that is unique for each pipeline run. We store it in a column alongside the data. Records loaded in the same run all have the same value in the load identifier column. Using dictionary encoding for this column reduces data size a lot. Our queries filter on the load identifier and we'd like to partition on that column for data skipping.

@hntd187
Collaborator

hntd187 commented Dec 13, 2024

So partition values are not physically stored in parquet files; they are normally kept in the Delta log and projected into the data on a per-partition basis. Instead of dictionary encoding, I would try a standard string column and partition on that. If you create a table without the partitioning, you should notice the strings are kept in the physical parquet files, so by just using normal string partitioning you get more or less the same benefits.

The problem, I think, as Robert mentioned above, is that partition values have to be string serializable, and I do not think a dictionary array has an obvious way of being string serialized, but I might be wrong here.
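
As a concrete sketch of this workaround, assuming the same example table from the issue (the cast-to-string step is standard pyarrow; the table path is just a placeholder):

import pyarrow as pa
import pyarrow.compute as pc
from deltalake import write_deltalake

array = pa.array(["a", "b", "c"], type=pa.dictionary(pa.int8(), pa.string()))
table = pa.table({"foo": [1, 2, 3], "bar": [1, 2, 3], "baz": array})

# Decode the dictionary-encoded column back to a plain string column.
# Partition values live in the Delta log rather than in the parquet files,
# so dictionary encoding the partition column does not save space there.
baz_index = table.schema.get_field_index("baz")
table = table.set_column(baz_index, "baz", pc.cast(table["baz"], pa.string()))

# Partitioning on the plain string column writes successfully.
write_deltalake("my_delta_table", table, partition_by="baz")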

@jorritsandbrink
Author

So partition values are not physically stored in parquet files; they are normally kept in the Delta log and projected into the data on a per-partition basis.

Right! Completely overlooked that.

Instead of dictionary encoding, I would try a standard string column and partition on that. If you create a table without the partitioning, you should notice the strings are kept in the physical parquet files, so by just using normal string partitioning you get more or less the same benefits.

We are indeed using a regular string column instead, and it's good to know that it probably won't negatively impact performance.

In that case, there is no clear use case for having a DictionaryArray as a partition column - at least not from my end.
