
pyarrow DictionaryArray as partition column for write_deltalake fails #2969

Open · jorritsandbrink opened this issue Nov 1, 2024 · 9 comments
Labels: bug (Something isn't working), on-hold (Issues and Pull Requests that are on hold for some reason)
Milestone: Rust v1.0.0

@jorritsandbrink

Environment

Delta-rs version: 0.21.0

Binding: python

Environment: local, WSL2, Ubuntu 24.04.1 LTS


Bug

What happened:
write_deltalake raises _internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse when a pyarrow DictionaryArray is used as the partition column.

What you expected to happen:
Successful write.

How to reproduce it:

import pyarrow as pa
from deltalake import write_deltalake

# pyarrow.lib.DictionaryArray
array = pa.array(["a", "b", "c"], type=pa.dictionary(pa.int8(), pa.string()))

data = {
    "foo": [1, 2, 3],
    "bar": [1, 2, 3],
    "baz": array,
    # "baz": ["a", "b", "c"],  # using this instead works
}
table = pa.table(data)

# write to partitioned delta table
write_deltalake("my_delta_table", table, partition_by="baz")

# _internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse

More details:

Traceback (most recent call last):
  File "/home/j/repos/dlt/mre.py", line 16, in <module>
    write_deltalake("my_delta_table", table, partition_by="baz")
  File "/home/j/.cache/pypoetry/virtualenvs/dlt-2tG_aB2A-py3.9/lib/python3.9/site-packages/deltalake/writer.py", line 323, in write_deltalake
    write_deltalake_rust(
_internal.DeltaError: Generic DeltaTable error: Missing partition column: failed to parse
@leanadah

Hi there, has this been resolved?

@ion-elgreco
Collaborator

@leanadah @jorritsandbrink

This is caused by Scalar::from_array:

Scalar::from_array(&col.slice(range.start, range.end - range.start), 0).ok_or(
    DeltaWriterError::MissingPartitionColumn("failed to parse".into()),
)

which comes from delta-kernel-rs. Please create an upstream issue there: https://github.com/delta-io/delta-kernel-rs.

CC @nicklan @zachschuermann @hntd187

@ion-elgreco ion-elgreco added the on-hold Issues and Pull Requests that are on hold for some reason label Nov 24, 2024
@rtyler rtyler added this to the Rust v1.0.0 milestone Nov 24, 2024
@roeap
Collaborator

roeap commented Dec 11, 2024

I'm not entirely sure anymore, and I could not find an explicit mention in the protocol at a quick glance, but I do believe that complex types are not supported for partition values.

The only "hint" I could find, though, is that the documentation for partition value serialization omits these complex types.

@ion-elgreco
Collaborator

@roeap then we could add a simple check and raise if those columns are complex types
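
For illustration, a rough sketch of what such a pre-write check could look like on the Python side; the function name and the exact rejection criteria are illustrative, not existing delta-rs code:

import pyarrow as pa

def validate_partition_columns(schema: pa.Schema, partition_by: list) -> None:
    # Hypothetical check: reject partition columns whose type cannot be
    # serialized to a partition-value string (dictionary or nested types).
    for name in partition_by:
        field_type = schema.field(name).type
        if pa.types.is_dictionary(field_type) or pa.types.is_nested(field_type):
            raise ValueError(
                f"Column '{name}' has type {field_type}, which cannot be used "
                "as a Delta partition column."
            )

A check like this, run before the Rust writer is invoked, would turn the opaque "Missing partition column: failed to parse" into an actionable error message.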

@roeap
Collaborator

roeap commented Dec 12, 2024

Absolutely - I do believe we should try to understand what is happening today, though. From the report it seems this currently sometimes works, even though both approaches theoretically represent the same data.

Probably we do not match on dict-encoded arrays somewhere...

It would be unfortunate if many people already use that in the wild - i.e. it does work somehow. In that case maybe we emit a warning for now and break in 1.0?

@hntd187
Collaborator

hntd187 commented Dec 12, 2024

I believe complex types have never been supported, even in old Hive-style tables. Complex types don't have directly discernible equality and ordering. Is there a use case you are trying to solve here, @jorritsandbrink, where a complex-type partition column was necessary?

@jorritsandbrink
Author

@hntd187 We have a load identifier (string) that is unique for each pipeline run. We store it in a column alongside the data. Records loaded in the same run all have the same value in the load identifier column. Using dictionary encoding for this column reduces data size a lot. Our queries filter on the load identifier and we'd like to partition on that column for data skipping.

@hntd187
Collaborator

hntd187 commented Dec 13, 2024

So partition values are not physically stored in parquet files; they are normally kept in the Delta log and projected into the data on a per-partition basis. Instead of dictionary encoding, I would try a standard string column and partition on that. If you create a table without the partitioning, you should notice the strings are kept in the physical parquet files, so by just using normal string partitioning you get more or less the same benefits.

The problem, I think, as Robert mentioned above, is that partition values have to be string serializable, and I do not think a dictionary array has an obvious way of being string serialized, but I might be wrong here.
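
As a concrete sketch of this workaround, assuming the same example table from the issue (the cast-to-string step is standard pyarrow; the table path is just a placeholder):

import pyarrow as pa
import pyarrow.compute as pc
from deltalake import write_deltalake

array = pa.array(["a", "b", "c"], type=pa.dictionary(pa.int8(), pa.string()))
table = pa.table({"foo": [1, 2, 3], "bar": [1, 2, 3], "baz": array})

# Decode the dictionary-encoded column back to a plain string column.
# Partition values live in the Delta log rather than in the parquet files,
# so dictionary encoding the partition column does not save space there.
baz_index = table.schema.get_field_index("baz")
table = table.set_column(baz_index, "baz", pc.cast(table["baz"], pa.string()))

# Partitioning on the plain string column writes successfully.
write_deltalake("my_delta_table", table, partition_by="baz")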

@jorritsandbrink
Author

So partition values are not physically stored in parquet files; they are normally kept in the Delta log and projected into the data on a per-partition basis.

Right! Completely overlooked that.

Instead of dictionary encoding, I would try a standard string column and partition on that. If you create a table without the partitioning, you should notice the strings are kept in the physical parquet files, so by just using normal string partitioning you get more or less the same benefits.

We are indeed using a regular string column instead, and it's good to know that it probably won't negatively impact performance.

In that case, there is no clear use case for having a DictionaryArray as a partition column - at least not from my end.
