feat(datafusion): implement the project node to add the partition columns #1602
Conversation
Force-pushed from b3a8601 to 40a225a
…umns defined in Iceberg. Implement physical execution plan node that projects Iceberg partition columns from source data, supporting nested fields and all Iceberg transforms.
Force-pushed from 40a225a to 4d59f87
let field_path = Self::find_field_path(&self.table_schema, source_field.id)?;
let index_path = Self::resolve_arrow_index_path(batch_schema.as_ref(), &field_path)?;
let source_column = Self::extract_column_by_index_path(batch, &index_path)?;
This looks very interesting! I actually came across a similar issue when implementing the sort node, and I was leaning toward implementing a new SchemaWithPartnerVisitor, wdyt?
Perfect 👌
I was initially thinking this was needed just for this implementation, but it seems the right place would be closer to the Schema definition. Since this is a standard method for accessing column values by index, it makes sense to generalize!
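For context, here is a minimal sketch of the "schema with partner" idea being discussed: walk a nested schema in lockstep with an Arrow "partner" (here an ArrayRef) so callers can hook into each field without re-implementing index bookkeeping. The trait and helper below are illustrative only, not the actual SchemaWithPartnerVisitor API from iceberg-rust.

```rust
use arrow_array::{Array, ArrayRef, StructArray};
use arrow_schema::{DataType, Field};

/// Illustrative visitor: called for every (field, partner column) pair,
/// including members of nested structs.
trait FieldWithPartnerVisitor {
    fn visit_field(&mut self, path: &[String], field: &Field, partner: &ArrayRef);
}

/// Drive the visitor over one field and, if it is a struct, over its children.
fn visit_with_partner<V: FieldWithPartnerVisitor>(
    path: &mut Vec<String>,
    field: &Field,
    partner: &ArrayRef,
    visitor: &mut V,
) {
    path.push(field.name().to_string());
    visitor.visit_field(path, field, partner);
    if let DataType::Struct(children) = field.data_type() {
        // Pair each child field with the matching child column of the struct array.
        let struct_array = partner
            .as_any()
            .downcast_ref::<StructArray>()
            .expect("struct field must be backed by a StructArray");
        for (child_field, child_column) in children.iter().zip(struct_array.columns()) {
            visit_with_partner(path, child_field, child_column, visitor);
        }
    }
    path.pop();
}
```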
I drafted a PartitionValueVisitor here to help extract partition values from a record batch in tree-traversal style. Please let me know what you think!
I just saw this implementation to extract partition values, and it actually makes more sense to me that it leverages the existing RecordBatchProjector: #1040
Good, thanks for sharing. I will use #1040 when merged!
Hey @CTTY 👋,
I can use it now, but I have one concern about leveraging RecordBatchPartitionSplitter: it relies on PARQUET_FIELD_ID_META_KEY.
Since DataFusion doesn't use this key, do you think we should adapt this method to make it compatible with DataFusion?
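One possible workaround, sketched under assumptions: re-attach the "PARQUET:field_id" metadata entry to the DataFusion-produced Arrow schema before handing it to id-based utilities. The function name and the name-to-id map below are hypothetical; only top-level fields are handled to keep the example short.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_schema::{Field, Schema};

const PARQUET_FIELD_ID_META_KEY: &str = "PARQUET:field_id";

/// Return a copy of `schema` where every field known to the table gets its
/// Iceberg field id stored under the PARQUET:field_id metadata key.
fn with_field_ids(schema: &Schema, name_to_id: &HashMap<String, i32>) -> Schema {
    let fields: Vec<Arc<Field>> = schema
        .fields()
        .iter()
        .map(|field| match name_to_id.get(field.name()) {
            Some(id) => {
                // Keep any existing metadata and add (or overwrite) the field id entry.
                let mut metadata = field.metadata().clone();
                metadata.insert(PARQUET_FIELD_ID_META_KEY.to_string(), id.to_string());
                Arc::new(field.as_ref().clone().with_metadata(metadata))
            }
            None => Arc::clone(field),
        })
        .collect();
    Schema::new(fields)
}
```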
}

/// Find the path to a field by its ID (e.g., ["address", "city"]) in the Iceberg schema
fn find_field_path(table_schema: &Schema, field_id: i32) -> DFResult<Vec<String>> {
…n containing all the partitions values
Force-pushed from 7f8404f to 7be558d
Force-pushed from 57fe2dd to bc805db
Hi @fvaleye, is this PR ready for review, or does it still need more work?
Hi @liurenjie1024 👋, yes, it's ready for review. However, it might require additional refactoring if we want to make these utility functions more general.
Thanks @fvaleye for this PR! I left some comments to improve, and I still have other questions:
- What's the entry point of this module?
- Could the entry point of this module be a function like the one below?
fn project_with_partition(input: &ExecutionPlan, table: &Table) -> Result<Arc<dyn ExecutionPlan>> {
    // This method extends `input` with an extra `PhysicalExpr`, which calculates the partition value.
    ...
}
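As a rough illustration of how such an entry point could wrap the input plan, the sketch below forwards the existing columns through a DataFusion ProjectionExec and appends one extra PhysicalExpr for the partition values. The `partition_expr` parameter and the "_partition" column name are stand-ins; building the expression from the Iceberg `Table` is omitted.

```rust
use std::sync::Arc;

use datafusion::error::Result;
use datafusion::physical_plan::expressions::Column;
use datafusion::physical_plan::projection::ProjectionExec;
use datafusion::physical_plan::{ExecutionPlan, PhysicalExpr};

fn project_with_partition(
    input: Arc<dyn ExecutionPlan>,
    partition_expr: Arc<dyn PhysicalExpr>,
) -> Result<Arc<dyn ExecutionPlan>> {
    let schema = input.schema();
    // Forward every existing column unchanged.
    let mut exprs: Vec<(Arc<dyn PhysicalExpr>, String)> = schema
        .fields()
        .iter()
        .enumerate()
        .map(|(idx, field)| {
            (
                Arc::new(Column::new(field.name(), idx)) as Arc<dyn PhysicalExpr>,
                field.name().to_string(),
            )
        })
        .collect();
    // Append the computed partition column at the end.
    exprs.push((partition_expr, "_partition".to_string()));
    Ok(Arc::new(ProjectionExec::try_new(exprs, input)?))
}
```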
/// Extract a column from a record batch by following an index path.
/// The index path specifies the column indices to traverse for nested structures.
#[allow(dead_code)]
fn extract_column_by_index_path(batch: &RecordBatch, index_path: &[usize]) -> DFResult<ArrayRef> {
Could we reuse RecordBatchProjection?
I tried, but I kept this implementation; the main reasons are below:
1. Metadata dependency:
   - RecordBatchProjector depends on Arrow field metadata containing PARQUET:field_id.
   - This metadata is added when reading Parquet files through Iceberg's reader.
   - DataFusion ExecutionPlans might not always have this metadata preserved.
2. Using the Iceberg table's schema directly:
   - We resolve field paths using field names, not IDs.
   - This works regardless of whether Arrow metadata is present (see the sketch below).
Depending on what you think, we could either:
- keep this implementation working with DataFusion, or
- readapt RecordBatchProjection, but it feels like it's not the same intent.
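A small sketch of the name-based resolution described above: map a field-name path such as ["address", "city"] onto positional indices in the Arrow schema, descending into nested structs, with no reliance on field-id metadata. The function name and error handling are simplified for illustration.

```rust
use arrow_schema::{DataType, Fields, Schema};

/// Resolve a name path like ["address", "city"] into column indices,
/// descending through nested struct fields by name.
fn resolve_index_path(schema: &Schema, name_path: &[String]) -> Option<Vec<usize>> {
    let mut indices = Vec::with_capacity(name_path.len());
    let mut current: Fields = schema.fields().clone();
    for (depth, name) in name_path.iter().enumerate() {
        // Find the child with this name at the current nesting level.
        let position = current.iter().position(|f| f.name() == name)?;
        indices.push(position);
        if depth + 1 < name_path.len() {
            // Keep descending only through struct fields.
            match current[position].data_type() {
                DataType::Struct(children) => current = children.clone(),
                _ => return None,
            }
        }
    }
    Some(indices)
}
```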
I'm not convinced. There are two ways to solve your issue:
- Add a constructor in RecordBatchProjector to accept an Iceberg schema and target field ids.
- Convert the Iceberg schema to an Arrow schema; the converter will add field_id metadata.
Personally I prefer approach 1, but I don't have a strong opinion about it. After using RecordBatchProjector, the whole PR could be simplified a lot.
…use PhysicalExpr for partitions values calculation. Signed-off-by: Florian Valeye <[email protected]>
Signed-off-by: Florian Valeye <[email protected]>
…ation Signed-off-by: Florian Valeye <[email protected]>
Force-pushed from edb4719 to d4fd336
impl std::hash::Hash for PartitionExpr {
    fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
        std::any::TypeId::of::<Self>().hash(state);
This is a little odd; why not derive Hash for PartitionValueCalculator?
I refactored to keep Hash consistent with PartialEq.
For now, we can't derive Hash for PartitionValueCalculator because it contains:
- partition_type: DataType: Arrow's DataType does not implement Hash
- projector: RecordBatchProjector: does not implement Hash
- transform_functions: Vec<BoxedTransformFunction>: trait objects cannot be hashed
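For reference, an illustrative sketch (not the PR's code) of keeping a manual Hash in sync with PartialEq when some fields cannot be hashed: compare on the cheap, hashable parts and reuse exactly those parts in `hash`, so `a == b` always implies equal hashes. The field names below are hypothetical stand-ins for the calculator discussed here.

```rust
use std::hash::{Hash, Hasher};

struct PartitionValueCalculator {
    /// Name of the output partition column; hashable and part of equality.
    output_name: String,
    /// Source field ids the calculator reads; hashable and part of equality.
    source_field_ids: Vec<i32>,
    /// Stand-in for state that implements neither Hash nor PartialEq
    /// (e.g. boxed transform functions); excluded from both impls.
    transforms: Vec<Box<dyn Fn(i64) -> i64>>,
}

impl PartialEq for PartitionValueCalculator {
    fn eq(&self, other: &Self) -> bool {
        self.output_name == other.output_name && self.source_field_ids == other.source_field_ids
    }
}

impl Eq for PartitionValueCalculator {}

impl Hash for PartitionValueCalculator {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Hash exactly the fields used by PartialEq so the Hash/Eq contract holds.
        self.output_name.hash(state);
        self.source_field_ids.hash(state);
    }
}
```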
…e schemas Signed-off-by: Florian Valeye <[email protected]>
/// This ensures that:
/// - All fields in the input schema have matching names in the table schema
/// - The Arrow data types are compatible with the corresponding Iceberg types
fn validate_schema_compatibility(
There are several problems with this implementation:
- It only checks that the Arrow schema is a subset of the Iceberg schema, but we need to ensure that they are exactly the same.
- It doesn't check the order of fields.
- It doesn't take the metadata of nested data types into account.
I would suggest another implementation:
- Create a function that visits the Arrow schema recursively to remove the metadata of all fields, including nested fields.
- Remove the metadata of the input Arrow schema and the converted Arrow schema, then compare them using the built-in == of Arrow fields.
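A hedged sketch of that suggested comparison, under assumptions: recursively drop metadata from every field (including children of nested types) and then rely on the built-in equality of Arrow schemas. Only struct, list, and map nesting is handled here; other nested types would follow the same pattern, and the function names are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_schema::{DataType, Field, Fields, Schema};

/// Rebuild a field with empty metadata, recursing into nested children.
fn strip_field_metadata(field: &Field) -> Field {
    let data_type = match field.data_type() {
        DataType::Struct(children) => DataType::Struct(strip_fields_metadata(children)),
        DataType::List(child) => DataType::List(Arc::new(strip_field_metadata(child))),
        DataType::Map(entries, sorted) => {
            DataType::Map(Arc::new(strip_field_metadata(entries)), *sorted)
        }
        other => other.clone(),
    };
    Field::new(field.name(), data_type, field.is_nullable()).with_metadata(HashMap::new())
}

fn strip_fields_metadata(fields: &Fields) -> Fields {
    fields
        .iter()
        .map(|f| Arc::new(strip_field_metadata(f)))
        .collect::<Vec<_>>()
        .into()
}

/// Compare two Arrow schemas while ignoring all field metadata.
fn schemas_equal_ignoring_metadata(left: &Schema, right: &Schema) -> bool {
    let left = Schema::new(strip_fields_metadata(left.fields()));
    let right = Schema::new(strip_fields_metadata(right.fields()));
    left == right
}
```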
Okay, let's add a follow-up PR on this, then.
Force-pushed from 21468bf to 174e5e2
Thanks @fvaleye for this PR, we are almost there!
…elated issue Signed-off-by: Florian Valeye <[email protected]>
Thanks @fvaleye for this PR!
Which issue does this PR close?
What changes are included in this PR?
Implement a physical execution plan node that projects Iceberg partition columns from source data, supporting nested fields and all Iceberg transforms.
Are these changes tested?
Yes, with unit tests
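To make the "all Iceberg transforms" part of the summary concrete, here is a hedged sketch of applying one transform to an Arrow column, the way the projection node has to for each partition field. It assumes iceberg-rust's `create_transform_function` / `TransformFunction` interface; exact paths and signatures may differ between versions, and the function names here are illustrative.

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array};
use iceberg::spec::Transform;
use iceberg::transform::create_transform_function;

/// Apply the bucket[16] transform to a source column to get partition values.
/// Identity, truncate, year/month/day/hour, and void follow the same call pattern.
fn bucket_partition_values(source: ArrayRef) -> iceberg::Result<ArrayRef> {
    let bucket16 = create_transform_function(&Transform::Bucket(16))?;
    bucket16.transform(source)
}

fn main() -> iceberg::Result<()> {
    let ids: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let partition_values = bucket_partition_values(ids)?;
    println!("{partition_values:?}");
    Ok(())
}
```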