Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-50463][SQL][3.5] Fix ConstantColumnVector with Columnar to Row conversion #49131

Closed
wants to merge 1 commit into from

Conversation

LuciferYang
Copy link
Contributor

What changes were proposed in this pull request?

800faf0 frees column vector resources between batches in columnar to row conversion. However, like WritableColumnVector, ConstantColumnVector should not free resources between batches because the same data is used across batches

Why are the changes needed?

Without this change, ConstantColumnVectors with string values, for example, will fail if used with column->row conversion. For instance, reading a parquet table partitioned by a string column with multiple batches.

Does this PR introduce any user-facing change?

No

How was this patch tested?

added UT that failed before and now passes

Was this patch authored or co-authored using generative AI tooling?

no

…nversion

### What changes were proposed in this pull request?

apache@800faf0 frees column vector resources between batches in columnar to row conversion. However, like `WritableColumnVector`, `ConstantColumnVector` should not free resources between batches because the same data is used across batches

### Why are the changes needed?

Without this change, ConstantColumnVectors with string values, for example, will fail if used with column->row conversion. For instance, reading a parquet table partitioned by a string column with multiple batches.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

added UT that failed before and now passes

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#49021 from richardc-db/col_to_row_const_col_vec_fix.

Authored-by: Richard Chen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @LuciferYang .

@dongjoon-hyun
Copy link
Member

cc @richardc-db and @viirya , too

@LuciferYang
Copy link
Contributor Author

also cc @pan3793

@pan3793
Copy link
Member

pan3793 commented Dec 10, 2024

FYI: SPARK-50235 causes Iceberg test failure (JVM crash), and this follow-up PR does not help, I may not be able to provide conclusions within a short time because I'm not familiar with the "native" world.

See more details at apache/iceberg#11731

@viirya could you please give some guidance?

@dongjoon-hyun
Copy link
Member

Thank you for the reporting, @pan3793 .

@dongjoon-hyun
Copy link
Member

Thank you for re-triggerring the failed test pipelines.

@viirya
Copy link
Member

viirya commented Dec 10, 2024

FYI: SPARK-50235 causes Iceberg test failure (JVM crash), and this follow-up PR does not help, I may not be able to provide conclusions within a short time because I'm not familiar with the "native" world.

See more details at apache/iceberg#11731

@viirya could you please give some guidance?

Does Iceberg implement ColumnVector interface and have it own vector implementations which are similar to writable column vector or constant column vector?

@manuzhang
Copy link
Contributor

@viirya
Copy link
Member

viirya commented Dec 10, 2024

Iceberg has an implementation similar to ArrowColumnVector

Thanks.

So looks like it is a vector that allows writing into its values although it doesn't inherit the writable interface. Then it can overwrite closeIfFreeable.

LuciferYang pushed a commit that referenced this pull request Dec 10, 2024
…ow conversion

### What changes were proposed in this pull request?

800faf0 frees column vector resources between batches in columnar to row conversion. However, like `WritableColumnVector`, `ConstantColumnVector` should not free resources between batches because the same data is used across batches

### Why are the changes needed?
Without this change, ConstantColumnVectors with string values, for example, will fail if used with column->row conversion. For instance, reading a parquet table partitioned by a string column with multiple batches.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT that failed before and now passes

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49131 from LuciferYang/SPARK-50463-3.5.

Authored-by: Richard Chen <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
@LuciferYang
Copy link
Contributor Author

LuciferYang commented Dec 10, 2024

Merged into branch-3.5 for Apache Spark 3.5.4 RC2. Thanks @dongjoon-hyun @viirya and @pan3793

@dongjoon-hyun
Copy link
Member

Thank you, @LuciferYang and all!

@viirya
Copy link
Member

viirya commented Dec 11, 2024

The method is newly added just in branch-3.5, not released yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants