Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-36795: [C#] Implement support for dense and sparse unions #36797

Merged
merged 21 commits into from
Sep 25, 2023

Conversation

CurtHagenlocher
Copy link
Contributor

@CurtHagenlocher CurtHagenlocher commented Jul 21, 2023

What changes are included in this PR?

Support dense and sparse unions in the C# implementation.
Adds Archery support for C# unions.

Are these changes tested?

Yes

Are there any user-facing changes?

Unions are now supported in the C# implementation.

This PR includes breaking changes to public APIs.

The public APIs for the UnionArray and UnionType were changed fairly substantially. As these were previously not implemented properly, the impact of the changes ought to be minimal.

The ChunkedArray and Column classes were changed to hold IArrowArrays instead of Arrays. To accomodate this, a constructor was added which may introduce ambiguity in calling code. This could be avoided by changing the overloaded constructor to instead be a factory method. This didn't seem worthwhile but could be reconsidered.

The metadata version was finally increased to V5.
 

@github-actions
Copy link

⚠️ GitHub issue #36795 has been automatically assigned in GitHub to PR creator.

{
UnionMode.Sparse => "+us:",
UnionMode.Dense => "+ud:",
_ => throw new InvalidDataException($"Unsupported time unit for export: {unionType.Mode}"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_ => throw new InvalidDataException($"Unsupported time unit for export: {unionType.Mode}"),
_ => throw new InvalidDataException($"Unsupported union mode for export: {unionType.Mode}"),

: base(data)
{
ValidateMode(UnionMode.Dense, Type.Mode);
data.EnsureBufferCount(2); // TODO:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe link this TODO to a new issue? (and what is the TODO about, given SparseUnionArray lacks one?)

Copy link
Contributor Author

@CurtHagenlocher CurtHagenlocher Sep 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heck if I can remember... . Will just remove the TODO.

@lidavidm
Copy link
Member

lidavidm commented Sep 5, 2023

Hmm, it seems C# producing unions fails in the integration test, though

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Sep 5, 2023
@CurtHagenlocher
Copy link
Contributor Author

Hmm, it seems C# producing unions fails in the integration test, though

I thought this had worked before, but it clearly can't have. There may be more than one problem, but the obvious one is the exception thrown when comparing the data generated from the JSON template with the checked-in data. The template for union test data contains duplicate names -- two fields named "struct" and two fields named "dense". One of each field is marked as nullable in the JSON and one as non-nullable. But the data file that's been checked-in shows all four fields as being non-nullable. I'm not sure what to make of this. I don't see code in the Java or C++ implementations which special-case anything about nullability -- either in reading from the JSON template or in the implementation classes themselves.

@lidavidm
Copy link
Member

lidavidm commented Sep 6, 2023

Ah...right before 1.0 unions were changed: they no longer have a top-level nullability. I guess the files may not have been changed to reflect this.

@CurtHagenlocher
Copy link
Contributor Author

I still don't understand how this is working for the other implementations. I've looked at the code for Field and for the Field comparers and for the JSON import and there's no special-case which says "ignore the nullability bit on Field if this is a Union". I'll have to come back to it later.

@CurtHagenlocher
Copy link
Contributor Author

CurtHagenlocher commented Sep 10, 2023

It looks like some changes made to Archery in the last week have fixed the errors we were getting. I'm ... not entirely comfortable with that, I guess, but I spent a lot of time trying to understand if/what the C# implementation was doing differently without any success.

Edit: nope, looks like one problem solved, one still remains.

@lidavidm
Copy link
Member

Likely b957847? Which does now explicitly generate union fields with/without nullability to check what happens.

@lidavidm
Copy link
Member

The Java error makes it sound like C# is still writing with metadata version 4:

arrow/format/Schema.fbs

Lines 38 to 48 in fa43106

/// >= 0.8.0 (December 2017). Non-backwards compatible with V3.
V4,
/// >= 1.0.0 (July 2020. Backwards compatible with V4 (V5 readers can read V4
/// metadata and IPC messages). Implementations are recommended to provide a
/// V4 compatibility mode with V5 format changes disabled.
///
/// Incompatible changes between V4 and V5:
/// - Union buffer layout has changed. In V5, Unions don't have a validity
/// bitmap buffer.
V5,

That may also explain things, if it's triggering some compatibility path, e.g.:

// With metadata V4, we can get a validity bitmap.
// Trying to fix up union data to do without the top-level validity bitmap
// is hairy:
// - type ids must be rewritten to all have valid values (even for former
// null slots)
// - sparse union children must have their validity bitmaps rewritten
// by ANDing the top-level validity bitmap
// - dense union children must be rewritten (at least one of them)
// to insert the required null slots that were formerly omitted
// So instead we bail out.
if (out_->null_count != 0 && out_->buffers[0] != nullptr) {
return Status::Invalid(
"Cannot read pre-1.0.0 Union array with top-level validity bitmap");
}
out_->buffers[0] = nullptr;
out_->null_count = 0;
if (out_->length > 0) {
RETURN_NOT_OK(GetBuffer(buffer_index_, &out_->buffers[1]));
if (type.mode() == UnionMode::DENSE) {
RETURN_NOT_OK(GetBuffer(buffer_index_ + 1, &out_->buffers[2]));
}
}

@CurtHagenlocher
Copy link
Contributor Author

Thanks @lidavidm for your help; couldn't have made it work without you!

@lidavidm lidavidm merged commit e55f912 into apache:main Sep 25, 2023
@lidavidm lidavidm removed the awaiting merge Awaiting merge label Sep 25, 2023
@CurtHagenlocher CurtHagenlocher deleted the Union branch September 25, 2023 13:17
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit e55f912.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

etseidl pushed a commit to etseidl/arrow that referenced this pull request Sep 28, 2023
…pache#36797)

### What changes are included in this PR?

Support dense and sparse unions in the C# implementation.
Adds Archery support for C# unions.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Unions are now supported in the C# implementation.

**This PR includes breaking changes to public APIs.**

The public APIs for the UnionArray and UnionType were changed fairly substantially. As these were previously not implemented properly, the impact of the changes ought to be minimal.

The ChunkedArray and Column classes were changed to hold IArrowArrays instead of Arrays. To accomodate this, a constructor was added which may introduce ambiguity in calling code. This could be avoided by changing the overloaded constructor to instead be a factory method. This didn't seem worthwhile but could be reconsidered.

The metadata version was finally increased to V5.
 
* Closes: apache#36795

Authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: David Li <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…pache#36797)

### What changes are included in this PR?

Support dense and sparse unions in the C# implementation.
Adds Archery support for C# unions.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Unions are now supported in the C# implementation.

**This PR includes breaking changes to public APIs.**

The public APIs for the UnionArray and UnionType were changed fairly substantially. As these were previously not implemented properly, the impact of the changes ought to be minimal.

The ChunkedArray and Column classes were changed to hold IArrowArrays instead of Arrays. To accomodate this, a constructor was added which may introduce ambiguity in calling code. This could be avoided by changing the overloaded constructor to instead be a factory method. This didn't seem worthwhile but could be reconsidered.

The metadata version was finally increased to V5.
 
* Closes: apache#36795

Authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: David Li <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…pache#36797)

### What changes are included in this PR?

Support dense and sparse unions in the C# implementation.
Adds Archery support for C# unions.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Unions are now supported in the C# implementation.

**This PR includes breaking changes to public APIs.**

The public APIs for the UnionArray and UnionType were changed fairly substantially. As these were previously not implemented properly, the impact of the changes ought to be minimal.

The ChunkedArray and Column classes were changed to hold IArrowArrays instead of Arrays. To accomodate this, a constructor was added which may introduce ambiguity in calling code. This could be avoided by changing the overloaded constructor to instead be a factory method. This didn't seem worthwhile but could be reconsidered.

The metadata version was finally increased to V5.
 
* Closes: apache#36795

Authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: David Li <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…pache#36797)

### What changes are included in this PR?

Support dense and sparse unions in the C# implementation.
Adds Archery support for C# unions.

### Are these changes tested?

Yes

### Are there any user-facing changes?

Unions are now supported in the C# implementation.

**This PR includes breaking changes to public APIs.**

The public APIs for the UnionArray and UnionType were changed fairly substantially. As these were previously not implemented properly, the impact of the changes ought to be minimal.

The ChunkedArray and Column classes were changed to hold IArrowArrays instead of Arrays. To accomodate this, a constructor was added which may introduce ambiguity in calling code. This could be avoided by changing the overloaded constructor to instead be a factory method. This didn't seem worthwhile but could be reconsidered.

The metadata version was finally increased to V5.
 
* Closes: apache#36795

Authored-by: Curt Hagenlocher <[email protected]>
Signed-off-by: David Li <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C#] Implement support for dense and sparse unions
2 participants