Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33592: [C++] support casting nullable fields to non-nullable if there are no null values #43782

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

NickCrews
Copy link
Contributor

@NickCrews NickCrews commented Aug 21, 2024

Notes for myself/fixer:

  • tests that need to get updated (almost definitely not a complete list)
  • hmm, looks like go wrapper does its own nullability checks. I assume this is just an optimization to not have to go into the C++ and hit the error down there. So if we just delete this check then I think the C++ logic will handle all of it???
  • how the plain column handles this cast, some logic like this probably needs to get ported over to the struct implementation
  • cmake .. --preset ninja-debug-basic, then cmake --build ., then ctest -j16 --output-on-failure to run all tests, or ctest -R 'arrow-compute-scalar-cast-test' to run the specific one

Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@NickCrews NickCrews changed the title [C++][Go]: support casting nullable fields to non-nullable if there are no null values GH-33592 [C++][Go]: support casting nullable fields to non-nullable if there are no null values Aug 21, 2024
Copy link

⚠️ GitHub issue #33592 has been automatically assigned in GitHub to PR creator.

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch 3 times, most recently from 52100dd to 49d5a99 Compare August 22, 2024 00:40
@rustyconover
Copy link

This PR looks pretty good to me, what additional help would you like getting it merged?

@NickCrews
Copy link
Contributor Author

Thanks for the help! I'm a random contributor, so I'm not sure about the exact workflow, but I think both getting the CI run approved and having a C++ owner approve it are needed.

Any more detailed thoughts on if the tests are adequate, if there's anywhere else in the code base that you think needs to change, my handling of the go implementation by just deleting the shortcircuit, or any other more detailed thoughts? Basically anything that would reduce the mental load on the code owner I think would increase the odds they approve it :)

@mapleFU
Copy link
Member

mapleFU commented Sep 15, 2024

You can mark as ready for review? Or this is still wip?

@NickCrews NickCrews marked this pull request as ready for review September 15, 2024 20:11
@NickCrews
Copy link
Contributor Author

Oops, didn't mean for this to still be marked WIP. Looks like a few failures that need to get fixed, but still a high-level review of the general approach would still be appreciated in the meantime.

@zeroshade
Copy link
Member

You probably need the same corresponding check on the Go side to verify that it's only allowed if there are no nulls.

Also, the Go implementation has been moved to the apache/arrow-go repository. So please file the PR there for the Go side. Sorry for the confusion, we just haven't removed the Go code from here yet.

@NickCrews
Copy link
Contributor Author

Thanks @zeroshade , will do. Do we need both PRs to land at about the same time, or can I do that independently?

I will undo my changes to the go code here, leaving it untouched.

@zeroshade
Copy link
Member

They can land independently, no issues there. Thanks!

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from 49d5a99 to eb9e7b6 Compare September 16, 2024 16:27
@NickCrews
Copy link
Contributor Author

Just pushed a new version:

  • dropped the Go changes
  • rebased on top of 3600db8, which is one commit behind main, because main is failing CI, but the commit I chose passed CI. I did this because I think the failing CI checks in this PR are not related to this PR
  • I discovered archery and formatted the files with archery lint --clang-format --fix

We will see if this passes CI now...

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from eb9e7b6 to e6d54d3 Compare September 18, 2024 07:49
@NickCrews
Copy link
Contributor Author

Pushed a new version, hopefully that fixed the broken tests:

  • To go from nullable to non-nullable type, need to use the unsafe cast option
  • fix typo of int8 to int64 so the precision matches
  • removed the [Go] tag from the commit message

@NickCrews
Copy link
Contributor Author

@zeroshade I think this is ready to review/merge, the failing CI runs look like flakes when trying to setup the environment?

@kou
Copy link
Member

kou commented Sep 19, 2024

Can we move this to apache/arrow-go?

@kou kou changed the title GH-33592 [C++][Go]: support casting nullable fields to non-nullable if there are no null values GH-33592: [C++] support casting nullable fields to non-nullable if there are no null values Sep 19, 2024
@kou
Copy link
Member

kou commented Sep 19, 2024

Ah, the Go part was removed from this PR.
I've removed "[Go]" from the PR title.

@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from e6d54d3 to 106e627 Compare September 24, 2024 18:53
@NickCrews
Copy link
Contributor Author

NickCrews commented Sep 24, 2024

Those CI failures look unrelated? Not sure, CI history looks pretty flaky to me.

[----------] 1 test from LargeRowEncryptionTest
[ RUN      ] LargeRowEncryptionTest.ReadEncryptLargeRows
/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:159: Failure
Failed
'_error_or_value28.status()' failed with IOError: AesDecryptor was wiped outDeserializing page header failed.

/arrow/cpp/src/arrow/dataset/file_parquet_encryption_test.cc:265: Failure
Expected: TestScanDataset() doesn't generate new fatal failures in the current thread.
  Actual: it does.
[  FAILED  ] LargeRowEncryptionTest.ReadEncryptLargeRows (49 ms)
[----------] 1 test from LargeRowEncryptionTest (49 ms total)

yota-p added a commit to yota-p/iceberg-python that referenced this pull request Nov 16, 2024
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch 4 times, most recently from f2e4079 to fdb99d1 Compare December 13, 2024 09:40
@NickCrews NickCrews force-pushed the field-cast-nullable-to-nonnullable branch from fdb99d1 to 1b2c940 Compare December 13, 2024 10:19
@NickCrews
Copy link
Contributor Author

The current CI failures look to be caused by #44985 (or similar), not by this PR.

I rebased on main, everything looks pretty much the same as before.

@zeroshade can you give this a review when you get the chance, or let me know how I can make it easier for you to do so? @rustyconover if you want to take a look again, maybe that will help. Thank you both!

Copy link
Contributor

@zanmato1984 zanmato1984 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small question about null count checking.

if (field_index == kFillNullSentinel) {
ARROW_ASSIGN_OR_RAISE(auto nulls,
MakeArrayOfNull(target_type->GetSharedPtr(), batch.length));
out_array->child_data.push_back(nulls->data());
} else {
const auto& values = (in_array.child_data[field_index].ToArrayData()->Slice(
in_array.offset, in_array.length));
if (!target_field->nullable() && values->null_count > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible if values->null_count is kUnknownNullCount? In which case, we should probably use child array's GetNullCount() instead?

And if we are not sure about that, can we have a dedicated test for it?

Copy link
Member

@mapleFU mapleFU Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think values->null_count can likely to be kUnknownNullCount. values->null_count > 0 means must has null, however, when values->null_count = kUnknownNullCount`, it would be wrongly casted to new field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never heard of kUnknownNullCount. I'm not sure what we should do here, I think I need to know a lot more background. Can someone who has experience with null counts help?

Some questions this brings up:

  1. This sounds like we sometimes don't know the null count, I assume because it is expensive to compute. Would this be a good reason to actually force computation to happen (behavior A), or should we be conservative, and if we find kUnkownNullCount, then we assume that there could be nulls and don't let the cast happen and therefore error (behavior B). Behavior A seems much more correct to me.
  2. I don't think this function should have the responsibility of looking at the child arrays null counts. Should we add some higher level API to arrays like nullCountOrChildNullCount() that actually does this messy work??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have never heard of kUnknownNullCount. I'm not sure what we should do here, I think I need to know a lot more background. Can someone who has experience with null counts help?

Sorry I'm not sure if there is a well-documented rule of dealing with kUnknownNullCount, maybe @pitrou does? But in practice, I (and many other developers I suppose) just keep in mind that the null count can be unknown and is lazily computed. One can enforce such computation once the null count is absolutely needed (there are various APIs helping on this).

  1. This sounds like we sometimes don't know the null count, I assume because it is expensive to compute. Would this be a good reason to actually force computation to happen (behavior A), or should we be conservative, and if we find kUnkownNullCount, then we assume that there could be nulls and don't let the cast happen and therefore error (behavior B). Behavior A seems much more correct to me.

Yeah your assumption is right. And I would go A (enforcing the null count computation) because IMO this is essential for what this PR is trying to do. Plus, null count is unknown by default, so B will make this change almost useless.

  1. I don't think this function should have the responsibility of looking at the child arrays null counts. Should we add some higher level API to arrays like nullCountOrChildNullCount() that actually does this messy work??

IMO this is where you must make contact with child arrays, given that you are in the context of dealing with a Struct array - which is essentially a container of child arrays. And the original code already does so. Defining an API like nullCountOrChildNullCount() could be semantically hard/useless because: 1) the validity (null or not) of the whole struct and 2) the validity of each individual field (and you have to specify which field), mean very different things. So the null count of a particular field is a property of this particular field only and should be computed/checked individually.

Copy link
Member

@pitrou pitrou Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should just call GetNullCount so that the null count is computed if not already known. Computation is reasonably cheap.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 18, 2024
if (field_index == kFillNullSentinel) {
ARROW_ASSIGN_OR_RAISE(auto nulls,
MakeArrayOfNull(target_type->GetSharedPtr(), batch.length));
out_array->child_data.push_back(nulls->data());
} else {
const auto& values = (in_array.child_data[field_index].ToArrayData()->Slice(
in_array.offset, in_array.length));
if (!target_field->nullable() && values->null_count > 0) {
return Status::TypeError("field '", target_field->name(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, should we discard the buffer[0] if new data doesn't contains any nulls?

if (field_index == kFillNullSentinel) {
ARROW_ASSIGN_OR_RAISE(auto nulls,
MakeArrayOfNull(target_type->GetSharedPtr(), batch.length));
out_array->child_data.push_back(nulls->data());
} else {
const auto& values = (in_array.child_data[field_index].ToArrayData()->Slice(
in_array.offset, in_array.length));
if (!target_field->nullable() && values->null_count > 0) {
return Status::TypeError("field '", target_field->name(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is data-dependent, this should be Status::Invalid.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants