Skip to content

Commit

Permalink
doc: why nullable of list item is set to true (#11626)
Browse files Browse the repository at this point in the history
* doc: why nullable of list item is set to true

* Adds an external doc to avoid repeating text

* rewrite

* redirects to external doc

* Adds ASF license

* Minor: formatting fixes

* Minor: copy edits

* Retrigger CI

* Fixes: name of aggregation in example

In `array_agg` the list is nullable, so changed the example to
`nth_value` where the list is not nullable to be correct.

* Disambiguates list item nullability in copy

---------

Co-authored-by: Andrew Lamb <[email protected]>
  • Loading branch information
jcsherin and alamb authored Jul 26, 2024
1 parent 2eaf1ea commit d6e016e
Show file tree
Hide file tree
Showing 6 changed files with 83 additions and 4 deletions.
77 changes: 77 additions & 0 deletions datafusion/functions-aggregate/COMMENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
<!---
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Why Is List Item Always Nullable?

## Motivation

There were independent proposals to make the `nullable` setting of list
items in accumulator state be configurable. This meant adding additional
fields which captured the `nullable` setting from schema in planning for
the first argument to the aggregation function, and the returned value.

These fields were to be added to `StateFieldArgs`. But then we found out
that aggregate computation does not depend on it, and it can be avoided.

This document exists to make that reasoning explicit.

## Background

The list data type is used in the accumulator state for a few aggregate
functions like:

- `sum`
- `count`
- `array_agg`
- `bit_and`, `bit_or` and `bit_xor`
- `nth_value`

In all of the above cases the data type of the list item is equivalent
to either the first argument of the aggregate function or the returned
value.

For example, in `array_agg` the data type of item is equivalent to the
first argument and the definition looks like this:

```rust
// `args` : `StateFieldArgs`
// `input_type` : data type of the first argument
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
Field::new("item", args.input_type.clone(), true /* nullable of list item */ ),
false, // nullable of list itself
)];
```

For all the aggregates listed above, the list item is always defined as
nullable.

## Computing Intermediate State

By setting `nullable` (of list item) to be always `true` like this we
ensure that the aggregate computation works even when nulls are
present. The advantage of doing it this way is that it eliminates the
need for additional code and special treatment of nulls in the
accumulator state.

## Nullable Of List Itself

The `nullable` of list itself depends on the aggregate. In the case of
`array_agg` the list is nullable(`true`), meanwhile for `sum` the list
is not nullable(`false`).
2 changes: 2 additions & 0 deletions datafusion/functions-aggregate/src/array_agg.rs
Original file line number Diff line number Diff line change
Expand Up @@ -89,13 +89,15 @@ impl AggregateUDFImpl for ArrayAgg {
if args.is_distinct {
return Ok(vec![Field::new_list(
format_state_name(args.name, "distinct_array_agg"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
true,
)]);
}

let mut fields = vec![Field::new_list(
format_state_name(args.name, "array_agg"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
true,
)];
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/bit_and_or_xor.rs
Original file line number Diff line number Diff line change
Expand Up @@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
args.name,
format!("{} distinct", self.name()).as_str(),
),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.return_type.clone(), true),
false,
)])
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/count.rs
Original file line number Diff line number Diff line change
Expand Up @@ -125,6 +125,7 @@ impl AggregateUDFImpl for Count {
if args.is_distinct {
Ok(vec![Field::new_list(
format_state_name(args.name, "count distinct"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
false,
)])
Expand Down
5 changes: 1 addition & 4 deletions datafusion/functions-aggregate/src/nth_value.rs
Original file line number Diff line number Diff line change
Expand Up @@ -134,10 +134,7 @@ impl AggregateUDFImpl for NthValueAgg {
fn state_fields(&self, args: StateFieldsArgs) -> Result<Vec<Field>> {
let mut fields = vec![Field::new_list(
format_state_name(self.name(), "nth_value"),
// TODO: The nullability of the list element should be configurable.
// The hard-coded `true` should be changed once the field for
// nullability is added to `StateFieldArgs` struct.
// See: https://github.com/apache/datafusion/pull/11063
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.input_type.clone(), true),
false,
)];
Expand Down
1 change: 1 addition & 0 deletions datafusion/functions-aggregate/src/sum.rs
Original file line number Diff line number Diff line change
Expand Up @@ -174,6 +174,7 @@ impl AggregateUDFImpl for Sum {
if args.is_distinct {
Ok(vec![Field::new_list(
format_state_name(args.name, "sum distinct"),
// See COMMENTS.md to understand why nullable is set to true
Field::new("item", args.return_type.clone(), true),
false,
)])
Expand Down

0 comments on commit d6e016e

Please sign in to comment.