Update the CONCAT scalar function to support Utf8View #12224

Merged
merged 18 commits into from
Sep 3, 2024
Changes from 8 commits
19 changes: 18 additions & 1 deletion datafusion/functions/src/string/common.rs
@@ -255,22 +255,29 @@ pub(crate) enum ColumnarValueRef<'a> {
Scalar(&'a [u8]),
NullableArray(&'a StringArray),
NonNullableArray(&'a StringArray),
NullableStringViewArray(&'a StringViewArray),
NonNullableStringViewArray(&'a StringViewArray),
}

impl<'a> ColumnarValueRef<'a> {
#[inline]
pub fn is_valid(&self, i: usize) -> bool {
match &self {
Self::Scalar(_) | Self::NonNullableArray(_) => true,
Self::NonNullableStringViewArray(_) => true,
Self::NullableArray(array) => array.is_valid(i),
Self::NullableStringViewArray(array) => array.is_valid(i),
}
}

#[inline]
pub fn nulls(&self) -> Option<NullBuffer> {
match &self {
Self::Scalar(_) | Self::NonNullableArray(_) => None,
Self::Scalar(_)
| Self::NonNullableArray(_)
| Self::NonNullableStringViewArray(_) => None,
Self::NullableArray(array) => array.nulls().cloned(),
Self::NullableStringViewArray(array) => array.nulls().cloned(),
}
}
}
@@ -389,10 +396,20 @@ impl StringArrayBuilder {
.extend_from_slice(array.value(i).as_bytes());
}
}
ColumnarValueRef::NullableStringViewArray(array) => {
if !CHECK_VALID || array.is_valid(i) {
self.value_buffer
.extend_from_slice(array.value(i).as_bytes());
}
}
ColumnarValueRef::NonNullableArray(array) => {
self.value_buffer
.extend_from_slice(array.value(i).as_bytes());
}
ColumnarValueRef::NonNullableStringViewArray(array) => {
self.value_buffer
.extend_from_slice(array.value(i).as_bytes());
}
}
}
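
The `write::<CHECK_VALID>` pattern in the hunk above can be illustrated in isolation. The following is a minimal sketch, not the DataFusion implementation: `Column`, `RowBuilder`, and `concat_row` are hypothetical stand-ins that use plain slices instead of Arrow arrays, and the behavior for `CHECK_VALID = false` (a null is written as an empty string) is an assumption made for this example.

```rust
/// Simplified, hypothetical stand-in for `ColumnarValueRef`: a column is either
/// one scalar repeated for every row, or an array whose slots may be null.
enum Column<'a> {
    Scalar(&'a str),
    Nullable(&'a [Option<&'a str>]),
}

/// Hypothetical builder that appends row values into one contiguous byte buffer.
#[derive(Default)]
struct RowBuilder {
    value_buffer: Vec<u8>,
}

impl RowBuilder {
    /// Appends row `i` of `column`. `CHECK_VALID` is a compile-time flag, so the
    /// null check is compiled out entirely when the caller passes `false`.
    fn write<const CHECK_VALID: bool>(&mut self, column: &Column<'_>, i: usize) {
        match column {
            Column::Scalar(s) => self.value_buffer.extend_from_slice(s.as_bytes()),
            Column::Nullable(values) => {
                if !CHECK_VALID || values[i].is_some() {
                    // Assumption for this sketch: an unchecked null contributes nothing.
                    self.value_buffer
                        .extend_from_slice(values[i].unwrap_or("").as_bytes());
                }
            }
        }
    }
}

/// Concatenates row `i` across all columns, skipping null slots.
fn concat_row(columns: &[Column<'_>], i: usize) -> String {
    let mut builder = RowBuilder::default();
    for column in columns {
        builder.write::<true>(column, i);
    }
    String::from_utf8(builder.value_buffer).expect("inputs were valid UTF-8")
}
```

Because every variant ends up appending bytes to the same buffer, adding the two `StringViewArray` variants only needs new match arms, which is what the diff above does.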

99 changes: 80 additions & 19 deletions datafusion/functions/src/string/concat.rs
@@ -15,14 +15,13 @@
// specific language governing permissions and limitations
// under the License.

use arrow::array::{Array, StringViewArray};
use arrow::datatypes::DataType;
use std::any::Any;
use std::sync::Arc;

use arrow::datatypes::DataType;
use arrow::datatypes::DataType::Utf8;

use datafusion_common::cast::as_string_array;
use datafusion_common::{internal_err, Result, ScalarValue};
use datafusion_common::cast::{as_string_array, as_string_view_array};
use datafusion_common::{internal_err, plan_err, Result, ScalarValue};
use datafusion_expr::expr::ScalarFunction;
use datafusion_expr::simplify::{ExprSimplifyResult, SimplifyInfo};
use datafusion_expr::{lit, ColumnarValue, Expr, Volatility};
@@ -46,7 +45,10 @@ impl ConcatFunc {
pub fn new() -> Self {
use DataType::*;
Self {
signature: Signature::variadic(vec![Utf8], Volatility::Immutable),
signature: Signature::variadic(
vec![Utf8, Utf8View, LargeUtf8],
Volatility::Immutable,
),
}
}
}
@@ -64,13 +66,19 @@ impl ScalarUDFImpl for ConcatFunc {
&self.signature
}

fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType> {
Ok(Utf8)
fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> {
use DataType::*;
Ok(match &arg_types[0] {
Utf8View => Utf8View,
LargeUtf8 => LargeUtf8,
_ => Utf8,
})
Member

the logic seems to assume all arguments are of the same type?

also, why not always return Utf8?
the code performing actual concatenation seems to be always the same.

Contributor Author

Ah yeah @findepi I think the logic is "Whatever the first argument type is the output should be of that type" so if the received values were: Utf8, Utf8View the output would be Utf8. I'm taking the logic from other UDFs and applying it here. It may not be the best way of doing this though.

Member

I think the logic is "Whatever the first argument type is the output should be of that type"

i understand this is what's implemented. but not sure why it is so.
what's the exact benefit of presenting the data as string view, if we computed the exact string anyway, and we technically don't need to represent it as a string view?

Contributor Author

Ah okay. So what you're saying is that here: https://github.com/apache/datafusion/pull/12224/files#diff-71970189679c6dd5b3b677bb21603234b488e68d1601be9c4d400d40e430a909R204 I'm building a Utf8 string anyways?

So I suspect I should change that bit of code to use a StringViewArrayBuilder?

Member

yes, that's my intuition
except that i would keep building StringArray and just declare return type as Utf8 always

from the issue #11836

Currently, a call to CONCAT with a Utf8View datatypes induces a cast. After the change that fixes this issue, it should not.

this is about inputs to the function, not the return type

Side note:
String view could be an interesting return type if we wanted to optimize for a single non-null string view input and let it pass through; but the code doesn't do this today, and i'm not sure it's worth implementing for this edge case. it should also be independent of argument order, i.e. not tied to the first input's type.
end of side note.

Contributor Author

@findepi what about LargeUtf8? I suspect that if a LargeUtf8 is the input then the output should also be LargeUtf8, since it uses i64 offsets vs the i32 offsets of Utf8?

Member

i don't know the exact rules for how we handle LargeUtf8.
i focused on the string view portion of the PR. from "string views perspective", LargeUtf8 is non-issue, so IMO it's fine not to change the return type with respect to LargeUtf8 in this PR. but i agree that we probably should return LargeUtf8 when any input is LargeUtf8 (or whatever exactly the logic should be).

in fact, what does the binary concat operator do?
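
The i32 vs i64 point raised above comes down to how much string data the offsets can address. As a rough sketch (the helper name `fits_in_utf8_offsets` is hypothetical and not part of DataFusion or arrow-rs), the total byte length of the values in a `Utf8` array has to fit in a 32-bit signed offset, while `LargeUtf8` uses 64-bit offsets:

```rust
/// Hypothetical helper: sums the byte lengths of all values and reports whether
/// the total can still be addressed by the i32 offsets of a Utf8 array.
/// Anything larger needs the i64 offsets of LargeUtf8.
fn fits_in_utf8_offsets(value_lengths: &[usize]) -> bool {
    let total: u64 = value_lengths.iter().map(|&len| len as u64).sum();
    total <= i32::MAX as u64
}
```

This is one reason concatenating `LargeUtf8` inputs into a `Utf8` result can be risky: the inputs may fit their own offsets while the combined output does not.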

Contributor Author

@findepi when you say binary concat operator are you talking about || as an operator?

Member

when you say binary concat operator are you talking about || as an operator?

yes

the logic seems to assume all arguments are of the same type?

also, why not always return Utf8? the code performing actual concatenation seems to be always the same.

the question still holds (why exactly we bias towards the first param type), but i am no longer convinced about my suggestion to use Utf8 always.

i think we should "just" make sure concat(a, b, c) is type-equivalent to a || b || c.
the ||'s logic apparently is

  • if any of the operands are Utf8View, the result is Utf8View
  • else, if any of the operands are LargeUtf8, the result is LargeUtf8

cc @alamb

Contributor Author

Sounds good @findepi i like that logic. I can adjust to make it so.
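
The rule agreed on in this thread can be sketched as follows. This is an illustration only: `StringType` is a hypothetical stand-in for the three string variants of `arrow::datatypes::DataType`, `concat_return_type` is not the function that was merged, and the final fallback to `Utf8` is an assumption, since the thread only spells out the first two cases.

```rust
/// Hypothetical stand-in for the three Arrow string types discussed above;
/// real code would match on `arrow::datatypes::DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StringType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Picks the result type by looking at every argument, not just the first,
/// so the answer does not depend on argument order.
fn concat_return_type(arg_types: &[StringType]) -> StringType {
    if arg_types.contains(&StringType::Utf8View) {
        // Any Utf8View operand makes the result Utf8View.
        StringType::Utf8View
    } else if arg_types.contains(&StringType::LargeUtf8) {
        // Otherwise any LargeUtf8 operand makes the result LargeUtf8.
        StringType::LargeUtf8
    } else {
        // Assumed fallback: plain Utf8.
        StringType::Utf8
    }
}
```

With this shape, `concat(a, b, c)` and `a || b || c` would resolve to the same type for any ordering of the operands, which is the property the reviewer asked for.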

}

/// Concatenates the text representations of all the arguments. NULL arguments are ignored.
/// concat('abcde', 2, NULL, 22) = 'abcde222'
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> {
let args_datatype = args[0].data_type();
let array_len = args
.iter()
.filter_map(|x| match x {
@@ -87,7 +95,21 @@ impl ScalarUDFImpl for ConcatFunc {
result.push_str(v);
}
}
return Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(result))));

return match args_datatype {
DataType::Utf8View => {
Ok(ColumnarValue::Scalar(ScalarValue::Utf8View(Some(result))))
}
DataType::Utf8 => {
Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(result))))
}
DataType::LargeUtf8 => {
Ok(ColumnarValue::Scalar(ScalarValue::LargeUtf8(Some(result))))
}
other => {
plan_err!("Concat function does not support datatype of {other}")
}
};
}

// Array
@@ -103,15 +125,40 @@ impl ScalarUDFImpl for ConcatFunc {
columns.push(ColumnarValueRef::Scalar(s.as_bytes()));
}
}
ColumnarValue::Scalar(ScalarValue::Utf8View(maybe_value)) => {
if let Some(s) = maybe_value {
data_size += s.len() * len;
columns.push(ColumnarValueRef::Scalar(s.as_bytes()));
}
}
ColumnarValue::Array(array) => {
let string_array = as_string_array(array)?;
data_size += string_array.values().len();
let column = if array.is_nullable() {
ColumnarValueRef::NullableArray(string_array)
} else {
ColumnarValueRef::NonNullableArray(string_array)
match array.data_type() {
DataType::Utf8 | DataType::LargeUtf8 => {
let string_array = as_string_array(array)?;

data_size += string_array.values().len();
let column = if array.is_nullable() {
ColumnarValueRef::NullableArray(string_array)
} else {
ColumnarValueRef::NonNullableArray(string_array)
};
columns.push(column);
},
DataType::Utf8View => {
let string_array = as_string_view_array(array)?;

data_size += string_array.len();
let column = if array.is_nullable() {
ColumnarValueRef::NullableStringViewArray(string_array)
} else {
ColumnarValueRef::NonNullableStringViewArray(string_array)
};
columns.push(column);
},
other => {
return plan_err!("Input was {other} which is not a supported datatype for concat function")
}
};
columns.push(column);
}
_ => unreachable!(),
}
@@ -124,7 +171,20 @@ impl ScalarUDFImpl for ConcatFunc {
.for_each(|column| builder.write::<true>(column, i));
builder.append_offset();
}
Ok(ColumnarValue::Array(Arc::new(builder.finish(None))))
let string_array = builder.finish(None);

match args_datatype {
DataType::Utf8 | DataType::LargeUtf8 => {
Ok(ColumnarValue::Array(Arc::new(string_array)))
}
DataType::Utf8View => {
let string_array_iter = string_array.into_iter();
Ok(ColumnarValue::Array(Arc::new(StringViewArray::from_iter(
string_array_iter,
))))
}
_ => unreachable!(),
}
}

/// Simplify the `concat` function by
@@ -151,11 +211,11 @@ pub fn simplify_concat(args: Vec<Expr>) -> Result<ExprSimplifyResult> {
for arg in args.clone() {
match arg {
// filter out `null` args
Expr::Literal(ScalarValue::Utf8(None) | ScalarValue::LargeUtf8(None)) => {}
Expr::Literal(ScalarValue::Utf8(None) | ScalarValue::LargeUtf8(None) | ScalarValue::Utf8View(None)) => {}
// All literals have been converted to Utf8 or LargeUtf8 in type_coercion.
// Concatenate it with the `contiguous_scalar`.
Expr::Literal(
ScalarValue::Utf8(Some(v)) | ScalarValue::LargeUtf8(Some(v)),
ScalarValue::Utf8(Some(v)) | ScalarValue::LargeUtf8(Some(v)) | ScalarValue::Utf8View(Some(v)),
) => contiguous_scalar += &v,
Expr::Literal(x) => {
return internal_err!(
@@ -197,6 +257,7 @@ mod tests {
use crate::utils::test::test_function;
use arrow::array::Array;
use arrow::array::{ArrayRef, StringArray};
use DataType::*;

#[test]
fn test_functions() -> Result<()> {
49 changes: 46 additions & 3 deletions datafusion/sqllogictest/test_files/string_view.slt
@@ -768,17 +768,26 @@ logical_plan
01)Projection: character_length(test.column1_utf8view) AS l
02)--TableScan: test projection=[column1_utf8view]

## Ensure no casts for CONCAT
## TODO https://github.com/apache/datafusion/issues/11836
## Ensure no casts for CONCAT Utf8View
query TT
EXPLAIN SELECT
concat(column1_utf8view, column2_utf8view) as c
FROM test;
----
logical_plan
01)Projection: concat(CAST(test.column1_utf8view AS Utf8), CAST(test.column2_utf8view AS Utf8)) AS c
01)Projection: concat(test.column1_utf8view, test.column2_utf8view) AS c
Contributor

💪

02)--TableScan: test projection=[column1_utf8view, column2_utf8view]

## Ensure no casts for CONCAT LargeUtf8
query TT
EXPLAIN SELECT
concat(column1_large_utf8, column2_large_utf8) as c
FROM test;
----
logical_plan
01)Projection: concat(test.column1_large_utf8, test.column2_large_utf8) AS c
02)--TableScan: test projection=[column1_large_utf8, column2_large_utf8]

## Ensure no casts for CONCAT_WS
## TODO https://github.com/apache/datafusion/issues/11837
query TT
@@ -863,6 +872,39 @@ XIANGPENG
RAPHAEL
NULL

## Should run CONCAT successfully
query T
SELECT
concat(column1_utf8view, column2_utf8view) as c
FROM test;
----
AndrewX
XiangpengXiangpeng
RaphaelR
R

## Should run CONCAT successfully with utf8 and utf8view
query T
SELECT
concat(column1_utf8view, column2_utf8) as c
FROM test;
----
AndrewX
XiangpengXiangpeng
RaphaelR
R

## Should run CONCAT successfully with utf8 utf8view and largeutf8
query T
SELECT
concat(column1_utf8view, column2_utf8, column2_large_utf8) as c
FROM test;
----
AndrewXX
XiangpengXiangpengXiangpeng
RaphaelRR
RR

## Ensure no casts for LPAD
query TT
EXPLAIN SELECT
@@ -1307,3 +1349,4 @@ select column2|| ' ' ||column3 from temp;
----
rust fast
datafusion cool