-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Type Coercion for UDF Arguments #14268
base: main
Are you sure you want to change the base?
Conversation
signature: Signature::one_of( | ||
vec![ | ||
TypeSignature::String(1), | ||
TypeSignature::Coercible(vec![TypeSignatureClass::Native( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use coercible(string)
, we don't need string
since it is a more strict rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I initially did, but after testing on Sail, I discovered new test failures related to coercing input that's all String (e.g. func(Utf8, Utf8View)
).
The plan is to port all the relevant tests from Sail into this PR!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 You can find the test failures here if interested!
lakehq/sail@372e13b#diff-bb36d996163d98235e107f8203c9c24be34ef71c84f8afa4a420a8e483102e3e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although it was not my intention to apply this pattern on single arg functions. I'll get that fixed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current design for coercion may still have room for improvement. It would be beneficial to represent the function signature in a simpler and more concise manner, rather than relying on complex combinations of multiple, similar signatures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed! I'll add that in.
} | ||
|
||
#[test] | ||
fn test_ascii_expr() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have preferred to place the various UDF tests within their respective files, but I couldn't due to circular dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ending up putting the tests in the .slt
file, but figured we can still leave this test here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this test if the purpose of the test is covered already in slt
|
||
/// Recursively traverses [`DataType::Dictionary`] to get the underlying value [`DataType`]. | ||
/// For non-dictionary types, returns the default [`DataType`]. | ||
fn base_dictionary_type_or_default_type(data_type: &DataType) -> DataType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can add dictionary support to datafusion/common/src/utils/mod.rs
base_type
50 | ||
|
||
query I | ||
SELECT ascii(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In DuckDB, this doesn't work
D select ascii(2);
Binder Error: No function matches the given name and argument types 'ascii(INTEGER_LITERAL)'. You might need to add explicit type casts.
Candidate functions:
ascii(VARCHAR) -> INTEGER
LINE 1: select ascii(2);
^
D select ascii('2');
┌────────────┐
│ ascii('2') │
│ int32 │
├────────────┤
│ 50 │
└────────────┘
So does Postgres
postgres=# select ascii(2);
ERROR: function ascii(integer) does not exist
LINE 1: select ascii(2);
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# select ascii('2');
ascii
-------
50
(1 row)
This used to work in DataFusion, but it is not consistent with other systems. I'm not sure whether we should introduce this behaviour back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior is consistent with Spark and the tests also come from the Spark codebase (which we ported to Sail then here lol).
spark-sql (default)> SELECT ASCII(2);
50
Time taken: 0.952 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT ASCII('2');
50
Time taken: 0.044 seconds, Fetched 1 row(s)
spark-sql (default)>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this belongs to Spark-functions crate instead of datafusion core?
@@ -571,8 +601,10 @@ select repeat('-1.2', arrow_cast(3, 'Int32')); | |||
---- | |||
-1.2-1.2-1.2 | |||
|
|||
query error DataFusion error: Error during planning: Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64 | |||
query T | |||
select repeat('-1.2', 3.2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be error, the 2nd argument should be integer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the 2nd arg should be TypeSignatureClass::Integer
instead of TypeSignatureClass::Native(logical_int64())
? It's currently unimplemented with a TODO
:
datafusion/datafusion/expr-common/src/signature.rs
Lines 216 to 218 in f775791
// TODO: | |
// Numeric | |
// Integer |
Otherwise, it would seem strange to deviate from the behavior specified in the implementation of default_cast_for
for NativeType
.
I do think that we should prioritize being permissive and general-purpose though as I was discussing here: #14296 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think non-integer 2nd arg should return error
} | ||
|
||
#[test] | ||
fn test_ascii_expr() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this test if the purpose of the test is covered already in slt
@@ -584,23 +544,7 @@ fn get_valid_types( | |||
match target_type_class { | |||
TypeSignatureClass::Native(native_type) => { | |||
let target_type = native_type.native(); | |||
if &logical_type == target_type { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does that mean others function that used Coercible String now also cast integer to string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's TypeSignature::Coercible
with a TypeSignatureClass::Native(logical_string())
, then yes. Any function that specifies TypeSignature::Coercible
with a TypeSignatureClass::Native
should coerce according to the behavior implemented in the default_cast_for
function for NativeType
in order to be consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeSignature::Coercible
is designed to cast between the same logical type, but now it casts to any kind of type, I don't think this is ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, in this PR, SELECT bit_length(12);
now return 16 instead of error, but I think we should error. Any unexpected type is valid which doesn't seem correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TypeSignature::Coercible
is designed to cast between the same logical type, but now it casts to any kind of type, I don't think this is ideal.
This is what the doc comments say:
/// One or more arguments belonging to the [`TypeSignatureClass`], in order.
///
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
///
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
Coercible(Vec<TypeSignatureClass>),
Link:
datafusion/datafusion/expr-common/src/signature.rs
Lines 127 to 134 in a4917d4
/// One or more arguments belonging to the [`TypeSignatureClass`], in order. | |
/// | |
/// For example, `Coercible(vec![logical_float64()])` accepts | |
/// arguments like `vec![Int32]` or `vec![Float32]` | |
/// since i32 and f32 can be cast to f64 | |
/// | |
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`]. | |
Coercible(Vec<TypeSignatureClass>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shehabgamin
I suggest with revert to the previous change and add binary->string casting given most of the string function allow this kind of casting, but not int->string like ascii(2)
For Signature::String, we can make it a convenient wrapper of Coercible(string, string)
or deprecate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a rule of thumb, stick to either Postgres or DuckDB for functions, and avoid overly broad type casting that allows any type to convert to any other. This can turn invalid queries into valid ones, making errors harder to catch, especially when tests don’t cover all cases.
Based on that,
- Cast from Binary to String 👍🏻
- Cast from Numeric to String 👎🏻 (At least for the string functions changed in this PR)
- Cast from Float to Integer 👎🏻 (At least not for
repeat
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @shehabgamin for this contribution.
Thank you to @jayzhan211 -- from my perspective this PR is well written and tested and fixes some regressions
Unless there are any concerns, I'll plan to merge this tomorrow.
@@ -59,7 +64,13 @@ impl Default for ContainsFunc { | |||
impl ContainsFunc { | |||
pub fn new() -> Self { | |||
Self { | |||
signature: Signature::string(2, Volatility::Immutable), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 / @notfilippo should we deprecate Signature::string
and direct people to Signature::coerceable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I'm quite confuse why we need to change to coercible(string, string) in this PR, String(2)
is equivalent to Coercible(string, string)
in my mind, so I don't quite follow the change at all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Signature::string(x)
all input args must be String
or Dictionary
with a String
value (Unchanged in this PR). For Signature::Coercible(string)
, the input args must be able to cast to a String
(dictated via default_cast_for
).
@@ -55,7 +58,10 @@ impl Default for BitLengthFunc { | |||
impl BitLengthFunc { | |||
pub fn new() -> Self { | |||
Self { | |||
signature: Signature::string(1, Volatility::Immutable), | |||
signature: Signature::coercible( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a minor comment this new signature is quite a bit of a mouthful compared to Signature::string
It isn't clear to me from reading the code / comments when one would prefer to use one over the other (not related to this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See here:
#14268 (comment)
But yes I agree, the documentation should be clearer. I had to dig into the code to get a solid understanding of the behavior.
Which issue does this PR close?
Closes #14230
Rationale for this change
A bug was introduced in DataFusion v43.0.0 that affects type coercion for UDF arguments. Sail's tests uncovered several of these regressions, which required explicit casting in multiple areas as a workaround during the upgrade to DataFusion 43.0.0.
The regressions identified by Sail's tests include the following functions:
ascii
bit_length
contains
ends_with
starts_with
octet_length
Upon digging into the code, I discovered the following:
Signature::coercible
.Signature::coercible
was incomplete. Coercion would only happen iflogical_type == target_type
,logical_type == NativeType::Null
, ortarget_type.is_integer() && logical_type.is_integer()
.What changes are included in this PR?
Signature::coercible
and port the relevant tests from Sail.TypeSignature::Coercible
.Are these changes tested?
Yes.
Are there any user-facing changes?