Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Type Coercion for UDF Arguments #14268

Open
wants to merge 29 commits into
base: main
Choose a base branch
from

Conversation

shehabgamin
Copy link
Contributor

@shehabgamin shehabgamin commented Jan 24, 2025

Which issue does this PR close?

Closes #14230

Rationale for this change

A bug was introduced in DataFusion v43.0.0 that affects type coercion for UDF arguments. Sail's tests uncovered several of these regressions, which required explicit casting in multiple areas as a workaround during the upgrade to DataFusion 43.0.0.

The regressions identified by Sail's tests include the following functions:

  1. ascii
  2. bit_length
  3. contains
  4. ends_with
  5. starts_with
  6. octet_length

Upon digging into the code, I discovered the following:

  1. The above functions didn't have Signature::coercible.
  2. Signature::coercible was incomplete. Coercion would only happen if logical_type == target_type, logical_type == NativeType::Null, or target_type.is_integer() && logical_type.is_integer().
    1. It seems like this was unintentional because the doc comments specified behavior that was not consistent with the logic implemented:

      For example, Coercible(vec![logical_float64()]) accepts arguments like vec![Int32] or vec![Float32] since i32 and f32 can be cast to f64.

What changes are included in this PR?

  1. Update functions from the list above to have Signature::coercible and port the relevant tests from Sail.
  2. Fix TypeSignature::Coercible.

Are these changes tested?

Yes.

Are there any user-facing changes?

signature: Signature::one_of(
vec![
TypeSignature::String(1),
TypeSignature::Coercible(vec![TypeSignatureClass::Native(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use coercible(string), we don't need string since it is a more strict rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I initially did, but after testing on Sail, I discovered new test failures related to coercing input that's all String (e.g. func(Utf8, Utf8View)).

The plan is to port all the relevant tests from Sail into this PR!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although it was not my intention to apply this pattern on single arg functions. I'll get that fixed!

Copy link
Contributor

@jayzhan211 jayzhan211 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current design for coercion may still have room for improvement. It would be beneficial to represent the function signature in a simpler and more concise manner, rather than relying on complex combinations of multiple, similar signatures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! I'll add that in.

@shehabgamin shehabgamin changed the title Fix DF 43 Regression: Coerce Various Scalar Func Args to String Fix DF 43 Coercion Regression: Improve TypeSignature::Coercible Jan 25, 2025
@github-actions github-actions bot added the common Related to common crate label Jan 25, 2025
@shehabgamin shehabgamin changed the title Fix DF 43 Coercion Regression: Improve TypeSignature::Coercible Fix DF 43 Type Coercion Regression: Improve TypeSignature::Coercible Jan 25, 2025
@shehabgamin shehabgamin changed the title Fix DF 43 Type Coercion Regression: Improve TypeSignature::Coercible Fix Type Coercion for UDF Arguments Jan 25, 2025
@shehabgamin shehabgamin mentioned this pull request Jan 25, 2025
29 tasks
@github-actions github-actions bot added the optimizer Optimizer rules label Jan 25, 2025
}

#[test]
fn test_ascii_expr() -> Result<()> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have preferred to place the various UDF tests within their respective files, but I couldn't due to circular dependencies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ending up putting the tests in the .slt file, but figured we can still leave this test here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this test if the purpose of the test is covered already in slt

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 26, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 26, 2025
@shehabgamin shehabgamin marked this pull request as ready for review January 26, 2025 09:45

/// Recursively traverses [`DataType::Dictionary`] to get the underlying value [`DataType`].
/// For non-dictionary types, returns the default [`DataType`].
fn base_dictionary_type_or_default_type(data_type: &DataType) -> DataType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add dictionary support to datafusion/common/src/utils/mod.rs base_type

50

query I
SELECT ascii(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DuckDB, this doesn't work

D select ascii(2);
Binder Error: No function matches the given name and argument types 'ascii(INTEGER_LITERAL)'. You might need to add explicit type casts.
	Candidate functions:
	ascii(VARCHAR) -> INTEGER

LINE 1: select ascii(2);
               ^
D select ascii('2');
┌────────────┐
│ ascii('2') │
│   int32    │
├────────────┤
│         50 │
└────────────┘

So does Postgres

postgres=# select ascii(2);
ERROR:  function ascii(integer) does not exist
LINE 1: select ascii(2);
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
postgres=# select ascii('2');
 ascii 
-------
    50
(1 row)

This used to work in DataFusion, but it is not consistent with other systems. I'm not sure whether we should introduce this behaviour back

Copy link
Contributor Author

@shehabgamin shehabgamin Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is consistent with Spark and the tests also come from the Spark codebase (which we ported to Sail then here lol).

spark-sql (default)> SELECT ASCII(2);
50
Time taken: 0.952 seconds, Fetched 1 row(s)
spark-sql (default)> SELECT ASCII('2');
50
Time taken: 0.044 seconds, Fetched 1 row(s)
spark-sql (default)>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this belongs to Spark-functions crate instead of datafusion core?

@@ -571,8 +601,10 @@ select repeat('-1.2', arrow_cast(3, 'Int32'));
----
-1.2-1.2-1.2

query error DataFusion error: Error during planning: Internal error: Expect TypeSignatureClass::Native\(LogicalType\(Native\(Int64\), Int64\)\) but received Float64
query T
select repeat('-1.2', 3.2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be error, the 2nd argument should be integer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the 2nd arg should be TypeSignatureClass::Integer instead of TypeSignatureClass::Native(logical_int64())? It's currently unimplemented with a TODO:

// TODO:
// Numeric
// Integer

Otherwise, it would seem strange to deviate from the behavior specified in the implementation of default_cast_forfor NativeType.

I do think that we should prioritize being permissive and general-purpose though as I was discussing here: #14296 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think non-integer 2nd arg should return error

}

#[test]
fn test_ascii_expr() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this test if the purpose of the test is covered already in slt

@@ -584,23 +544,7 @@ fn get_valid_types(
match target_type_class {
TypeSignatureClass::Native(native_type) => {
let target_type = native_type.native();
if &logical_type == target_type {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean others function that used Coercible String now also cast integer to string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's TypeSignature::Coercible with a TypeSignatureClass::Native(logical_string()), then yes. Any function that specifies TypeSignature::Coercible with a TypeSignatureClass::Native should coerce according to the behavior implemented in the default_cast_for function for NativeType in order to be consistent.

Copy link
Contributor

@jayzhan211 jayzhan211 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeSignature::Coercible is designed to cast between the same logical type, but now it casts to any kind of type, I don't think this is ideal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, in this PR, SELECT bit_length(12); now return 16 instead of error, but I think we should error. Any unexpected type is valid which doesn't seem correct

Copy link
Contributor Author

@shehabgamin shehabgamin Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeSignature::Coercible is designed to cast between the same logical type, but now it casts to any kind of type, I don't think this is ideal.

This is what the doc comments say:

/// One or more arguments belonging to the [`TypeSignatureClass`], in order.
///
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
///
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
Coercible(Vec<TypeSignatureClass>),

Link:

/// One or more arguments belonging to the [`TypeSignatureClass`], in order.
///
/// For example, `Coercible(vec![logical_float64()])` accepts
/// arguments like `vec![Int32]` or `vec![Float32]`
/// since i32 and f32 can be cast to f64
///
/// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
Coercible(Vec<TypeSignatureClass>),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shehabgamin
I suggest with revert to the previous change and add binary->string casting given most of the string function allow this kind of casting, but not int->string like ascii(2)

For Signature::String, we can make it a convenient wrapper of Coercible(string, string) or deprecate it.

Copy link
Contributor

@jayzhan211 jayzhan211 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a rule of thumb, stick to either Postgres or DuckDB for functions, and avoid overly broad type casting that allows any type to convert to any other. This can turn invalid queries into valid ones, making errors harder to catch, especially when tests don’t cover all cases.

Based on that,

  1. Cast from Binary to String 👍🏻
  2. Cast from Numeric to String 👎🏻 (At least for the string functions changed in this PR)
  3. Cast from Float to Integer 👎🏻 (At least not for repeat)

@github-actions github-actions bot removed the physical-expr Physical Expressions label Jan 27, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @shehabgamin for this contribution.

Thank you to @jayzhan211 -- from my perspective this PR is well written and tested and fixes some regressions

Unless there are any concerns, I'll plan to merge this tomorrow.

@@ -59,7 +64,13 @@ impl Default for ContainsFunc {
impl ContainsFunc {
pub fn new() -> Self {
Self {
signature: Signature::string(2, Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 / @notfilippo should we deprecate Signature::string and direct people to Signature::coerceable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I'm quite confuse why we need to change to coercible(string, string) in this PR, String(2) is equivalent to Coercible(string, string) in my mind, so I don't quite follow the change at all

Copy link
Contributor Author

@shehabgamin shehabgamin Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Signature::string(x) all input args must be String or Dictionary with a String value (Unchanged in this PR). For Signature::Coercible(string), the input args must be able to cast to a String (dictated via default_cast_for).

@@ -55,7 +58,10 @@ impl Default for BitLengthFunc {
impl BitLengthFunc {
pub fn new() -> Self {
Self {
signature: Signature::string(1, Volatility::Immutable),
signature: Signature::coercible(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a minor comment this new signature is quite a bit of a mouthful compared to Signature::string

It isn't clear to me from reading the code / comments when one would prefer to use one over the other (not related to this PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See here:
#14268 (comment)

But yes I agree, the documentation should be clearer. I had to dig into the code to get a solid understanding of the behavior.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate documentation Improvements or additions to documentation functions logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFusion Regression (Starting in v43): Type Coercion for UDF Arguments
3 participants