Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid cast numbers to string in filter #9671

Open
waynexia opened this issue Mar 18, 2024 · 4 comments
Open

Avoid cast numbers to string in filter #9671

waynexia opened this issue Mar 18, 2024 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@waynexia
Copy link
Member

waynexia commented Mar 18, 2024

Is your feature request related to a problem or challenge?

It's not always an expected behavior that casts a number to string in places like WHERE. E.g.:

Create a table with an integer and a string columns.

create table t as (values (1,'1'), (2,'2'),(3,'3'));

Behavior for queries like the following is clear, as the provided constant has the same type with column:

select * from t where column1 > 2;
select * from t where column2 > '2';

But when comparing column1 against a string, like '2' or 'a', type_coercion will add a cast to the column, resulting a plan like this:

❯ explain select * from t where column1 > '2';
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: CAST(t.column1 AS Utf8) > Utf8("2")       |
|               |   TableScan: t projection=[column1, column2]      |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: CAST(column1@0 AS Utf8) > 2         |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+

This behavior is a little confusing. It will not only slow the execution (filter pushdown in parquet won't work), but the result will also be surprising (12 < '2' because we are comparing two strings).

Describe the solution you'd like

Do not coerce everything to string when comparing another type with it. We should be careful when coercion happens across two type families (like integer to unsigned integer, or float to string etc.).

My proposal in this case is to consider the type of column first. E.g., try to coerce the string literal into the type of column. And throw an error this coercion fails. This can prevent us from executing an incorrect plan in silence. As user can always add their own specific CAST if they want to do string comparison.

This is also closer to PG's behavior:

postgres=# select 12 > '2';
 ?column? 
----------
 t
(1 row)

postgres=# select 12 > 'a';
ERROR:  invalid input syntax for type integer: "a"
LINE 1: select 12 > 'a';

Describe alternatives you've considered

MySQL would also try to cast string to integer. But it won't throw an error when this cast fails. Instead, it would failover the string value to the default number 0 which is more strange...

An alternative is when the cast fails, we fallback to upcast. I.e., select 12 > '2' would treat '2' as integer and select 12 > 'a' would cast 12 into '12'. This can reduce some breaking cases.

Additional context

String is the "biggest" type in coercion, but in this case I realize it's not always acceptable to cast something to string. I would start working on type_coercion rule. I'm not sure if there are any other places that have a similar issue.

And BTW this would be a breaking change.

@waynexia waynexia added the enhancement New feature or request label Mar 18, 2024
@waynexia waynexia self-assigned this Mar 18, 2024
@waynexia
Copy link
Member Author

PG won't do the cast even if the column type is string and literal is number:

postgres=# select * from t where column2 > 1;
ERROR:  operator does not exist: text > integer
LINE 1: select * from t where column2 > 1;
                                      ^
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.

@waynexia
Copy link
Member Author

Exprs like SELECT '1' IN ('a','b',1) from datafusion/sqllogictest/test_files/expr.slt would also fail

@alamb
Copy link
Contributor

alamb commented Mar 18, 2024

Given I suspect there is no one agreed upon "standard" for this particular behavior, perhaps we can add a configuration option to control if strings are coerced to numeric types 🤔

@waynexia
Copy link
Member Author

Yes, I'm afraid we have to provide both behaviors controlled by a configuration. Otherwise this breaks lots of things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants