-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update the CONCAT scalar function to support Utf8View #12224
Merged
alamb
merged 18 commits into
apache:main
from
devanbenz:feat/11836-update-concat-utf8view
Sep 3, 2024
Merged
Changes from 8 commits
Commits
Show all changes
18 commits
Select commit
Hold shift + click to select a range
723ceb7
wip
devanbenz f7abdd5
feat: Update the CONCAT scalar function to support Utf8View
devanbenz 503d5b9
fmt
devanbenz b30330d
fmt and add default return type for concat
devanbenz 76d6b5f
fix clippy lint
devanbenz 9798ed3
fmt
devanbenz 91d04ff
add more tests for sqllogic
devanbenz 6d28927
make sure no casting with LargeUtf8
devanbenz 769d99d
fixing utf8large
devanbenz ac30a83
fix large utf8
devanbenz 7ea6e0a
fix large utf8
devanbenz dd3ad39
add test
devanbenz 504459c
fmt
devanbenz e081934
make it so Utf8View just returns Utf8
devanbenz 0069c1a
wip -- trying to build a stringview with columnar refs
devanbenz 0929a4b
built stringview builder but it does allocate a new String each iter :(
devanbenz f16de44
add some testing
devanbenz d0bf3ba
clippy
devanbenz File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -768,17 +768,26 @@ logical_plan | |
01)Projection: character_length(test.column1_utf8view) AS l | ||
02)--TableScan: test projection=[column1_utf8view] | ||
|
||
## Ensure no casts for CONCAT | ||
## TODO https://github.com/apache/datafusion/issues/11836 | ||
## Ensure no casts for CONCAT Utf8View | ||
query TT | ||
EXPLAIN SELECT | ||
concat(column1_utf8view, column2_utf8view) as c | ||
FROM test; | ||
---- | ||
logical_plan | ||
01)Projection: concat(CAST(test.column1_utf8view AS Utf8), CAST(test.column2_utf8view AS Utf8)) AS c | ||
01)Projection: concat(test.column1_utf8view, test.column2_utf8view) AS c | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 💪 |
||
02)--TableScan: test projection=[column1_utf8view, column2_utf8view] | ||
|
||
## Ensure no casts for CONCAT LargeUtf8 | ||
query TT | ||
EXPLAIN SELECT | ||
concat(column1_large_utf8, column2_large_utf8) as c | ||
FROM test; | ||
---- | ||
logical_plan | ||
01)Projection: concat(test.column1_large_utf8, test.column2_large_utf8) AS c | ||
02)--TableScan: test projection=[column1_large_utf8, column2_large_utf8] | ||
|
||
## Ensure no casts for CONCAT_WS | ||
## TODO https://github.com/apache/datafusion/issues/11837 | ||
query TT | ||
|
@@ -863,6 +872,39 @@ XIANGPENG | |
RAPHAEL | ||
NULL | ||
|
||
## Should run CONCAT successfully | ||
query T | ||
SELECT | ||
concat(column1_utf8view, column2_utf8view) as c | ||
FROM test; | ||
---- | ||
AndrewX | ||
XiangpengXiangpeng | ||
RaphaelR | ||
R | ||
|
||
## Should run CONCAT successfully with utf8 and utf8view | ||
query T | ||
SELECT | ||
concat(column1_utf8view, column2_utf8) as c | ||
FROM test; | ||
---- | ||
AndrewX | ||
XiangpengXiangpeng | ||
RaphaelR | ||
R | ||
|
||
## Should run CONCAT successfully with utf8 utf8view and largeutf8 | ||
query T | ||
SELECT | ||
concat(column1_utf8view, column2_utf8, column2_large_utf8) as c | ||
FROM test; | ||
---- | ||
AndrewXX | ||
XiangpengXiangpengXiangpeng | ||
RaphaelRR | ||
RR | ||
|
||
## Ensure no casts for LPAD | ||
query TT | ||
EXPLAIN SELECT | ||
|
@@ -1307,3 +1349,4 @@ select column2|| ' ' ||column3 from temp; | |
---- | ||
rust fast | ||
datafusion cool | ||
|
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the logic seems to assume all arguments are of the same type?
also, why not always return
Utf8
?the code performing actual concatenation seems to be always the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah @findepi I think the logic is "Whatever the first argument type is the output should be of that type" so if the received values were: Utf8, Utf8View the output would be Utf8. I'm taking the logic from other UDFs and applying it here. It may not be the best way of doing this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand this is what's implemented. but not sure why it is so.
what's the exact benefit of presenting the data as string view, if we computed the exact string anyway, and we technically don't need to represent it as a string view?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay. So what you're saying is that here: https://github.com/apache/datafusion/pull/12224/files#diff-71970189679c6dd5b3b677bb21603234b488e68d1601be9c4d400d40e430a909R204 I'm building a Utf8 string anyways?
So I suspect I should change that bit of code to use a StringViewArrayBuilder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that's my intuition
except that i would keep building StringArray and just declare return type as Utf8 always
from the issue #11836
this is about inputs to the function, not the return type
Side note:
String view could be an interesting return type if we wanted to optimize for single non-null string view input and let it pass-through; but the code doesn't do this today, not sure it's worth implementing for this edge case and it should be independent of arguments order, ie not tied to the first input's type.
end of side note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi what about for LargeUtf8? I suspect that if a LargeUtf8 is the input then the output should also be that since its an i64 datatype vs the i32 datatype for Utf8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't know the exact rules for how we handle LargeUtf8.
i focused on the string view portion of the PR. from "string views perspective", LargeUtf8 is non-issue, so IMO it's fine not to change the return type with respect to LargeUtf8 in this PR. but i agree that we probably should return LargeUtf8 when any input is LargeUtf8 (pr what exactly the logic should be).
in fact, what does the binary concat operator do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@findepi when you say binary concat operator are you talking about
||
as an operator?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
the question still holds (why exactly we bias towards the first param type), but i am no longer convinced about my suggestion to use Utf8 always.
i think we should "just" make sure
concat(a, b, c)
is type-equivalent toa || b || c
.the
||
's logic apparently iscc @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good @findepi i like that logic. I can adjust to make it so.