Add support for Utf8View to string_to_array and array_to_string #13403
# Conflicts: # datafusion/sqllogictest/test_files/string/string_view.slt
👍
Do not merge - this PR does not properly handle LargeUtf8, and I want to try to refactor the code to reduce the number of lines of code.
Converting to draft while @Omega359 makes the changes described in #13403 (comment)
Ready for review again.
Seems like an improvement in functionality to me -- thank you @Omega359
I do wonder if we really need this level of specialization, but since this is clearly better than what is on main, I think we can/should merge it in.
let list_array = list_builder.finish();
Ok(Arc::new(list_array) as ArrayRef)
}

trait StringArrayBuilderType: ArrayBuilder {
this is an interesting idea -- maybe something worth looking into making more visible somehow / reusing to reduce duplication
I really wish Rust had enum variants as types - with that I could really reduce a lot of this code. One of the reasons this took so long is that I was looking for a clean and fast way to handle the string types in a super nice way. I failed unfortunately - what I came up with is similar to https://github.com/Omega359/arrow-datafusion/blob/38d9bdde69381b4a7516346f862310700b6eb96e/datafusion/functions/src/utils.rs#L192 but that just has more overhead than is acceptable :(
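For readers following along, here is a minimal std-only sketch of the builder-trait idea discussed in this thread. It is not the PR's code: `StringBuilderLike`, `PlainBuilder`, `ViewLikeBuilder`, and `split_into` are hypothetical names, and the two builders are simplified stand-ins rather than the arrow builders. It only illustrates how one generic routine can be written once and specialized per builder type.

```rust
// Hypothetical stand-ins for string builders; none of these types exist in
// arrow or DataFusion. They only model "append a value, then finish".
trait StringBuilderLike {
    fn append_value(&mut self, value: &str);
    fn finish(self) -> Vec<String>;
}

// Stores each value as its own owned string.
#[derive(Default)]
struct PlainBuilder {
    values: Vec<String>,
}

// Stores one contiguous buffer plus (offset, len) pairs pointing into it.
#[derive(Default)]
struct ViewLikeBuilder {
    buffer: String,
    views: Vec<(usize, usize)>,
}

impl StringBuilderLike for PlainBuilder {
    fn append_value(&mut self, value: &str) {
        self.values.push(value.to_string());
    }

    fn finish(self) -> Vec<String> {
        self.values
    }
}

impl StringBuilderLike for ViewLikeBuilder {
    fn append_value(&mut self, value: &str) {
        let start = self.buffer.len();
        self.buffer.push_str(value);
        self.views.push((start, value.len()));
    }

    fn finish(self) -> Vec<String> {
        self.views
            .iter()
            .map(|&(start, len)| self.buffer[start..start + len].to_string())
            .collect()
    }
}

// Written once; the compiler emits a specialized copy for each builder type
// it is called with, so there is no dynamic dispatch in the loop.
fn split_into<B: StringBuilderLike + Default>(input: &str, delimiter: &str) -> Vec<String> {
    let mut builder = B::default();
    for part in input.split(delimiter) {
        builder.append_value(part);
    }
    builder.finish()
}
```

Calling `split_into::<PlainBuilder>("a,b,c", ",")` and `split_into::<ViewLikeBuilder>("a,b,c", ",")` runs the same splitting logic against two different builders, which is the kind of reuse the comment above is pointing at.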
@@ -6916,6 +6940,79 @@ select string_to_array(e, ',') from values;
[adipiscing]
NULL

# karge string tests for string_to_array
Suggested change:
- # karge string tests for string_to_array
+ # large string tests for string_to_array
}
}

fn string_to_array_inner_2<'a, StringArrType, StringBuilderType>(
inner_2 😆
match args.len() {
2 => {
fn string_to_array_inner_3<'a, StringArrType, DelimiterArrType, StringBuilderType>(
If I am reading this correctly, this function effectively handles all possible combinations of Utf8, Utf8View and LargeUtf8 by generating specialized implementations for each combination -- so I think that means 3 * 3 = 9 copies of the function
I wonder if it is common to mix all the types 🤔
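To make the 3 * 3 = 9 count concrete, here is a small std-only sketch. It is an illustration only: `StringKind`, `Kind`, `split_impl`, and `dispatch` are hypothetical names, and the marker structs are not the arrow array types. A function generic over both the string type and the delimiter type needs one match arm, and gets one compiled copy, per pair.

```rust
// Hypothetical marker types standing in for the three string array kinds;
// they are not the arrow array types.
trait StringKind {
    const NAME: &'static str;
}

struct Utf8;
struct LargeUtf8;
struct Utf8View;

impl StringKind for Utf8 {
    const NAME: &'static str = "Utf8";
}
impl StringKind for LargeUtf8 {
    const NAME: &'static str = "LargeUtf8";
}
impl StringKind for Utf8View {
    const NAME: &'static str = "Utf8View";
}

// Runtime tag, playing the role of an array's data type.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Utf8,
    LargeUtf8,
    Utf8View,
}

// Generic over both the string type and the delimiter type. The compiler
// generates a separate copy for every (S, D) pair that is instantiated.
fn split_impl<S: StringKind, D: StringKind>() -> String {
    format!("split_impl<{}, {}>", S::NAME, D::NAME)
}

// Runtime dispatch: three string kinds times three delimiter kinds gives
// nine arms, and therefore nine instantiations of `split_impl`.
fn dispatch(string: Kind, delimiter: Kind) -> String {
    match (string, delimiter) {
        (Kind::Utf8, Kind::Utf8) => split_impl::<Utf8, Utf8>(),
        (Kind::Utf8, Kind::LargeUtf8) => split_impl::<Utf8, LargeUtf8>(),
        (Kind::Utf8, Kind::Utf8View) => split_impl::<Utf8, Utf8View>(),
        (Kind::LargeUtf8, Kind::Utf8) => split_impl::<LargeUtf8, Utf8>(),
        (Kind::LargeUtf8, Kind::LargeUtf8) => split_impl::<LargeUtf8, LargeUtf8>(),
        (Kind::LargeUtf8, Kind::Utf8View) => split_impl::<LargeUtf8, Utf8View>(),
        (Kind::Utf8View, Kind::Utf8) => split_impl::<Utf8View, Utf8>(),
        (Kind::Utf8View, Kind::LargeUtf8) => split_impl::<Utf8View, LargeUtf8>(),
        (Kind::Utf8View, Kind::Utf8View) => split_impl::<Utf8View, Utf8View>(),
    }
}
```

If mixed types turn out to be rare, one option would be to convert the delimiter to the string's type before dispatching, which would leave three arms instead of nine. Whether that trade is worthwhile depends on the conversion cost, which this sketch does not measure.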
That is a very good point @alamb and one that I don't have an answer to. Are there any guidelines anywhere as to what should be covered and what shouldn't? From my reading of the functions, it's all over the place with no consistency at all.
I agree -- there is no guideline that I know of. Any chance you would be willing to propose one?
Let's go with this and iterate on improvements going forward
I'll have to think about it tbh @alamb. Part of that is knowing the cost of going from Utf8 -> Utf8View and from Utf8View -> Utf8. I remember seeing a comment somewhere but my search-fu isn't advanced enough to find it again
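As background for the conversion-cost question, here is a rough std-only model of the two layouts. It is a sketch under assumptions: it follows the Arrow format's description of view arrays (values of at most 12 bytes are stored inline in the view, longer values reference a data buffer), but `View`, `offsets_to_views`, and `views_to_offsets` are hypothetical names, and it does not describe or measure what the arrow-rs cast kernels actually do.

```rust
// Simplified model of a string view. Per the Arrow format, values of at most
// 12 bytes live inline in the view; longer values point into a data buffer.
#[derive(Debug, PartialEq)]
enum View {
    Inline(String),
    Reference { offset: usize, len: usize },
}

const INLINE_LIMIT: usize = 12;

// Offsets layout -> views: long values can keep pointing at the existing
// data buffer, so only short values are copied (into their views).
fn offsets_to_views(data: &str, offsets: &[usize]) -> Vec<View> {
    offsets
        .windows(2)
        .map(|w| {
            let (start, end) = (w[0], w[1]);
            let len = end - start;
            if len <= INLINE_LIMIT {
                View::Inline(data[start..end].to_string())
            } else {
                View::Reference { offset: start, len }
            }
        })
        .collect()
}

// Views -> offsets layout: every value has to be copied into one new
// contiguous buffer, so the work grows with the total string bytes.
fn views_to_offsets(data: &str, views: &[View]) -> (String, Vec<usize>) {
    let mut out = String::new();
    let mut offsets: Vec<usize> = vec![0];
    for view in views {
        match view {
            View::Inline(s) => out.push_str(s),
            View::Reference { offset, len } => out.push_str(&data[*offset..*offset + *len]),
        }
        offsets.push(out.len());
    }
    (out, offsets)
}
```

In this model, going to views touches one small fixed-size entry per value, while going back to the offsets layout copies all string bytes. Real costs should be checked with a benchmark against the actual kernels.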
Which issue does this PR close?
Closes #13383
Rationale for this change
Completing support for Utf8View in all functions.
What changes are included in this PR?
Code, tests
Are these changes tested?
Yes
Are there any user-facing changes?
No