Add support for Utf8View to string_to_array and array_to_string #13403

Merged
7 commits merged on Nov 21, 2024

Conversation

Omega359
Contributor

Which issue does this PR close?

Closes #13383

Rationale for this change

Completing support for Utf8View in all functions.

What changes are included in this PR?

Code, tests

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 14, 2024
@Omega359 Omega359 marked this pull request as ready for review November 14, 2024 02:44
datafusion/functions-nested/src/string.rs (outdated, resolved)
datafusion/functions-nested/src/string.rs (outdated, resolved)
Contributor

@jayzhan211 jayzhan211 left a comment

👍

@Omega359
Contributor Author

Do not merge: this PR does not properly handle LargeUtf8, and I want to refactor the code to reduce the number of lines.

@alamb alamb marked this pull request as draft November 18, 2024 20:21
@alamb
Contributor

alamb commented Nov 18, 2024

Converting to draft while @Omega359 makes the changes described in #13403 (comment)

@Omega359 Omega359 marked this pull request as ready for review November 20, 2024 13:55
@Omega359
Contributor Author

Ready for review again.

Contributor

@alamb alamb left a comment

Seems like an improvement in functionality to me -- thank you @Omega359

I do wonder if we really need this level of specialization, but I also think since this is clearly better than what is on main we can/should merge it in


let list_array = list_builder.finish();
Ok(Arc::new(list_array) as ArrayRef)
}

trait StringArrayBuilderType: ArrayBuilder {
Contributor

this is an interesting idea -- maybe something worth looking into making more visible somehow / reusing to reduce duplication

Contributor Author

I really wish Rust had enum variants as types - with that I could really reduce a lot of this code. One of the reasons this took so long is that I was looking for a clean and fast way to handle the string types. I failed, unfortunately - what I came up with is similar to https://github.com/Omega359/arrow-datafusion/blob/38d9bdde69381b4a7516346f862310700b6eb96e/datafusion/functions/src/utils.rs#L192 but that just has more overhead than is acceptable :(
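
For readers unfamiliar with the pattern under discussion, here is a minimal, standard-library-only sketch of a trait that lets one generic function fill different builder types. `StringBuilderLike`, `OffsetBuilder`, `OwnedBuilder`, and `split_into` are hypothetical names invented for illustration; they are not the arrow-rs builders, and this is not the `StringArrayBuilderType` trait from this PR.

```rust
/// Hypothetical stand-in for a string builder abstraction (NOT the arrow-rs API).
trait StringBuilderLike: Default {
    /// Append a non-null string value.
    fn append_value(&mut self, value: &str);
    /// Append a null entry.
    fn append_null(&mut self);
    /// Number of entries (null or not) appended so far.
    fn len(&self) -> usize;
}

/// Toy builder that packs all values into one buffer and records end offsets.
#[derive(Default)]
struct OffsetBuilder {
    data: String,
    end_offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl StringBuilderLike for OffsetBuilder {
    fn append_value(&mut self, value: &str) {
        self.data.push_str(value);
        self.end_offsets.push(self.data.len());
        self.validity.push(true);
    }
    fn append_null(&mut self) {
        // A null occupies no bytes, so the end offset repeats.
        self.end_offsets.push(self.data.len());
        self.validity.push(false);
    }
    fn len(&self) -> usize {
        self.end_offsets.len()
    }
}

/// Toy builder that stores every value as its own allocation.
#[derive(Default)]
struct OwnedBuilder {
    values: Vec<Option<String>>,
}

impl StringBuilderLike for OwnedBuilder {
    fn append_value(&mut self, value: &str) {
        self.values.push(Some(value.to_string()));
    }
    fn append_null(&mut self) {
        self.values.push(None);
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

/// One generic body; the compiler emits a specialized copy per builder type used.
fn split_into<B: StringBuilderLike>(input: Option<&str>, delimiter: &str) -> B {
    let mut builder = B::default();
    match input {
        Some(s) => {
            for part in s.split(delimiter) {
                builder.append_value(part);
            }
        }
        None => builder.append_null(),
    }
    builder
}
```

Calling `split_into::<OffsetBuilder>(..)` and `split_into::<OwnedBuilder>(..)` produces two separately compiled functions. That avoids dynamic dispatch per value, at the cost of more generated code, which is the trade-off raised in this thread.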

@@ -6916,6 +6940,79 @@ select string_to_array(e, ',') from values;
[adipiscing]
NULL

# karge string tests for string_to_array
Contributor

Suggested change
- # karge string tests for string_to_array
+ # large string tests for string_to_array

}
}

fn string_to_array_inner_2<'a, StringArrType, StringBuilderType>(
Contributor

inner_2 😆


match args.len() {
2 => {
fn string_to_array_inner_3<'a, StringArrType, DelimiterArrType, StringBuilderType>(
Contributor

If I am reading this correctly, this function effectively handles all possible combinations of Utf8, Utf8View and LargeUtf8 by generating specialized implementations for each combination -- so I think that means 3 * 3 = 9 copies of the function

I wonder if it is common to mix all the types 🤔
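
To make the 3 * 3 count concrete, here is a rough standard-library-only sketch of how a function generic over two string encodings ends up with nine instantiations once a runtime dispatcher names every pair. All names (`StrKind`, `Encoding`, `kernel`, `dispatch`) are hypothetical and chosen for illustration; the real code in this PR is generic over Arrow array and builder types instead.

```rust
/// Hypothetical runtime tag for the three string encodings discussed here.
#[derive(Clone, Copy, Debug, PartialEq)]
enum StrKind {
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// Compile-time marker for an encoding (illustrative only).
trait Encoding {
    const KIND: StrKind;
}

struct Utf8;
struct LargeUtf8;
struct Utf8View;

impl Encoding for Utf8 {
    const KIND: StrKind = StrKind::Utf8;
}
impl Encoding for LargeUtf8 {
    const KIND: StrKind = StrKind::LargeUtf8;
}
impl Encoding for Utf8View {
    const KIND: StrKind = StrKind::Utf8View;
}

/// Generic kernel: one compiled copy per (S, D) pair that is actually named.
/// It only reports which specialization ran.
fn kernel<S: Encoding, D: Encoding>() -> (StrKind, StrKind) {
    (S::KIND, D::KIND)
}

/// Second-level dispatch: the string encoding is fixed, pick the delimiter encoding.
fn dispatch_delimiter<S: Encoding>(delimiter: StrKind) -> (StrKind, StrKind) {
    match delimiter {
        StrKind::Utf8 => kernel::<S, Utf8>(),
        StrKind::LargeUtf8 => kernel::<S, LargeUtf8>(),
        StrKind::Utf8View => kernel::<S, Utf8View>(),
    }
}

/// First-level dispatch: 3 string arms, each with 3 delimiter arms, so 9 kernels.
fn dispatch(string: StrKind, delimiter: StrKind) -> (StrKind, StrKind) {
    match string {
        StrKind::Utf8 => dispatch_delimiter::<Utf8>(delimiter),
        StrKind::LargeUtf8 => dispatch_delimiter::<LargeUtf8>(delimiter),
        StrKind::Utf8View => dispatch_delimiter::<Utf8View>(delimiter),
    }
}
```

One possible alternative, not something this PR does, would be to cast the delimiter column to a single encoding before dispatching, which would leave three kernels in exchange for a conversion step.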

@Omega359
Contributor Author

> I do wonder if we really need this level of specialization, but I also think since this is clearly better than what is on main we can/should merge it in

That is a very good point @alamb, and one that I don't have an answer to. Are there any guidelines anywhere as to what should be covered and what shouldn't? From my reading of the functions, it's all over the place with no consistency at all.

@alamb
Contributor

alamb commented Nov 21, 2024

> That is a very good point @alamb, and one that I don't have an answer to. Are there any guidelines anywhere as to what should be covered and what shouldn't? From my reading of the functions, it's all over the place with no consistency at all.

I agree -- there is no guideline that I know of. Any chance you would be willing to propose one?

@alamb
Contributor

alamb commented Nov 21, 2024

Let's go with this and iterate on improvements going forward

@alamb alamb merged commit dd4fa79 into apache:main Nov 21, 2024
25 checks passed
@Omega359
Contributor Author

> I agree -- there is no guideline that I know of. Any chance you would be willing to propose one?

I'll have to think about it tbh @alamb. Part of that is knowing the cost of going from Utf8 -> Utf8View and from Utf8View -> Utf8. I remember seeing a comment about it somewhere, but my search-fu isn't advanced enough to find it again.
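
As a rough way to reason about the Utf8 -> Utf8View direction, the sketch below builds 16-byte views from an offsets-plus-data layout, following the view layout in the Arrow columnar format (4-byte length, then either up to 12 inline bytes or a 4-byte prefix, buffer index, and offset). `make_view` and `offsets_to_views` are hypothetical helpers written for this illustration, a single data buffer with index 0 is assumed, and nothing here measures or reproduces what arrow-rs actually does.

```rust
/// Build one 16-byte view for `data[offset..offset + len]`.
/// Assumption: all values live in a single data buffer with index 0.
fn make_view(data: &[u8], offset: u32, len: u32) -> [u8; 16] {
    let start = offset as usize;
    let value = &data[start..start + len as usize];
    let mut view = [0u8; 16];
    // Bytes 0..4: length of the value.
    view[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Short values are stored inline; the remaining bytes stay zero.
        view[4..4 + value.len()].copy_from_slice(value);
    } else {
        // Long values keep a 4-byte prefix, then buffer index and offset.
        view[4..8].copy_from_slice(&value[..4]);
        view[8..12].copy_from_slice(&0u32.to_le_bytes());
        view[12..16].copy_from_slice(&offset.to_le_bytes());
    }
    view
}

/// Turn `n + 1` offsets into `n` views in a single pass over the offsets.
/// Long values are referenced by offset here, not copied.
fn offsets_to_views(data: &[u8], offsets: &[u32]) -> Vec<[u8; 16]> {
    offsets
        .windows(2)
        .map(|w| make_view(data, w[0], w[1] - w[0]))
        .collect()
}
```

Under these assumptions, building views is one pass that touches at most 12 bytes per value. The reverse direction has to write every value into one contiguous buffer, so it copies all the string bytes. Whether that difference matters in practice is exactly the open question above.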

Labels
sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Add support for Utf8View to array_to_string, string_to_array functions
4 participants