-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[EPIC] Decouple logical from physical types #12622
Comments
I would like to organise a call so we can discuss the plan of action of this epic (as proposed #12536 (review) and in the ASF slack). In the meantime I'll try to collect as many open questions as possible and outline a meeting document. cc @alamb , @findepi , @jayzhan211 , @ozankabak and anyone else interested in this effort. |
I suggest to propose the plan directly, since it requires thinking to response, maybe not that efficiently to work in a call synchronously. And it would be more open to random people in the community. |
We should make it a goal that physical planning also abstracts over physical representation of individual batches. We should also make it a goal that function are expressible directly on logical types. Otherwise we won't be able to use them in logical planning. While function invocation can continue to work on physical representation, it does not necessarily have to be function implementor's burden to cater for them. See #12635 |
While I support both ideas (one of them was also mentioned in the original proposal) I think that there is still some discussion to be had before making them goals of this effort (#11513 (comment), #11513 (comment), and #11513 (comment)). I suggest we first define a plan (I'm drafting it now) for decoupling logical types from physical types (including the issue with functions that you are mentioning) and in parallel we can continue to validate this two ideas. |
Thanks @notfilippo . I understand it's not your goal to remove Arrow types from physical plans. |
What follows is my idea of the plan we might want to take to tackle this issue. Status and goalsCurrent status
#11513 initial goal
"Record batches as the physical type source" goal
How do we get there?
Introducing logical type limitations internally while keeping inputs and outputs the same should be helpful in order not to introduce too many breaking changes at once. At a high level, we could keep the type sources unchanged and just gradually limit the internal behaviour of the engine in order to reason about logical types instead of physical. Towards a Logical LogicalPlanThe
|
This seems like a good first step to me (and also a massive project in itself)
I think it is wise to split this out into its own second goal (as it can scope down the first part). It is probably good to note that this means we won't have "adaptive schema" in DataFusion at runtime until the second part is complete
I think UDFs will need to retain physical information (at least in All in all this sounds like an epic project. I hope we have the people who are excited to help make it happen! |
I've opened #12793 in order to continue the effort according to the plan. cc @jayzhan211 @findepi |
I can work on the logical type that replace current function signature on |
@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature? |
Sure |
@notfilippo this (#12622 (comment)) is a very nice graphical representation of what we want to do. thank you
@notfilippo awesome! |
Can a committer merge #13016 on |
Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge? |
It would be great! I think we need to rebase the branch first then remove scalarvlaue utf8view and largeutf8 |
Sounds good! I've started updating this branch to the main branch on my fork. I still need to iron out some issues, but I hope to create a PR tomorrow. |
Merging the branch with the upstream is quite a project as there are many merge conflicts and a lot of incompatible code was created in the meantime (on the other hand, it's good to see so much progress on DataFusion). As I understand it, the new Unfortunately, using Even this may be acceptable if the resulting solution is stable for the foreseeable future when considering the possible impact of logical types. However, being unable to match scalar types and values will undoubtedly impact ergonomics for some use cases. Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided. Therefore, are we sure that we require the I also experimented with adding a new variant
Adopting this strategy would mean that we cannot release logical types until we can get rid of the Any thoughts on that? |
I think downstream project are not all forced to switch to
The following is equivalent ScalarValue::Utf8View
Scalar {
ScalarValue::Utf8(String),
DataType::Utf8View
} ScalarValue::LargeUtf8
Scalar {
ScalarValue::Utf8(String),
DataType::LargeUtf8
}
I agree, If we can find such approach it would be great. |
Sorry I wasn't clear on this. I meant that matching directly on let take_idx = match &args[2] {
ColumnarValue::Scalar(ScalarValue::Int64(Some(v))) if v < &2 => *v as usize,
_ => unreachable!(),
}; becomes let take_idx = match &args[2] {
ColumnarValue::Scalar(scalar) => match scalar.value() {
ScalarValue::Int64(Some(v)) if v < &2 => *v as usize,
_ => unreachable!(),
},
_ => unreachable!(),
} And this isn't a deal breaker for me as one may argue that distinguishing between scalar/non-scalar and scalar types are two pair of shoes and should be handled separately, but at least in some cases this will impact ergonomics (e.g., tests). Maybe this also isn't such a big deal if downstream projects are not using these types of patterns.
I also don't know a good solution and maybe EDIT: |
Pattern matching is not impossible match value {
ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Int64(_)) && scalar.as_i64() > &2 => {
}
ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Utf8(_)) && scalar.as_string().as_str() == "datafusion" => {
}
} pub fn as_i64_opt(&self) -> Option<&i64> {
match self.value() {
ScalarValue::Int64(v) => v.as_ref(),
_ => None,
}
}
pub fn as_i64(&self) -> &i64 {
self.as_i64_opt().unwrap()
}
pub fn as_string_opt(&self) -> Option<&String> {
match self.value() {
ScalarValue::Utf8(v) => v.as_ref(),
_ => None,
}
}
pub fn as_string(&self) -> &String {
self.as_string_opt().unwrap()
} |
Is your feature request related to a problem or challenge?
This epic tracks an ordered list tasks related to the proposal: Decouple logical from physical types (#11513). The goal is:
Describe the solution you'd like
Make
ScalarValue
values logical:Scalar
type for ColumnarValue #12536 (merged inlogical-types
)logical-types
)ScalarValue::LargeUtf8
andScalarValue::Utf8View
in favour ofScalarValue::Utf8
ScalarValue::LargeBinary
andScalarValue::BinaryView
in favour ofScalarValue::Binary
ScalarValue::Dictionary
(from Remove ScalarValue::Dictionary #12488)The text was updated successfully, but these errors were encountered: