Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EPIC] Decouple logical from physical types #12622

Open
3 of 6 tasks
notfilippo opened this issue Sep 25, 2024 · 22 comments
Open
3 of 6 tasks

[EPIC] Decouple logical from physical types #12622

notfilippo opened this issue Sep 25, 2024 · 22 comments
Labels
enhancement New feature or request

Comments

@notfilippo
Copy link
Contributor

notfilippo commented Sep 25, 2024

Is your feature request related to a problem or challenge?

This epic tracks an ordered list tasks related to the proposal: Decouple logical from physical types (#11513). The goal is:

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

LogicalPlans will use LogicalType while PhysicalPlans will use DataType.

Describe the solution you'd like

Make ScalarValue values logical:

@notfilippo notfilippo added the enhancement New feature or request label Sep 25, 2024
@notfilippo
Copy link
Contributor Author

notfilippo commented Sep 26, 2024

I would like to organise a call so we can discuss the plan of action of this epic (as proposed #12536 (review) and in the ASF slack). In the meantime I'll try to collect as many open questions as possible and outline a meeting document.

cc @alamb , @findepi , @jayzhan211 , @ozankabak and anyone else interested in this effort.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Sep 26, 2024

I suggest to propose the plan directly, since it requires thinking to response, maybe not that efficiently to work in a call synchronously. And it would be more open to random people in the community.

@findepi
Copy link
Member

findepi commented Sep 26, 2024

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
see #11513 (comment)
and also @andygrove 's feedback from Comet perspective #11513 (comment)

We should also make it a goal that function are expressible directly on logical types. Otherwise we won't be able to use them in logical planning. While function invocation can continue to work on physical representation, it does not necessarily have to be function implementor's burden to cater for them. See #12635

@notfilippo
Copy link
Contributor Author

notfilippo commented Sep 26, 2024

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
We should also make it a goal that function are expressible directly on logical types.

While I support both ideas (one of them was also mentioned in the original proposal) I think that there is still some discussion to be had before making them goals of this effort (#11513 (comment), #11513 (comment), and #11513 (comment)).

I suggest we first define a plan (I'm drafting it now) for decoupling logical types from physical types (including the issue with functions that you are mentioning) and in parallel we can continue to validate this two ideas.

@findepi
Copy link
Member

findepi commented Sep 27, 2024

Thanks @notfilippo . I understand it's not your goal to remove Arrow types from physical plans.
It looks like we have same discussion in two places now. I won't copy my reply from #11513, it's here for reference: #11513 (comment)
TL;DR is. -- the goals impact the design; incremental improvements lead to local optimum.

@notfilippo
Copy link
Contributor Author

notfilippo commented Sep 30, 2024

What follows is my idea of the plan we might want to take to tackle this issue.


Status and goals

Current status

image

  • Currently Datafusion uses DFSchema (a qualified arrow::Schema) as the type source for both its logical and physical plan.
  • During execution a physical plan gets interpreted against some physical data, in the form of record batches which contain (if everything is correct) the same schema expected as input by the plan.

#11513 initial goal

image

  • Datafusion will be using a more generic schema as input for it's logical plan, containing only logical type information.
  • Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
  • Execution will remain unchanged.

"Record batches as the physical type source" goal

image

  • Datafusion will only use logical type information for both logical and physical planning
  • During execution the physical type information will be sourced by the record batches as the single source of truth.

How do we get there?

a.k.a. changing the tires of a moving vehicle.

image

Introducing logical type limitations internally while keeping inputs and outputs the same should be helpful in order not to introduce too many breaking changes at once. At a high level, we could keep the type sources unchanged and just gradually limit the internal behaviour of the engine in order to reason about logical types instead of physical.

Towards a Logical LogicalPlan

The Scalar type

image

Currently the LogicalPlan uses physical type information in the tree. This information comes from table scans and scalar values. We could initially look into modifying the scalar value in order to potentially represent fewer variants of types (from Utf8, LargeUtf8, Utf8View, Dictionary(Utf8) and more to just Utf8). But the physical type information required to be understandable by the current system still needs to be temporarily taken into account.

image

This would require the introduction of a scalar type which would track the physical type of the scalar in the current logical plan, wrapping the "logical" value of the scalar.

image

Casting to logically equivalent types should become straightforward.

This effort is being tracked in this epic in Make ScalarValue values logical section.

Introduce the logical type

Introduce LogicalType and the "logically equivalent" notion. Now ScalarValue will not have a data_type() method, instead it will return its LogicalType, and data_type() will be instead defined in Scalar. Currently there is some discussion in #11513 around how should it be represented.

Logical expressions

image

It's time to make Expr logical. While type sources like Scalar and TableScans's columns will still provide physical types, their interaction internally will happen via logical type comparisons.

What about functions?

image

UDFs should be able to the work in the logical type system. A user defined function should be able to:

  • Decide, given a signature, if a / some logical type can fullfill the requirements
  • Identify, given some logical inputs, a logical return type

Discussion is happening here: #12635

Logical LogicalPlan

image

Once expressions are logical we can move type sources to the logical plane of translate their type information from physical to logical upstream. (Like a TableProvider schema() (physical) call being immediately translated to logical).


👷 WIP "Record batches as the physical type source" goal

cc @alamb, @jayzhan211, @findepi lmk if this plan seems logical 😄

@alamb
Copy link
Contributor

alamb commented Oct 1, 2024

#11513 initial goal

Datafusion will be using a more generic schema as input for it's logical plan, containing only logical type information.
Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
Execution will remain unchanged.

This seems like a good first step to me (and also a massive project in itself)

"Record batches as the physical type source" goal

I think it is wise to split this out into its own second goal (as it can scope down the first part). It is probably good to note that this means we won't have "adaptive schema" in DataFusion at runtime until the second part is complete

UDFs should be able to the work in the logical type system. A user defined function should be able to:

I think UDFs will need to retain physical information (at least in invoke() which manipulates the data directly)

All in all this sounds like an epic project. I hope we have the people who are excited to help make it happen!

@notfilippo
Copy link
Contributor Author

notfilippo commented Oct 7, 2024

I've opened #12793 in order to continue the effort according to the plan. cc @jayzhan211 @findepi

@jayzhan211
Copy link
Contributor

jayzhan211 commented Oct 8, 2024

I can work on the logical type that replace current function signature on main. #12751 (comment)

@notfilippo
Copy link
Contributor Author

I can work on the logical type that replace current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

@jayzhan211
Copy link
Contributor

I can work on the logical type that replace current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

Sure

@findepi
Copy link
Member

findepi commented Oct 9, 2024

What follows is my idea of the plan we might want to take to tackle this issue.

@notfilippo this (#12622 (comment)) is a very nice graphical representation of what we want to do. thank you

  • I was planning on introducing the logical types

@notfilippo awesome!
i was hoping we get logical types sooner than later, even if nothing uses them initially.
simple functions #12635 is currently blocked on this

@notfilippo
Copy link
Contributor Author

i was hoping we get logical types sooner than later, even if nothing uses them initially. Simple functions #12635 is currently blocked on this

I've opened #12853 targeting main.

@notfilippo
Copy link
Contributor Author

Can a committer merge #13016 on logical-types? cc @alamb @jayzhan211

@jayzhan211
Copy link
Contributor

Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge?

@tobixdev
Copy link
Contributor

I'd like to help out with this (see this comment for some context).

Depending on how we proceed, I can help out with rebasing this branch or by trying to remove ScalarValue::LargeUtf8 etc. (hopefully with some inspiration from #11978).
Just lmk how you want to proceed.

@jayzhan211
Copy link
Contributor

I'd like to help out with this (see this comment for some context).

Depending on how we proceed, I can help out with rebasing this branch or by trying to remove ScalarValue::LargeUtf8 etc. (hopefully with some inspiration from #11978).
Just lmk how you want to proceed.

It would be great! I think we need to rebase the branch first then remove scalarvlaue utf8view and largeutf8

@tobixdev
Copy link
Contributor

Sounds good! I've started updating this branch to the main branch on my fork.
There were a bunch of merge conflicts, so hopefully I didn't mess anything up during resolving them.

I still need to iron out some issues, but I hope to create a PR tomorrow.

@tobixdev
Copy link
Contributor

tobixdev commented Jan 17, 2025

Merging the branch with the upstream is quite a project as there are many merge conflicts and a lot of incompatible code was created in the meantime (on the other hand, it's good to see so much progress on DataFusion).
Before investing more time in this, I'd like to start another discussion about whether this is the solution we are going for (kind of like a result from adapting the new code from main).

As I understand it, the new Scalar struct is only necessary for carrying a physical data type for a logical value so that we can support operations like arrow_cast.

Unfortunately, using Scalar in ColumnarValue and Expr breaks all patterns that try to match the scalar for these types.
While we can fix this in the DataFusion code base (we need a nested match or smth similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects.
Here is an example from the current version of logical-types.

Even this may be acceptable if the resulting solution is stable for the foreseeable future when considering the possible impact of logical types. However, being unable to match scalar types and values will undoubtedly impact ergonomics for some use cases. Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided.

Therefore, are we sure that we require the Scalar type? If yes, I can proceed with updating the PR. If not, maybe we should try to infer the necessary physical types from the schema to avoid this large breaking change.

I also experimented with adding a new variant WithPhysicalType to ScalarValue. This works great regarding breaking changes as most match statements only consider the set of "supported" scalars. Unfortunately, it allows users to miss "logically equivalent" values with a different physical type when they only match against the regular variants. So I also think this is not an adequate solution (e.g., ScalarValue::WithPhysicalType(Box::new(ScalarValue::Utf8("abc"), DataType::Utf8View) will not be matched by ScalarValue::Utf8("abc"). However, it may be a less invasive transitional vehicle to carry the physical data type until we can infer the data type from the schema (and update optimizations etc.) to respect this.

Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge?

Adopting this strategy would mean that we cannot release logical types until we can get rid of the WithPhysicalType variant, which is unfortunate. Maybe we could even release this to the downstream dependencies with adequate documentation and maybe marking the variant as deprecated? However, not breaking the pattern-matching capabilities of ColumnarValue and Expr may be worth the effort, and synching the main branch should be more straightforward as we also do not have that many breaking changes.

Any thoughts on that?

cc @notfilippo @findepi @alamb @jayzhan211

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jan 18, 2025

Unfortunately, using Scalar in ColumnarValue and Expr breaks all patterns that try to match the scalar for these types.
While we can fix this in the DataFusion code base (we need a nested match or smth similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects.
Here is an example from the current version of logical-types.

I think downstream project are not all forced to switch to Scalar they can keep their ScalarValue matching code, except for Utf8View and LargeUtf8 matching cases

Scalar is introduced to separate the concept of arrow's DataType with the value itself.

The following is equivalent

ScalarValue::Utf8View

Scalar {
  ScalarValue::Utf8(String),
  DataType::Utf8View
}
ScalarValue::LargeUtf8

Scalar {
  ScalarValue::Utf8(String),
  DataType::LargeUtf8
}

Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided

I agree, ScalarValue::Utf8(String) is probably enough, but given the current status of the code, jump directly to this state might be a challenge because we might not have schema to tell which physical type the string is for any kind of type matching code 🤔

If we can find such approach it would be great. Scalar is just a practical approach that we can iterate on it with. There might be a better solution toward the goal, but I have no such idea in my mind.

@tobixdev
Copy link
Contributor

tobixdev commented Jan 18, 2025

I think downstream project are not all forced to switch to Scalar they can keep their ScalarValue matching code, except for Utf8View and LargeUtf8 matching cases

Scalar is introduced to separate the concept of arrow's DataType with the value itself.

Sorry I wasn't clear on this. I meant that matching directly on ColumnarValue is not possible anymore. Basically, you have to create nested match statements and call .value() on the Scalar.

let take_idx = match &args[2] {
    ColumnarValue::Scalar(ScalarValue::Int64(Some(v))) if v < &2 => *v as usize,
    _ => unreachable!(),
};

becomes

let take_idx = match &args[2] {
    ColumnarValue::Scalar(scalar) => match scalar.value() {
        ScalarValue::Int64(Some(v)) if v < &2 => *v as usize,
        _ => unreachable!(),
    },
    _ => unreachable!(),
}

And this isn't a deal breaker for me as one may argue that distinguishing between scalar/non-scalar and scalar types are two pair of shoes and should be handled separately, but at least in some cases this will impact ergonomics (e.g., tests). Maybe this also isn't such a big deal if downstream projects are not using these types of patterns.

If we can find such approach it would be great. Scalar is just a practical approach that we can iterate on it with. There might be a better solution toward the goal, but I have no such idea in my mind.

I also don't know a good solution and maybe Scalar is the best way to achieve this. But maybe we are also approaching this wrong and we might just require some tinkering at a different spot such that removing ScalarValue::LargeUtf8 and ScalarValue::Utf8View just becomes removing the enum variants. Sorry if I am rehashing some conversations here that you already had earlier. Just want to make sure that we are on the right track. :)

EDIT:
So I've given it some more thought and after experimenting a bit I think I am sold on using Scalar like it was intended. So if no one suggests an alternative I'll be working on getting the branch up-to-date.

@jayzhan211
Copy link
Contributor

Pattern matching is not impossible

        match value {
            ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Int64(_)) && scalar.as_i64() > &2 => {
                
            }
            ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Utf8(_)) && scalar.as_string().as_str() == "datafusion" => {
                
            }
        }
    pub fn as_i64_opt(&self) -> Option<&i64> {
        match self.value() {
            ScalarValue::Int64(v) => v.as_ref(),
            _ => None,
        }
    }

    pub fn as_i64(&self) -> &i64 {
        self.as_i64_opt().unwrap()
    }

    pub fn as_string_opt(&self) -> Option<&String> {
        match self.value() {
            ScalarValue::Utf8(v) => v.as_ref(),
            _ => None,
        }
    }

    pub fn as_string(&self) -> &String {
       self.as_string_opt().unwrap()
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants