[EPIC] Decouple logical from physical types #12622

notfilippo · 2024-09-25T14:41:34Z

Is your feature request related to a problem or challenge?

This epic tracks an ordered list tasks related to the proposal: Decouple logical from physical types (#11513). The goal is:

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

LogicalPlans will use LogicalType while PhysicalPlans will use DataType.

Describe the solution you'd like

Make ScalarValue values logical:

Introduce Scalar type for ColumnarValue #12536 (merged in logical-types)
[logical-types] use Scalar in Expr::Logical #12793 (merged in logical-types)
feat(logical-types): add NativeType and LogicalType #12853
Remove ScalarValue::LargeUtf8 and ScalarValue::Utf8View in favour of ScalarValue::Utf8
Remove ScalarValue::LargeBinary and ScalarValue::BinaryView in favour of ScalarValue::Binary
Remove ScalarValue::Dictionary (from Remove ScalarValue::Dictionary #12488)

The text was updated successfully, but these errors were encountered:

notfilippo · 2024-09-26T07:58:55Z

I would like to organise a call so we can discuss the plan of action of this epic (as proposed #12536 (review) and in the ASF slack). In the meantime I'll try to collect as many open questions as possible and outline a meeting document.

cc @alamb , @findepi , @jayzhan211 , @ozankabak and anyone else interested in this effort.

jayzhan211 · 2024-09-26T08:15:25Z

I suggest to propose the plan directly, since it requires thinking to response, maybe not that efficiently to work in a call synchronously. And it would be more open to random people in the community.

findepi · 2024-09-26T14:11:52Z

Logical operators during logical planning should unquestionably not have access to the physical type information, which should exclusively be reserved to the physical planning and physical execution.

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
see #11513 (comment)
and also @andygrove 's feedback from Comet perspective #11513 (comment)

We should also make it a goal that function are expressible directly on logical types. Otherwise we won't be able to use them in logical planning. While function invocation can continue to work on physical representation, it does not necessarily have to be function implementor's burden to cater for them. See #12635

notfilippo · 2024-09-26T14:23:29Z

We should make it a goal that physical planning also abstracts over physical representation of individual batches.
We should also make it a goal that function are expressible directly on logical types.

While I support both ideas (one of them was also mentioned in the original proposal) I think that there is still some discussion to be had before making them goals of this effort (#11513 (comment), #11513 (comment), and #11513 (comment)).

I suggest we first define a plan (I'm drafting it now) for decoupling logical types from physical types (including the issue with functions that you are mentioning) and in parallel we can continue to validate this two ideas.

findepi · 2024-09-27T08:55:27Z

Thanks @notfilippo . I understand it's not your goal to remove Arrow types from physical plans.
It looks like we have same discussion in two places now. I won't copy my reply from #11513, it's here for reference: #11513 (comment)
TL;DR is. -- the goals impact the design; incremental improvements lead to local optimum.

notfilippo · 2024-09-30T17:55:48Z

What follows is my idea of the plan we might want to take to tackle this issue.

Status and goals

Current status

Currently Datafusion uses DFSchema (a qualified arrow::Schema) as the type source for both its logical and physical plan.
During execution a physical plan gets interpreted against some physical data, in the form of record batches which contain (if everything is correct) the same schema expected as input by the plan.

#11513 initial goal

Datafusion will be using a more generic schema as input for it's logical plan, containing only logical type information.
Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
Execution will remain unchanged.

"Record batches as the physical type source" goal

Datafusion will only use logical type information for both logical and physical planning
During execution the physical type information will be sourced by the record batches as the single source of truth.

How do we get there?

a.k.a. changing the tires of a moving vehicle.

Introducing logical type limitations internally while keeping inputs and outputs the same should be helpful in order not to introduce too many breaking changes at once. At a high level, we could keep the type sources unchanged and just gradually limit the internal behaviour of the engine in order to reason about logical types instead of physical.

Towards a Logical LogicalPlan

The `Scalar` type

Currently the LogicalPlan uses physical type information in the tree. This information comes from table scans and scalar values. We could initially look into modifying the scalar value in order to potentially represent fewer variants of types (from Utf8, LargeUtf8, Utf8View, Dictionary(Utf8) and more to just Utf8). But the physical type information required to be understandable by the current system still needs to be temporarily taken into account.

This would require the introduction of a scalar type which would track the physical type of the scalar in the current logical plan, wrapping the "logical" value of the scalar.

Casting to logically equivalent types should become straightforward.

This effort is being tracked in this epic in Make ScalarValue values logical section.

Introduce the logical type

Introduce LogicalType and the "logically equivalent" notion. Now ScalarValue will not have a data_type() method, instead it will return its LogicalType, and data_type() will be instead defined in Scalar. Currently there is some discussion in #11513 around how should it be represented.

Logical expressions

It's time to make Expr logical. While type sources like Scalar and TableScans's columns will still provide physical types, their interaction internally will happen via logical type comparisons.

What about functions?

UDFs should be able to the work in the logical type system. A user defined function should be able to:

Decide, given a signature, if a / some logical type can fullfill the requirements
Identify, given some logical inputs, a logical return type

Discussion is happening here: #12635

Logical LogicalPlan

Once expressions are logical we can move type sources to the logical plane of translate their type information from physical to logical upstream. (Like a TableProvider schema() (physical) call being immediately translated to logical).

👷 WIP "Record batches as the physical type source" goal

cc @alamb, @jayzhan211, @findepi lmk if this plan seems logical 😄

alamb · 2024-10-01T10:18:14Z

#11513 initial goal

Datafusion will be using a more generic schema as input for it's logical plan, containing only logical type information.
Physical planning will remain unchanged, requiring instead the complete schema with the physical type information.
Execution will remain unchanged.

This seems like a good first step to me (and also a massive project in itself)

"Record batches as the physical type source" goal

I think it is wise to split this out into its own second goal (as it can scope down the first part). It is probably good to note that this means we won't have "adaptive schema" in DataFusion at runtime until the second part is complete

UDFs should be able to the work in the logical type system. A user defined function should be able to:

I think UDFs will need to retain physical information (at least in invoke() which manipulates the data directly)

All in all this sounds like an epic project. I hope we have the people who are excited to help make it happen!

notfilippo · 2024-10-07T16:40:30Z

I've opened #12793 in order to continue the effort according to the plan. cc @jayzhan211 @findepi

jayzhan211 · 2024-10-08T02:33:37Z

I can work on the logical type that replace current function signature on main. #12751 (comment)

notfilippo · 2024-10-08T09:05:40Z

I can work on the logical type that replace current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

jayzhan211 · 2024-10-08T09:19:33Z

I can work on the logical type that replace current function signature on main. #12751 (comment)

@jayzhan211 -- I was planning on introducing the logical types in the following PR, as I already started working on it. Do you mind waiting for it and then using it to replace the function signature?

Sure

findepi · 2024-10-09T09:57:39Z

What follows is my idea of the plan we might want to take to tackle this issue.

@notfilippo this (#12622 (comment)) is a very nice graphical representation of what we want to do. thank you

I was planning on introducing the logical types

@notfilippo awesome!
i was hoping we get logical types sooner than later, even if nothing uses them initially.
simple functions #12635 is currently blocked on this

notfilippo · 2024-10-10T14:59:53Z

i was hoping we get logical types sooner than later, even if nothing uses them initially. Simple functions #12635 is currently blocked on this

I've opened #12853 targeting main.

notfilippo · 2024-10-21T14:02:32Z

Can a committer merge #13016 on logical-types? cc @alamb @jayzhan211

jayzhan211 · 2025-01-11T07:30:41Z

Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge?

tobixdev · 2025-01-16T10:28:16Z

I'd like to help out with this (see this comment for some context).

Depending on how we proceed, I can help out with rebasing this branch or by trying to remove ScalarValue::LargeUtf8 etc. (hopefully with some inspiration from #11978).
Just lmk how you want to proceed.

jayzhan211 · 2025-01-16T10:55:47Z

I'd like to help out with this (see this comment for some context).

Depending on how we proceed, I can help out with rebasing this branch or by trying to remove ScalarValue::LargeUtf8 etc. (hopefully with some inspiration from #11978).
Just lmk how you want to proceed.

It would be great! I think we need to rebase the branch first then remove scalarvlaue utf8view and largeutf8

tobixdev · 2025-01-16T21:48:44Z

Sounds good! I've started updating this branch to the main branch on my fork.
There were a bunch of merge conflicts, so hopefully I didn't mess anything up during resolving them.

I still need to iron out some issues, but I hope to create a PR tomorrow.

tobixdev · 2025-01-17T11:06:30Z

Merging the branch with the upstream is quite a project as there are many merge conflicts and a lot of incompatible code was created in the meantime (on the other hand, it's good to see so much progress on DataFusion).
Before investing more time in this, I'd like to start another discussion about whether this is the solution we are going for (kind of like a result from adapting the new code from main).

As I understand it, the new Scalar struct is only necessary for carrying a physical data type for a logical value so that we can support operations like arrow_cast.

Unfortunately, using Scalar in ColumnarValue and Expr breaks all patterns that try to match the scalar for these types.
While we can fix this in the DataFusion code base (we need a nested match or smth similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects.
Here is an example from the current version of logical-types.

Even this may be acceptable if the resulting solution is stable for the foreseeable future when considering the possible impact of logical types. However, being unable to match scalar types and values will undoubtedly impact ergonomics for some use cases. Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided.

Therefore, are we sure that we require the Scalar type? If yes, I can proceed with updating the PR. If not, maybe we should try to infer the necessary physical types from the schema to avoid this large breaking change.

I also experimented with adding a new variant WithPhysicalType to ScalarValue. This works great regarding breaking changes as most match statements only consider the set of "supported" scalars. Unfortunately, it allows users to miss "logically equivalent" values with a different physical type when they only match against the regular variants. So I also think this is not an adequate solution (e.g., ScalarValue::WithPhysicalType(Box::new(ScalarValue::Utf8("abc"), DataType::Utf8View) will not be matched by ScalarValue::Utf8("abc"). However, it may be a less invasive transitional vehicle to carry the physical data type until we can infer the data type from the schema (and update optimizations etc.) to respect this.

Since the logical-types branch can easily diverge from the main branch, even when the sub-tasks are incomplete, would it be better to merge it into the main branch frequently and continue evolving it as new ideas emerge?

Adopting this strategy would mean that we cannot release logical types until we can get rid of the WithPhysicalType variant, which is unfortunate. ~~Maybe we could even release this to the downstream dependencies with adequate documentation and maybe marking the variant as deprecated~~? However, not breaking the pattern-matching capabilities of ColumnarValue and Expr may be worth the effort, and synching the main branch should be more straightforward as we also do not have that many breaking changes.

Any thoughts on that?

cc @notfilippo @findepi @alamb @jayzhan211

jayzhan211 · 2025-01-18T06:37:13Z

Unfortunately, using Scalar in ColumnarValue and Expr breaks all patterns that try to match the scalar for these types.
While we can fix this in the DataFusion code base (we need a nested match or smth similar in many cases), all downstream dependencies must rewrite their pattern-matching code, which can be (I assume) a huge undertaking for many projects.
Here is an example from the current version of logical-types.

I think downstream project are not all forced to switch to Scalar they can keep their ScalarValue matching code, except for Utf8View and LargeUtf8 matching cases

Scalar is introduced to separate the concept of arrow's DataType with the value itself.

The following is equivalent

ScalarValue::Utf8View

Scalar {
  ScalarValue::Utf8(String),
  DataType::Utf8View
}

ScalarValue::LargeUtf8

Scalar {
  ScalarValue::Utf8(String),
  DataType::LargeUtf8
}

Furthermore, if we ever conclude that we do not need the physical type (e.g., because we can derive the physical type from the schema), these breaking changes could have been avoided

I agree, ScalarValue::Utf8(String) is probably enough, but given the current status of the code, jump directly to this state might be a challenge because we might not have schema to tell which physical type the string is for any kind of type matching code 🤔

If we can find such approach it would be great. Scalar is just a practical approach that we can iterate on it with. There might be a better solution toward the goal, but I have no such idea in my mind.

tobixdev · 2025-01-18T09:45:38Z

I think downstream project are not all forced to switch to Scalar they can keep their ScalarValue matching code, except for Utf8View and LargeUtf8 matching cases

Scalar is introduced to separate the concept of arrow's DataType with the value itself.

Sorry I wasn't clear on this. I meant that matching directly on ColumnarValue is not possible anymore. Basically, you have to create nested match statements and call .value() on the Scalar.

let take_idx = match &args[2] {
    ColumnarValue::Scalar(ScalarValue::Int64(Some(v))) if v < &2 => *v as usize,
    _ => unreachable!(),
};

becomes

let take_idx = match &args[2] {
    ColumnarValue::Scalar(scalar) => match scalar.value() {
        ScalarValue::Int64(Some(v)) if v < &2 => *v as usize,
        _ => unreachable!(),
    },
    _ => unreachable!(),
}

And this isn't a deal breaker for me as one may argue that distinguishing between scalar/non-scalar and scalar types are two pair of shoes and should be handled separately, but at least in some cases this will impact ergonomics (e.g., tests). Maybe this also isn't such a big deal if downstream projects are not using these types of patterns.

If we can find such approach it would be great. Scalar is just a practical approach that we can iterate on it with. There might be a better solution toward the goal, but I have no such idea in my mind.

I also don't know a good solution and maybe Scalar is the best way to achieve this. But maybe we are also approaching this wrong and we might just require some tinkering at a different spot such that removing ScalarValue::LargeUtf8 and ScalarValue::Utf8View just becomes removing the enum variants. Sorry if I am rehashing some conversations here that you already had earlier. Just want to make sure that we are on the right track. :)

EDIT:
So I've given it some more thought and after experimenting a bit I think I am sold on using Scalar like it was intended. So if no one suggests an alternative I'll be working on getting the branch up-to-date.

jayzhan211 · 2025-01-19T06:40:08Z

Pattern matching is not impossible

        match value {
            ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Int64(_)) && scalar.as_i64() > &2 => {
                
            }
            ColumnarValue::Scalar(scalar) if matches!(scalar.value(), ScalarValue::Utf8(_)) && scalar.as_string().as_str() == "datafusion" => {
                
            }
        }

    pub fn as_i64_opt(&self) -> Option<&i64> {
        match self.value() {
            ScalarValue::Int64(v) => v.as_ref(),
            _ => None,
        }
    }

    pub fn as_i64(&self) -> &i64 {
        self.as_i64_opt().unwrap()
    }

    pub fn as_string_opt(&self) -> Option<&String> {
        match self.value() {
            ScalarValue::Utf8(v) => v.as_ref(),
            _ => None,
        }
    }

    pub fn as_string(&self) -> &String {
       self.as_string_opt().unwrap()
    }

notfilippo added the enhancement New feature or request label Sep 25, 2024

findepi mentioned this issue Sep 27, 2024

Extension Types #12644

Open

notfilippo mentioned this issue Sep 30, 2024

Remove LargeUtf8|Binary, Utf8|BinaryView, and Dictionary from ScalarValue #11978

Closed

notfilippo mentioned this issue Oct 1, 2024

[Proposal] Decouple logical from physical types #11513

Open

notfilippo mentioned this issue Oct 7, 2024

[logical-types] use Scalar in Expr::Logical #12793

Merged

findepi mentioned this issue Oct 9, 2024

Simple Functions #12635

Open

notfilippo mentioned this issue Oct 10, 2024

feat(logical-types): add NativeType and LogicalType #12853

Merged

This was referenced Oct 16, 2024

Oct 16, 2024: This week in DataFusion #12973

Closed

Oct 21, 2024: This week in DataFusion #13035

Closed

jayzhan211 mentioned this issue Nov 7, 2024

[DISCUSSION] 2024 Q4 / 2025 Q1 Roadmap #13274

Open

gatesn mentioned this issue Nov 11, 2024

Access children DataType or return-type in ScalarUDFImpl::invoke #12819

Closed

findepi mentioned this issue Nov 16, 2024

Add support for Utf8View to string_to_array and array_to_string #13403

Merged

This was referenced Nov 18, 2024

Preserve field name when casting List #13468

Merged

[DISCUSSION] Making it easier to use DataFusion (lessons from GlareDB) #13525

Open

This was referenced Jan 19, 2025

Update logical-types to main #14202

Merged

Update Logical Types Branch #14241

Merged

findepi mentioned this issue Jan 23, 2025

Change ReturnTypeInfo to return a Field rather than DataType #14247

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[EPIC] Decouple logical from physical types #12622

[EPIC] Decouple logical from physical types #12622

notfilippo commented Sep 25, 2024 •

edited by jayzhan211

Loading

notfilippo commented Sep 26, 2024 •

edited

Loading

jayzhan211 commented Sep 26, 2024 •

edited

Loading

findepi commented Sep 26, 2024

notfilippo commented Sep 26, 2024 •

edited

Loading

findepi commented Sep 27, 2024

notfilippo commented Sep 30, 2024 •

edited

Loading

alamb commented Oct 1, 2024

notfilippo commented Oct 7, 2024 •

edited

Loading

jayzhan211 commented Oct 8, 2024 •

edited

Loading

notfilippo commented Oct 8, 2024

jayzhan211 commented Oct 8, 2024

findepi commented Oct 9, 2024

notfilippo commented Oct 10, 2024

notfilippo commented Oct 21, 2024

jayzhan211 commented Jan 11, 2025

tobixdev commented Jan 16, 2025

jayzhan211 commented Jan 16, 2025

tobixdev commented Jan 16, 2025

tobixdev commented Jan 17, 2025 •

edited

Loading

jayzhan211 commented Jan 18, 2025 •

edited

Loading

tobixdev commented Jan 18, 2025 •

edited

Loading

jayzhan211 commented Jan 19, 2025

[EPIC] Decouple logical from physical types #12622

[EPIC] Decouple logical from physical types #12622

Comments

notfilippo commented Sep 25, 2024 • edited by jayzhan211 Loading

Is your feature request related to a problem or challenge?

Describe the solution you'd like

notfilippo commented Sep 26, 2024 • edited Loading

jayzhan211 commented Sep 26, 2024 • edited Loading

findepi commented Sep 26, 2024

notfilippo commented Sep 26, 2024 • edited Loading

findepi commented Sep 27, 2024

notfilippo commented Sep 30, 2024 • edited Loading

Status and goals

Current status

#11513 initial goal

"Record batches as the physical type source" goal

How do we get there?

Towards a Logical LogicalPlan

The Scalar type

Introduce the logical type

Logical expressions

What about functions?

Logical LogicalPlan

alamb commented Oct 1, 2024

notfilippo commented Oct 7, 2024 • edited Loading

jayzhan211 commented Oct 8, 2024 • edited Loading

notfilippo commented Oct 8, 2024

jayzhan211 commented Oct 8, 2024

findepi commented Oct 9, 2024

notfilippo commented Oct 10, 2024

notfilippo commented Oct 21, 2024

jayzhan211 commented Jan 11, 2025

tobixdev commented Jan 16, 2025

jayzhan211 commented Jan 16, 2025

tobixdev commented Jan 16, 2025

tobixdev commented Jan 17, 2025 • edited Loading

jayzhan211 commented Jan 18, 2025 • edited Loading

tobixdev commented Jan 18, 2025 • edited Loading

jayzhan211 commented Jan 19, 2025

notfilippo commented Sep 25, 2024 •

edited by jayzhan211

Loading

notfilippo commented Sep 26, 2024 •

edited

Loading

jayzhan211 commented Sep 26, 2024 •

edited

Loading

notfilippo commented Sep 26, 2024 •

edited

Loading

notfilippo commented Sep 30, 2024 •

edited

Loading

The `Scalar` type

notfilippo commented Oct 7, 2024 •

edited

Loading

jayzhan211 commented Oct 8, 2024 •

edited

Loading

tobixdev commented Jan 17, 2025 •

edited

Loading

jayzhan211 commented Jan 18, 2025 •

edited

Loading

tobixdev commented Jan 18, 2025 •

edited

Loading