Make a faster way to check column existence in optimizer (not `is_err()`) #5309

alamb · 2023-02-16T19:07:31Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157

There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column

For example, there is code that looks like this (it calls is_ok() or is_err() and totally discards the error along with its string):

input_schema.field_from_column(col).is_ok()

This is problematic because these functions return a DataFusionError that not only has an allocated String but has often also gone through a lot of effort to construct a nice error message. You can see them appearing in the trace on #5157.

As part of making the optimizer faster (related to #5157), we need to avoid these string allocations.

Thus I propose:

Add new functions for checking that return a bool rather than an error
Replace the uses of is_ok() / is_err() with calls to the new functions

For example,

impl DFSchema {
  // existing function that returns Result
  pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}

  // new function that returns bool  <---- Add this new function
  pub fn has_column(&self, column: &Column) -> bool {...}
}

And then replace the code that has the pattern

input_schema.field_from_column(col).is_ok()

With

input_schema.has_column(col)
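A minimal sketch of the difference between the two lookups, under stated assumptions: `Column` and `Schema` below are simplified, hypothetical stand-ins rather than DataFusion's real `Column` and `DFSchema`, and the error is a plain `String` instead of `DataFusionError`. It only illustrates that the fallible lookup formats a message on a miss while the bool check does not build any error value.

```rust
// Hypothetical stand-ins for illustration; not the real DataFusion types.
#[derive(Debug, Clone, PartialEq)]
pub struct Column {
    pub relation: Option<String>,
    pub name: String,
}

pub struct Schema {
    fields: Vec<Column>,
}

impl Schema {
    pub fn new(fields: Vec<Column>) -> Self {
        Self { fields }
    }

    /// Fallible lookup: on a miss it formats (and allocates) an error message.
    pub fn field_from_column(&self, column: &Column) -> Result<&Column, String> {
        self.fields.iter().find(|f| *f == column).ok_or_else(|| {
            format!("No field named {:?}. Valid fields are {:?}", column, self.fields)
        })
    }

    /// Existence check: returns a bool and never constructs an error.
    pub fn has_column(&self, column: &Column) -> bool {
        self.fields.iter().any(|f| f == column)
    }
}
```

With these stand-ins, `schema.field_from_column(col).is_ok()` and `schema.has_column(col)` give the same answer, but only the first pays for the discarded message.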

Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )

Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema but simply avoiding this check might help a lot

Additional context
I think this is a good first exercise, as the desired outcome is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise.


suxiaogang223 · 2023-02-17T07:44:51Z

I'm happy to do this 🌝

ygf11 · 2023-02-17T09:30:28Z

This seems related to PR #5287.

suxiaogang223 · 2023-02-17T17:06:40Z

> This seems related to PR #5287.

Yes, I think the method has_column should also distinguish between the FieldNotFound and Ambiguous reference errors.
Maybe the new method should be

pub fn has_column(&self, column: &Column) -> Result<bool> {...}

alamb · 2023-02-17T18:14:44Z

The key for performance is not to return a DataFusionError with an allocated string

suxiaogang223 · 2023-02-17T18:24:06Z

> The key for performance is not to return a DataFusionError with an allocated string

Maybe we can use an assert to assume that the "Ambiguous reference" error should not happen in has_column?

alamb · 2023-02-17T18:38:14Z

> Maybe we can use an assert to assume that the "Ambiguous reference" error should not happen in has_column?

How about returning an enum like

enum FoundColumn {
  Found,
  NotFound,
  Ambiguous
}


pub fn has_column(&self, column: &Column) -> FoundColumn {...}

?
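A small sketch of how such an enum-returning lookup might behave, assuming the schema is reduced to a slice of `(qualifier, name)` pairs. The `find_column` helper is hypothetical and only illustrates the three outcomes; none of them carries an allocated message.

```rust
/// The three lookup outcomes from the proposal above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FoundColumn {
    Found,
    NotFound,
    Ambiguous,
}

/// Hypothetical helper: look up an unqualified name among
/// (qualifier, name) pairs and classify the result.
pub fn find_column(fields: &[(&str, &str)], name: &str) -> FoundColumn {
    match fields.iter().filter(|(_, n)| *n == name).count() {
        0 => FoundColumn::NotFound,
        1 => FoundColumn::Found,
        _ => FoundColumn::Ambiguous,
    }
}
```

A caller that only cares about existence can compare against `FoundColumn::Found`, while a caller that must report ambiguity can build its error only in that one branch.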

suxiaogang223 · 2023-02-17T19:45:33Z

I think returning an enum will be the same as returning a Result, because the caller also has to handle Ambiguous and return an Err.

Returning Result<bool> can also avoid the string allocation in field_from_name().is_err(). The code will look like this:

if schema.has_column(col)? {...}
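As a rough sketch of this `Result<bool>` shape: the not-found case becomes `Ok(false)` and builds no error, and only a genuinely ambiguous reference produces an `Err`. The `LookupError` type and both functions below are hypothetical; in DataFusion the error would be a `DataFusionError`, which is not modelled here.

```rust
/// Hypothetical payload-free error; stands in for the "Ambiguous reference" case.
#[derive(Debug, PartialEq, Eq)]
pub enum LookupError {
    Ambiguous,
}

/// Ok(true) on exactly one match, Ok(false) on none, Err on more than one.
pub fn has_column(fields: &[(&str, &str)], name: &str) -> Result<bool, LookupError> {
    let mut matches = fields.iter().filter(|(_, n)| *n == name);
    match (matches.next(), matches.next()) {
        (None, _) => Ok(false),
        (Some(_), None) => Ok(true),
        (Some(_), Some(_)) => Err(LookupError::Ambiguous),
    }
}

/// Caller using `?`: ambiguity propagates, absence is just `false`.
pub fn keep_if_present<'a>(
    fields: &[(&str, &str)],
    name: &'a str,
) -> Result<Option<&'a str>, LookupError> {
    if has_column(fields, name)? {
        Ok(Some(name))
    } else {
        Ok(None)
    }
}
```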

Maybe the key question is whether we need to check for the 'Ambiguous' error each time has_column is called. We could instead check for the Ambiguous error just once at the beginning.

I'm not sure if my thinking is correct, need your advice @alamb 🤓

alamb · 2023-02-18T13:41:28Z

Not sure -- will try and check out #5328 shortly

matthewmturner · 2023-12-29T05:50:31Z

@alamb i can pick this up

alamb · 2023-12-30T14:10:19Z

> @alamb i can pick this up

Thank you @matthewmturner -- I think this is a "tip of the iceberg" type of bug, where there are many places in the optimizer that use DFSchema and could be made faster.

Thus I suggest if possible, taking some time to map out a plan to incrementally improve the situation over time

matthewmturner · 2023-12-30T14:33:05Z

@alamb Sounds good will do that

alamb · 2023-12-30T14:46:19Z

#8665 (comment) might be instructive

matthewmturner · 2023-12-30T15:43:26Z

@alamb aha i had plans to profile that exact thing as a starting point.

matthewmturner · 2023-12-30T23:14:33Z

I tried reproducing your results with Instruments but wasn't able to get to the granularity you had, which showed DFSchema as being heavy.

However, I put together a flamegraph and came to a similar conclusion. In the image below, the blocks in purple are the matches for my search of DFSchema. Of those, a lot of the time was in merge and field_with_qualified_name (which is often called by merge); this appears to be consistent with your profiling. It also looks like all uses of DFSchema occur during the optimization pass, which is consistent with your observation.

Based on this, and how field_with_name / field_with_qualified_name are used within merge I think I may be able to simply replace them with has_column_with_unqualified_name / has_column_with_qualified_name which return booleans.
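A sketch of what that replacement could look like, assuming a simplified schema. `Field` and `Schema` are hypothetical stand-ins for `DFField` and `DFSchema`, and this `merge` only approximates the duplicate check described above; the method names are borrowed from the comment.

```rust
// Hypothetical, simplified stand-ins; not the real DataFusion types.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

#[derive(Debug, Default)]
pub struct Schema {
    pub fields: Vec<Field>,
}

impl Schema {
    pub fn has_column_with_qualified_name(&self, qualifier: &str, name: &str) -> bool {
        self.fields
            .iter()
            .any(|f| f.qualifier.as_deref() == Some(qualifier) && f.name == name)
    }

    pub fn has_column_with_unqualified_name(&self, name: &str) -> bool {
        self.fields.iter().any(|f| f.name == name)
    }

    /// Append fields from `other` that are not already present.
    /// The duplicate test is a bool, so a miss never builds an error.
    pub fn merge(&mut self, other: &Schema) {
        for field in &other.fields {
            let duplicate = match field.qualifier.as_deref() {
                Some(q) => self.has_column_with_qualified_name(q, &field.name),
                None => self.has_column_with_unqualified_name(&field.name),
            };
            if !duplicate {
                self.fields.push(field.clone());
            }
        }
    }
}
```

Note that both checks are still linear scans over the fields, so this removes the error construction but not the per-lookup cost on wide schemas.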

I'm hoping, time permitting, to also do some memory / allocation profiling to make sure these types of changes have the desired effect.

alamb · 2023-12-31T12:50:59Z

> simply replace them with has_column_with_unqualified_name / has_column_with_qualified_name which return booleans.

The other thing to do would be to look into making DFSchema cheaper to copy/create, for example using an Arc instead of OwnedTableReference (much as @tustvold did for Field in arrow-rs's Fields) so that copying a DFField doesn't require copying around strings

https://github.com/apache/arrow-datafusion/blob/848f6c395afef790880112f809b1443949d4bb0b/datafusion/common/src/dfschema.rs#L810
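To illustrate the idea of sharing instead of copying, here is a small sketch with two hypothetical field types: one that owns its strings and one that holds them behind `Arc<str>`. Neither is the real `DFField` or `OwnedTableReference`; the point is only that cloning the `Arc` version bumps reference counts rather than copying string data.

```rust
use std::sync::Arc;

/// Hypothetical field that owns its strings: cloning copies the bytes.
#[derive(Debug, Clone)]
pub struct OwnedField {
    pub qualifier: String,
    pub name: String,
}

/// Hypothetical field that shares its strings: cloning bumps two counters.
#[derive(Debug, Clone)]
pub struct SharedField {
    pub qualifier: Arc<str>,
    pub name: Arc<str>,
}

/// True when both fields point at the same underlying string allocations.
pub fn shares_storage(a: &SharedField, b: &SharedField) -> bool {
    Arc::ptr_eq(&a.qualifier, &b.qualifier) && Arc::ptr_eq(&a.name, &b.name)
}
```

A clone of a `SharedField` satisfies `shares_storage` with its original, whereas a clone of an `OwnedField` holds freshly allocated copies of both strings.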

matthewmturner · 2024-01-05T14:56:45Z

@alamb sorry for the delay here. I went down a rabbit hole of trying to get some good memory / allocation benchmarks, as I really wanted to be able to measure / compare the cause (allocations) instead of the symptom (time). I made good progress but don't want to hold this up any longer and can continue that work separately.

The low-hanging fruit, and what this issue was created for, seems to be updating those function calls, so I think I will start with that, and separately we can look into updating how schemas are handled, if that's okay with you.

matthewmturner · 2024-01-05T16:04:36Z

Just from updating the merge function, we already see considerable improvements.

alamb · 2024-01-05T20:35:58Z

> @alamb sorry for the delay here. I went down a rabbit hole of trying to get some good memory / allocation benchmarks, as I really wanted to be able to measure / compare the cause (allocations) instead of the symptom (time). I made good progress but don't want to hold this up any longer and can continue that work separately.
>
> The low-hanging fruit, and what this issue was created for, seems to be updating those function calls, so I think I will start with that, and separately we can look into updating how schemas are handled, if that's okay with you.

Sounds like a great plan -- thank you!

matthewmturner · 2024-01-12T16:00:53Z

@alamb
I've been looking into this more for places where we can replace unused results with booleans, but nothing stood out (let me know if you know of any or your intuition says otherwise). I've also been using the great analysis from @zeodotr in #7698 (comment) to guide some of my review.

A couple of things:

I looked at optimization 6 from @zeodotr's list and wasn't able to find columnize_expr as a hot spot in the context of creating a physical plan (I tried reproducing on a wide table with several aggregates), which I believe is the use case they had (I didn't create 3000+ aggregates like they have, though). It shows up as ~3% of the CPU time for creating the unoptimized logical plan.
I profiled the benchmark for a simple query on a wide table (700 columns), and a significant share of the CPU time now comes from has_column_with_qualified_name (first screenshot below): ~87% when creating the physical plan and 66% when creating the unoptimized logical plan (second screenshot).

Given that this seems to be a hotspot for wide tables, do you think the best next step would be to improve lookup time by adding a btree (or whatever), or should we improve the foundation and work on updating the schema first? From what I've seen, updating the schema may make adding the index easier, so that may be a good start.
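One way the lookup index could be sketched, assuming a hash map rather than a btree and a schema reduced to `(qualifier, name)` pairs. `IndexedSchema` is hypothetical and not part of DataFusion; it maps each column name to the positions of the fields with that name, so a qualified lookup only scans the few candidates sharing the name instead of all 700 columns.

```rust
use std::collections::HashMap;

/// Hypothetical schema with a name index; not a DataFusion type.
pub struct IndexedSchema {
    fields: Vec<(Option<String>, String)>,
    /// column name -> positions in `fields` that carry that name
    by_name: HashMap<String, Vec<usize>>,
}

impl IndexedSchema {
    pub fn new(fields: Vec<(Option<String>, String)>) -> Self {
        let mut by_name: HashMap<String, Vec<usize>> = HashMap::new();
        for (i, (_, name)) in fields.iter().enumerate() {
            by_name.entry(name.clone()).or_default().push(i);
        }
        Self { fields, by_name }
    }

    /// Hash lookup by name, then a scan over only the same-named candidates.
    pub fn has_column_with_qualified_name(&self, qualifier: &str, name: &str) -> bool {
        self.by_name.get(name).map_or(false, |positions| {
            positions
                .iter()
                .any(|&i| self.fields[i].0.as_deref() == Some(qualifier))
        })
    }

    pub fn has_column_with_unqualified_name(&self, name: &str) -> bool {
        self.by_name.contains_key(name)
    }
}
```

The trade-off is that the index must be built whenever a schema is created, which is part of why making schemas cheaper to create and copy matters for this approach.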

alamb · 2024-01-12T20:05:38Z

> I profiled the benchmark for a simple query on a wide table (700 columns), and a significant share of the CPU time now comes from has_column_with_qualified_name (first screenshot below): ~87% when creating the physical plan and 66% when creating the unoptimized logical plan (second screenshot).
>
> Given that this seems to be a hotspot for wide tables, do you think the best next step would be to improve lookup time by adding a btree (or whatever), or should we improve the foundation and work on updating the schema first? From what I've seen, updating the schema may make adding the index easier, so that may be a good start.

Yes I agree getting DFSchema into better shape (e.g. not actually copying so many things) would likely make this task easier

It also looks like has_column_with_qualified_name is always being called from DFSchema::merge. I wonder if we can figure out why that needs to be called so much. My bet is that most of the call sites don't actually add any new fields. Maybe we can quickly check whether the pass made any changes to the children; if it didn't, there is no need to call DFSchema::merge.

Or maybe we can find some way to quickly compare if two schemas are the same 🤔
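One possible sketch of such a quick comparison, assuming schemas are held behind `Arc`: a pointer comparison answers "same schema" in constant time when both handles refer to the same allocation, with a field-by-field comparison as the fallback. `Schema`, `same_schema`, and `merge_if_changed` are hypothetical names for illustration only.

```rust
use std::sync::Arc;

/// Hypothetical schema reduced to a list of column names.
#[derive(Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
}

/// Fast path: identical allocation. Slow path: compare contents.
pub fn same_schema(a: &Arc<Schema>, b: &Arc<Schema>) -> bool {
    Arc::ptr_eq(a, b) || a == b
}

/// Skip the merge entirely when the child schema adds nothing new.
pub fn merge_if_changed(current: &Arc<Schema>, child: &Arc<Schema>) -> Arc<Schema> {
    if same_schema(current, child) {
        return Arc::clone(current);
    }
    let mut fields = current.fields.clone();
    for f in &child.fields {
        if !fields.contains(f) {
            fields.push(f.clone());
        }
    }
    Arc::new(Schema { fields })
}
```

When a pass leaves the children untouched and they still share the original `Arc`, the pointer check alone is enough to skip the merge.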

comphead · 2024-07-09T21:52:59Z

I think this can be closed @alamb? The original issue was resolved and has_column returns bool instead of error

comphead · 2024-07-09T21:55:26Z

@alamb i'm closing it, feel free to reopen it if needed

alamb added the enhancement (New feature or request) and good first issue (Good for newcomers) labels Feb 16, 2023

alamb mentioned this issue Feb 16, 2023

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

Closed

alamb mentioned this issue Feb 17, 2023

Fix the potential bug of check_all_column_from_schema #5287

Merged

jackwener assigned suxiaogang223 Feb 17, 2023

suxiaogang223 mentioned this issue Feb 18, 2023

[feat]:fast check has column #5328

Merged

alamb mentioned this issue Mar 18, 2023

[Epic] A collection of issues to improve planning performance / speed / efficiency #5637

Open

15 tasks

This was referenced Sep 11, 2023

Performance Regression: Backtraces in errors slow down planning time (Expensive backtraces) #7522

Closed

Improve optimizer performance by not using Errors in the happy path #7552

Closed

matthewmturner mentioned this issue Jan 5, 2024

Minor: Use faster check for column name in schema merge #8765

Merged

matthewmturner mentioned this issue Jan 12, 2024

Make DfSchema wrap SchemaRef #4680

Closed

comphead mentioned this issue Feb 2, 2024

WIP [Performance] Optimize DFSchema search by field #9104

Closed

comphead closed this as completed Jul 9, 2024
