Build time regression #14256

Open
waynexia opened this issue Jan 24, 2025 · 14 comments

@waynexia
Member

Is your feature request related to a problem or challenge?

We observed a large build time increase after upgrading datafusion (GreptimeTeam/greptimedb#5417). I ran a script to measure the build time day by day since v38.0, and here is the result:

(chart: build time per day since v38.0)

The build time keeps increasing and has almost doubled since v38.0. Given that the codebase keeps growing, an upward trend is expected, but there are also some obvious "plateaus" and "jumps", which might be abnormal.

This is the raw data:

build-times.csv

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@waynexia waynexia added the enhancement New feature or request label Jan 24, 2025
@waynexia
Member Author

A few more details:

The build command I use is cargo build --release --timings --lib --quiet

The latest timing report (rename it to .html; GitHub restricts attachment file types): Cargo Timing Jan 23 2025.txt

And a screenshot:
(image: cargo --timings report)

The datafusion (core) crate spends a long stretch compiling on a single thread, which is likely a major reason for the slow build. I can observe the same bottleneck in the downstream greptimedb project as well.

@jayzhan211
Contributor

After physical-optimizer, datasource could be a potential target to move out of core.

@findepi
Member

findepi commented Jan 24, 2025

> datasource could be a potential target to move out of core.

related

> The build time keeps increasing and has almost doubled since v38.0. Given that the codebase keeps growing, an upward trend is expected

Did the amount of code double since v38? I doubt that, but @waynexia can you maybe chart build time against the amount of code?
Also, can you make sure there is no bias in the measurement? If you build in reverse order and run cargo clean between each step, do you get the same results? (I believe we regressed on build time, but let's double-check the facts before jumping to conclusions.)

and "jumps", which might be abnormal.

The biggest jump is at the beginning of August 2024.
Can you perhaps bisect those few days to determine the particular change that caused this?

Is the X axis position based on the commit date of the checked-out tip commit?

@alamb
Contributor

alamb commented Jan 24, 2025

> After physical-optimizer, datasource could be a potential target to move out of core.

Yes, 100%: splitting out datasource is the next thing I would love to see (and I think it will help build times massively).

Thank you @waynexia for working on this project. It will be appreciated by all 🙏

As @jayzhan211 says, once we complete this epic (@buraksenn and @berkaysynnada are pretty close), I can help organize an effort to break out the data sources.

@alamb
Contributor

alamb commented Jan 25, 2025

100%: making build times better would be really appreciated.

@waynexia
Member Author

Looking forward to the ongoing refactors!

> Also, can you make sure there is no bias in the measurement? If you build in reverse order and run cargo clean between each step, do you get the same results? (I believe we regressed on build time, but let's double-check the facts before jumping to conclusions.)

Here is the script; the target/release directory is removed before every build:

#!/bin/bash

# Configuration
OUTPUT_FILE="build-times.csv"
START_DATE=$(date -d "8 months ago" +%Y-%m-%d)
END_DATE=$(date +%Y-%m-%d)

# Initialize CSV output
echo "date,commit_hash,build_time_seconds,success" > "$OUTPUT_FILE"


# Iterate day-by-day
current_date="$START_DATE"
while [[ "$current_date" < "$END_DATE" ]]; do
  # Find latest commit for the day
  git checkout main
  commit_hash=$(git log --until="$current_date 23:59" --first-parent -1 --format=%H)

  if [[ -n "$commit_hash" ]]; then
    echo "Processing $current_date ($commit_hash)..."
    
    # Checkout commit (detached HEAD)
    git checkout --force "$commit_hash" > /dev/null 2>&1
    
    # Clean and build
    rm -rf target/release
    cargo metadata --quiet > /dev/null
    start_time=$(date +%s)
    cargo build --release --timings --lib --quiet
    build_exit_code=$?
    end_time=$(date +%s)
    build_time=$((end_time - start_time))

    # Record results
    echo "$current_date,$commit_hash,$build_time,$build_exit_code" >> "$OUTPUT_FILE"
  else
    echo "No commit found for $current_date"
  fi

  # Move to next day
  current_date=$(date -d "$current_date + 1 day" +%Y-%m-%d)
done

# Return to original branch
git checkout main 

@waynexia
Member Author

I located the biggest jump: it comes from #11681, commit 3438b35. But I can't tell the reason. I went through it and it's just a normal logic-change patch; no new dependency or feature was introduced 🤔

@alamb
Contributor

alamb commented Jan 26, 2025

> I located the biggest jump: it comes from #11681, commit 3438b35. But I can't tell the reason. I went through it and it's just a normal logic-change patch; no new dependency or feature was introduced 🤔

I wonder if the new analyzer rule caused the jump 🤔 (datafusion/optimizer/src/analyzer/expand_wildcard_rule.rs)

One theory could be that the tree node walking code is substantial and slow to compile (it is a lot of templates 🤔 )
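
To make that theory concrete, here is a minimal sketch (hypothetical types, not the actual TreeNode API) of why generic tree-walking code multiplies compile work: the whole recursive walk is re-instantiated for every distinct closure type passed into it.

// Simplified stand-in for a TreeNode-style API; not DataFusion's real code.
#[derive(Debug)]
enum Node {
    Leaf(i64),
    Add(Box<Node>, Box<Node>),
}

impl Node {
    // Generic over F: each call site with a different closure type produces
    // a separate monomorphized copy of this entire recursive function.
    fn transform<F: FnMut(i64) -> i64>(self, f: &mut F) -> Node {
        match self {
            Node::Leaf(v) => Node::Leaf(f(v)),
            Node::Add(l, r) => Node::Add(
                Box::new((*l).transform(f)),
                Box::new((*r).transform(f)),
            ),
        }
    }
}

fn main() {
    let expr = Node::Add(Box::new(Node::Leaf(1)), Box::new(Node::Leaf(2)));
    // Two distinct closure types => two monomorphized copies of `transform`.
    let doubled = expr.transform(&mut |v| v * 2);
    let negated = doubled.transform(&mut |v| -v);
    println!("{negated:?}");
}

With many rules each passing their own closures into such generic walkers, the amount of generated code scales with the number of distinct closures, not just with the size of the walker itself.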

However, a new analyzer rule in datafusion-optimizer doesn't explain longer compile times in the datafusion crate 🤔

@waynexia
Member Author

After removing WildcardOptions (by replacing it with an empty struct) I can see the build time drop. Removing the rule itself and the change in core doesn't help. It looks like the change to Expr is the root cause, though I don't know why yet...

One guess is that both Expr and LogicalPlan are key enums used in many places, especially as type parameters (like in the TreeNode APIs), so a slight change to them gets amplified a lot.

I have some thoughts for this specific case, like moving the expansion somewhere else, or changing some Fn type parameters to trait objects. But I can't tell how much that will help, or what the balance between runtime overhead and build time would be. Or we could hold off on those refactors until we figure out why WildcardOptions causes this.
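
For reference, a minimal sketch of the "Fn type parameter to trait object" idea (hypothetical function names, not DataFusion's actual API): the generic version gets instantiated once per closure type, while the dyn version is compiled exactly once, at the cost of an indirect call.

// Generic version: one copy of the whole body per distinct closure type F.
fn rewrite_generic<F: Fn(i64) -> i64>(values: &mut [i64], f: F) {
    for v in values.iter_mut() {
        *v = f(*v);
    }
}

// Trait-object version: compiled once; callers pay for dynamic dispatch instead.
fn rewrite_dyn(values: &mut [i64], f: &dyn Fn(i64) -> i64) {
    for v in values.iter_mut() {
        *v = f(*v);
    }
}

fn main() {
    let mut a = [1_i64, 2, 3];
    rewrite_generic(&mut a, |v| v + 1);   // instantiation #1
    rewrite_generic(&mut a, |v| v * 2);   // instantiation #2
    rewrite_dyn(&mut a, &|v: i64| v - 1); // no new instantiation
    println!("{a:?}");
}

Whether that indirect call is acceptable in hot paths is exactly the runtime-overhead vs. build-time balance mentioned above.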

@alamb
Contributor

alamb commented Jan 29, 2025

> After removing WildcardOptions (by replacing it with an empty struct) I can see the build time drop. Removing the rule itself and the change in core doesn't help. It looks like the change to Expr is the root cause, though I don't know why yet...

I wonder if the issue is that WildcardOptions is a large struct that is not Boxed -- so every Expr needs to be large enough to hold it, which may have increased the size of all Exprs 🤔
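
As a rough illustration of that theory (hypothetical types, not the real Expr / WildcardOptions definitions): an enum is as large as its largest variant, so one big unboxed payload inflates every value of the enum, while boxing it keeps the variant pointer-sized.

use std::mem::size_of;

// Hypothetical stand-ins for a large options struct and an Expr-like enum.
#[allow(dead_code)]
struct BigOptions {
    data: [u64; 16], // 128 bytes of payload
}

#[allow(dead_code)]
enum ExprUnboxed {
    Column(u32),
    Wildcard(BigOptions), // the largest variant sets the size of the whole enum
}

#[allow(dead_code)]
enum ExprBoxed {
    Column(u32),
    Wildcard(Box<BigOptions>), // only a pointer is stored inline
}

fn main() {
    // Every ExprUnboxed value reserves space for BigOptions, even Column;
    // the boxed version stays at a pointer plus discriminant.
    println!("unboxed: {} bytes", size_of::<ExprUnboxed>()); // ~136 on 64-bit
    println!("boxed:   {} bytes", size_of::<ExprBoxed>());   // ~16 on 64-bit
}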

@alamb
Contributor

alamb commented Jan 29, 2025

Thanks to some great work from @buraksenn, @berkaysynnada, and @logan-keede, we have completed extracting the physical optimizer rules.

Perhaps it is time to organize the next major project to extract datasources / listing table from the core 🤔

@alamb
Contributor

alamb commented Jan 29, 2025

> After removing WildcardOptions (by replacing it with an empty struct) I can see the build time drop. Removing the rule itself and the change in core doesn't help. It looks like the change to Expr is the root cause, though I don't know why yet...

> I wonder if the issue is that WildcardOptions is a large struct that is not Boxed -- so every Expr needs to be large enough to hold it, which may have increased the size of all Exprs 🤔

🤔 it seems we have already Box'd it:
https://github.com/apache/datafusion/blob/60c0fb887e8723c47c268e1b578f2d78aebaefde/datafusion/expr/src/expr.rs#L316-L315

Let me see if I can find ways to make Expr smaller though

@alamb
Contributor

alamb commented Jan 29, 2025

> Let me see if I can find ways to make Expr smaller though

I can make Expr less than half the size in this PR:

I ran out of time today to run build benchmarks on it, but I will try to do it tomorrow if no one else beats me to it

@crepererum
Contributor

I have another hunch, but I still need to prove it: DataFusion uses a lot of

fn f(param: impl Into<Type>, ...) -> ... {
    let param: Type = param.into();
    ...
}

// call-site
let x: OtherType = ...;
f(x, ...);

This is mostly so that the caller can stick a lot of parameters into methods without needing to convert them to the right type first. Sometimes it's also used for backwards-compatible-ish changes [1].

In Rust, that impl construct expands roughly to

fn f<T>(param: T) -> ...
where
    T: Into<Type>
{
    let param: Type = param.into();
    ...
}

Since Rust generics are implemented using monomorphization, this leads to the following issues:

  1. type replication: The function, including its whole body, is compiled separately for every concrete parameter type T. This also means multiple copies of the LLVM IR and of the resulting machine code. Note that this covers the WHOLE body, not just the let param: Type = param.into(); part [2] (see the sketch right after this list).
  2. delayed compilation: Since the crate that contains such a method cannot know all the types T that will eventually be used, compilation happens at the call site (instead of the definition site). This also means you get even more copies of the function: if one crate defines the method and two crates use it, the two user crates compile their own copies independently.
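
A minimal sketch of point 1 (a toy function, not DataFusion code): each call with a different argument type produces its own full copy of f.

// Each distinct argument type instantiates the WHOLE body of `f`,
// not just the `.into()` conversion.
fn f(param: impl Into<String>) -> usize {
    let param: String = param.into();
    // Imagine many more lines of real work here; all of it is duplicated
    // for every concrete parameter type the callers use.
    param.trim().len()
}

fn main() {
    let a = f("borrowed str");        // instantiates f::<&str>
    let b = f(String::from("owned")); // instantiates f::<String>
    println!("{a} {b}");
}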

My guess is that while this gives "nice APIs", it causes a lot of compilation overhead both within DataFusion and for its users. I would therefore suggest that we change methods to look like this:

fn f(param: Type, ...) -> ... {
    ...
}

// call-site
let x: OtherType = ...;
f(x.into(), ...);

An intermediate solution, which requires more code and has a slightly higher compilation overhead (due to the remaining generics) but isn't as bad as the original version, would be this:

fn f(param: impl Into<Type>, ...) -> ... {
    f_impl(param.into(), ...)
}

// private helper method
// may want to use `#[inline(never)]` here
fn f_impl(param: Type, ...) -> ... {
    ...
}

// call-site
let x: OtherType = ...;
f(x, ...);

Technically this also applies to impl IntoIterator<Item = Type> and impl Iterator<Item = Type>, but there it's usually harder to fix. There we would have the following options, all of which are slower than the original one due to dynamic dispatch:

// original version
fn g(it: impl IntoIterator<Item = Type>) {}

// unsized locals
// requires https://doc.rust-lang.org/beta/unstable-book/language-features/unsized-locals.html
fn g(it: dyn Iterator<Item = Type>) {}

// box version
fn g(it: Box<dyn Iterator<Item = Type>>) {}

// pass mut ref
fn g(it: &mut dyn Iterator<Item = Type>) {}

So I would leave the iterator cases out for now.

Footnotes

  [1] That's strictly speaking NOT true: changing param: String to param: impl Into<String> is NOT backwards compatible, since it affects type inference at the call site.

  [2] Replicating only part of the code (instead of the whole body) might one day be possible via polymorphization, but I don't expect that to happen within the next 3 years.
