Build time regression #14256

Open
waynexia opened this issue Jan 24, 2025 · 14 comments

@waynexia
Member

Is your feature request related to a problem or challenge?

We observed a large build time increase after upgrading datafusion (GreptimeTeam/greptimedb#5417). I ran a script to measure the build time day by day since v38.0, and here is the result:

(chart: build time per day since v38.0)

The build time keeps increasing and has almost doubled since v38.0. Given that the codebase keeps growing, an upward trend is expected, but there are also some obvious "plateaus" and "jumps", which might be abnormal.

This is the raw data:

build-times.csv

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@waynexia waynexia added the enhancement New feature or request label Jan 24, 2025
@waynexia
Member Author

A few more details:

The build command I use is cargo build --release --timings --lib --quiet

The latest timing report (rename it to .html; GitHub restricts attachment file types): Cargo Timing Jan 23 2025.txt

And a screenshot:
(image: cargo --timings report)

The datafusion (core) crate spends a long stretch compiling on a single thread, which is likely a major reason for the slow build. I can observe the same bottleneck in the downstream greptimedb project as well.

@jayzhan211
Contributor

After physical-optimizer, datasource could be a potential target to move out of core.

@findepi
Member

findepi commented Jan 24, 2025

> datasource could be a potential target to move out of core.

related

> The build time keeps increasing and has almost doubled since v38.0. Given that the codebase keeps growing, an upward trend is expected

Did the amount of code double since v38? I doubt that, but @waynexia can you maybe chart build time against the amount of code?
Also, can you make sure there is no bias in the measurement? If you build in reverse order and run cargo clean between each step, do you get the same results? (I believe we regressed on build time, but let's double-check the facts before jumping to conclusions.)

and "jumps", which might be abnormal.

The biggest jump is at the beginning of August 2024.
Can you perhaps bisect those few days to determine the particular change that caused this?

Is the X axis position based on the commit date of the checked-out tip commit?

@alamb
Contributor

alamb commented Jan 24, 2025

> After physical-optimizer, datasource could be a potential target to move out of core.

Yes, 100%: splitting out datasource is the next thing I would love to see (and I think it will help build times massively).

Thank you @waynexia for working on this project. It will be appreciated by all 🙏

As @jayzhan211 says, once we complete this epic (@buraksenn and @berkaysynnada are pretty close), I can help organize an effort to break out the data sources.

@alamb
Contributor

alamb commented Jan 25, 2025

100%: making build times better would be really appreciated.

@waynexia
Member Author

Looking forward to the ongoing refactors!

> Also, can you make sure there is no bias in the measurement? If you build in reverse order and run cargo clean between each step, do you get the same results? (I believe we regressed on build time, but let's double-check the facts before jumping to conclusions.)

Here is the script; the target/release directory is removed before every build:

#!/bin/bash

# Configuration
OUTPUT_FILE="build-times.csv"
START_DATE=$(date -d "8 months ago" +%Y-%m-%d)
END_DATE=$(date +%Y-%m-%d)

# Initialize CSV output
echo "date,commit_hash,build_time_seconds,success" > "$OUTPUT_FILE"


# Iterate day-by-day
current_date="$START_DATE"
while [[ "$current_date" < "$END_DATE" ]]; do
  # Find latest commit for the day
  git checkout main
  commit_hash=$(git log --until="$current_date 23:59" --first-parent -1 --format=%H)

  if [[ -n "$commit_hash" ]]; then
    echo "Processing $current_date ($commit_hash)..."
    
    # Checkout commit (detached HEAD)
    git checkout --force "$commit_hash" > /dev/null 2>&1
    
    # Clean and build
    rm -rf target/release
    cargo metadata --quiet > /dev/null
    start_time=$(date +%s)
    cargo build --release --timings --lib --quiet
    build_exit_code=$?
    end_time=$(date +%s)
    build_time=$((end_time - start_time))

    # Record results
    echo "$current_date,$commit_hash,$build_time,$build_exit_code" >> "$OUTPUT_FILE"
  else
    echo "No commit found for $current_date"
  fi

  # Move to next day
  current_date=$(date -d "$current_date + 1 day" +%Y-%m-%d)
done

# Return to original branch
git checkout main 

@waynexia
Member Author

I located the biggest jump: it comes from #11681, commit 3438b35. But I can't tell the reason. I went through it and it's just a normal logic-change patch; no new dependency or feature was introduced 🤔

@alamb
Contributor

alamb commented Jan 26, 2025

> I located the biggest jump: it comes from #11681, commit 3438b35. But I can't tell the reason. I went through it and it's just a normal logic-change patch; no new dependency or feature was introduced 🤔

I wonder if the new analyzer rule caused the jump 🤔 (datafusion/optimizer/src/analyzer/expand_wildcard_rule.rs)

One theory could be that the tree node walking code is substantial and slow to compile (it is a lot of templates 🤔 )
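
To make that theory concrete, here is a minimal sketch (hypothetical types, not the actual TreeNode API) of why generic tree-walking code multiplies compile work: the whole recursive walk is re-instantiated for every distinct closure type passed into it.

// Simplified stand-in for a TreeNode-style API; not DataFusion's real code.
#[derive(Debug)]
enum Node {
    Leaf(i64),
    Add(Box<Node>, Box<Node>),
}

impl Node {
    // Generic over F: each call site with a different closure type produces
    // a separate monomorphized copy of this entire recursive function.
    fn transform<F: FnMut(i64) -> i64>(self, f: &mut F) -> Node {
        match self {
            Node::Leaf(v) => Node::Leaf(f(v)),
            Node::Add(l, r) => Node::Add(
                Box::new((*l).transform(f)),
                Box::new((*r).transform(f)),
            ),
        }
    }
}

fn main() {
    let expr = Node::Add(Box::new(Node::Leaf(1)), Box::new(Node::Leaf(2)));
    // Two distinct closure types => two monomorphized copies of `transform`.
    let doubled = expr.transform(&mut |v| v * 2);
    let negated = doubled.transform(&mut |v| -v);
    println!("{negated:?}");
}

With many rules each passing their own closures into such generic walkers, the amount of generated code scales with the number of distinct closures, not just with the size of the walker itself.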

However, a new analyzer rule in datafusion-optimizer doesn't explain longer compile times in the datafusion crate 🤔

@waynexia
Member Author

After removing WildcardOptions (by replacing it with an empty struct) I can see the build time drop. Removing the rule itself and the change in core doesn't help. It looks like the change to Expr is the root cause, though I don't know why yet...

One guess is that both Expr and LogicalPlan are key enums used in many places, especially as type parameters (like in the TreeNode APIs), so a slight change to them gets amplified a lot.

I have some thoughts for this specific case, like moving the expansion somewhere else, or changing some Fn type parameters to trait objects. But I can't tell how much that will help, or what the balance between runtime overhead and build time would be. Or we could hold off on those refactors until we figure out why WildcardOptions causes this.
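
For reference, a minimal sketch of the "Fn type parameter to trait object" idea (hypothetical function names, not DataFusion's actual API): the generic version gets instantiated once per closure type, while the dyn version is compiled exactly once, at the cost of an indirect call.

// Generic version: one copy of the whole body per distinct closure type F.
fn rewrite_generic<F: Fn(i64) -> i64>(values: &mut [i64], f: F) {
    for v in values.iter_mut() {
        *v = f(*v);
    }
}

// Trait-object version: compiled once; callers pay for dynamic dispatch instead.
fn rewrite_dyn(values: &mut [i64], f: &dyn Fn(i64) -> i64) {
    for v in values.iter_mut() {
        *v = f(*v);
    }
}

fn main() {
    let mut a = [1_i64, 2, 3];
    rewrite_generic(&mut a, |v| v + 1);   // instantiation #1
    rewrite_generic(&mut a, |v| v * 2);   // instantiation #2
    rewrite_dyn(&mut a, &|v: i64| v - 1); // no new instantiation
    println!("{a:?}");
}

Whether that indirect call is acceptable in hot paths is exactly the runtime-overhead vs. build-time balance mentioned above.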

@alamb
Contributor

alamb commented Jan 29, 2025

> After removing WildcardOptions (by replacing it with an empty struct) I can see the build time drop. Removing the rule itself and the change in core doesn't help. It looks like the change to Expr is the root cause, though I don't know why yet...

I wonder if the issue is that WildcardOptions is a large struct that is not Boxed -- so every Expr needs to be large enough to hold it, which may have increased the size of all Exprs 🤔
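
As a rough illustration of that theory (hypothetical types, not the real Expr / WildcardOptions definitions): an enum is as large as its largest variant, so one big unboxed payload inflates every value of the enum, while boxing it keeps the variant pointer-sized.

use std::mem::size_of;

// Hypothetical stand-ins for a large options struct and an Expr-like enum.
#[allow(dead_code)]
struct BigOptions {
    data: [u64; 16], // 128 bytes of payload
}

#[allow(dead_code)]
enum ExprUnboxed {
    Column(u32),
    Wildcard(BigOptions), // the largest variant sets the size of the whole enum
}

#[allow(dead_code)]
enum ExprBoxed {
    Column(u32),
    Wildcard(Box<BigOptions>), // only a pointer is stored inline
}

fn main() {
    // Every ExprUnboxed value reserves space for BigOptions, even Column;
    // the boxed version stays at a pointer plus discriminant.
    println!("unboxed: {} bytes", size_of::<ExprUnboxed>()); // ~136 on 64-bit
    println!("boxed:   {} bytes", size_of::<ExprBoxed>());   // ~16 on 64-bit
}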

@alamb
Contributor

alamb commented Jan 29, 2025

Thanks to some great work from @buraksenn, @berkaysynnada, and @logan-keede, we have completed extracting the physical optimizer rules.

Perhaps it is time to organize the next major project to extract datasources / listing table from the core 🤔

@alamb
Contributor

alamb commented Jan 29, 2025

> After removing WildcardOptions (by replacing it with an empty struct) I can see the build time drop. Removing the rule itself and the change in core doesn't help. It looks like the change to Expr is the root cause, though I don't know why yet...

> I wonder if the issue is that WildcardOptions is a large struct that is not Boxed -- so every Expr needs to be large enough to hold it, which may have increased the size of all Exprs 🤔

🤔 it seems we have already Box'd it:
https://github.com/apache/datafusion/blob/60c0fb887e8723c47c268e1b578f2d78aebaefde/datafusion/expr/src/expr.rs#L316-L315

Let me see if I can find ways to make Expr smaller though

@alamb
Contributor

alamb commented Jan 29, 2025

> Let me see if I can find ways to make Expr smaller though

I can make Expr less than half the size in this PR:

I ran out of time today to run build benchmarks on it, but I will try to do it tomorrow if no one else beats me to it

@crepererum
Contributor

I have another hunch, but I still need to prove it: DataFusion uses a lot of

fn f(param: impl Into<Type>, ...) -> ... {
    let param: Type = param.into();
    ...
}

// call-site
let x: OtherType = ...;
f(x, ...);

This is mostly so that the caller can stick a lot of parameters into methods without needing to convert them to the right type first. Sometimes it's also used for backwards-compatible-ish changes [1].

In Rust, that impl construct expands roughly to

fn f<T>(param: T) -> ...
where
    T: Into<Type>
{
    let param: Type = param.into();
    ...
}

Since Rust generics are implemented using monomorphization, this leads to the following issues:

  1. type replication: The function, including its whole body, is compiled separately for every concrete parameter type T. This also means multiple copies of the LLVM IR and of the resulting machine code. Note that this covers the WHOLE body, not just the let param: Type = param.into(); part [2] (see the sketch right after this list).
  2. delayed compilation: Since the crate that contains such a method cannot know all the types T that will eventually be used, compilation happens at the call site (instead of the definition site). This also means you get even more copies of the function: if one crate defines the method and two crates use it, the two user crates compile their own copies independently.
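
A minimal sketch of point 1 (a toy function, not DataFusion code): each call with a different argument type produces its own full copy of f.

// Each distinct argument type instantiates the WHOLE body of `f`,
// not just the `.into()` conversion.
fn f(param: impl Into<String>) -> usize {
    let param: String = param.into();
    // Imagine many more lines of real work here; all of it is duplicated
    // for every concrete parameter type the callers use.
    param.trim().len()
}

fn main() {
    let a = f("borrowed str");        // instantiates f::<&str>
    let b = f(String::from("owned")); // instantiates f::<String>
    println!("{a} {b}");
}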

My guess is that while this gives "nice APIs", it causes a lot of compilation overhead both within DataFusion and for its users. I would therefore suggest that we change methods to look like this:

fn f(param: Type, ...) -> ... {
    ...
}

// call-site
let x: OtherType = ...;
f(x.into(), ...);

An intermediate solution, which requires more code and has a slightly higher compilation overhead (due to the remaining generics) but isn't as bad as the original version, would be this:

fn f(param: impl Into<Type>, ...) -> ... {
    f_impl(param.into(), ...)
}

// private helper method
// may want to use `#[inline(never)]` here
fn f_impl(param: Type, ...) -> ... {
    ...
}

// call-site
let x: OtherType = ...;
f(x, ...);

Technically this also applies to impl IntoIterator<Item = Type> and impl Iterator<Item = Type>, but there it's usually harder to fix. There we would have the following options, all of which are slower than the original one due to dynamic dispatch:

// original version
fn g(it: impl IntoIterator<Item = Type>) {}

// unsized locals
// requires https://doc.rust-lang.org/beta/unstable-book/language-features/unsized-locals.html
fn g(it: dyn Iterator<Item = Type>) {}

// box version
fn g(it: Box<dyn Iterator<Item = Type>>) {}

// pass mut ref
fn g(it: &mut dyn Iterator<Item = Type>) {}

So I would leave the iterator cases out for now.

Footnotes

  [1] That's strictly speaking NOT true: changing param: String to param: impl Into<String> is NOT backwards compatible, since it affects type inference at the call site.

  [2] Replicating only part of the code (instead of the whole body) might one day be possible via polymorphization, but I don't expect that to happen within the next 3 years.
