Add Flux transformer benchmarking #870

Open · wants to merge 7 commits into main from flux-transformer-benchmark
Conversation

sogartar (Contributor) commented on Jan 27, 2025

Also add some more general functionality to check benchmark results against baseline results. This requires the Google Benchmark compare.py tool, which is not part of the pip package; that is why I added the repo as a git submodule. The tool does a statistical comparison between benchmarks with a proper p-value calculation. I don't think we should roll our own.

Adds a new nightly CI job that should contain nightly tests and benchmarks that do not have their own category, the way Llama does now.

Example log from a benchmark comparison:

Comparing /home/bpetkant/ws/sharktank/repo/sharktank/tests/models/flux/flux_transformer_baseline_benchmark_result_mi300x.json to /home/bpetkant/ws/sharktank/experiments/flux/benchmark/benchmark_result.json
Benchmark                                                      Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------
BM_forward_bs1/process_time/real_time                       +0.0451         +0.0932           549           574           564           616
...
BM_forward_bs1/process_time/real_time                       +0.2122         +0.2508           554           672           569           711
BM_forward_bs1/process_time/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 30 vs 30
BM_forward_bs1/process_time/real_time_mean                  +0.0457         +0.0922           552           577           567           619
BM_forward_bs1/process_time/real_time_median                +0.0400         +0.0870           552           574           567           616
BM_forward_bs1/process_time/real_time_stddev               +12.3302        +11.8167             1            18             1            17
BM_forward_bs1/process_time/real_time_cv                   +11.7470        +10.7347             0             0             0             0
OVERALL_GEOMEAN                                             +0.0453         +0.0918             1             1             1             1

...

E           AssertionError: Benchmark contender mean real time 577.4150719943767 ms is worse than baseline 552.1548928634729 ms.
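
For illustration only (this is not the code from the PR), a minimal sketch of the kind of check behind the assertion above, assuming Google Benchmark's JSON output format and placeholder file names; the PR itself relies on compare.py's U test rather than a bare mean comparison:

# Illustrative sketch, not the PR's implementation: read two Google Benchmark
# JSON result files and fail when the contender's mean real time regresses.
import json


def mean_real_time_ms(result_path: str) -> float:
    with open(result_path) as f:
        result = json.load(f)
    # Google Benchmark marks aggregate rows with an "aggregate_name" field.
    for entry in result["benchmarks"]:
        if entry.get("aggregate_name") == "mean":
            return entry["real_time"]
    raise ValueError(f"No mean aggregate found in {result_path}")


baseline_mean = mean_real_time_ms("flux_transformer_baseline_benchmark_result_mi300x.json")
contender_mean = mean_real_time_ms("benchmark_result.json")
assert contender_mean <= baseline_mean, (
    f"Benchmark contender mean real time {contender_mean} ms "
    f"is worse than baseline {baseline_mean} ms."
)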

@sogartar sogartar force-pushed the flux-transformer-benchmark branch from 23e4919 to 913e9c0 Compare January 27, 2025 14:34
sogartar (Contributor Author):

This PR depends on iree-org/iree-turbine#418.

@sogartar sogartar marked this pull request as ready for review January 27, 2025 14:39
@sogartar sogartar marked this pull request as draft January 27, 2025 15:03
@sogartar sogartar force-pushed the flux-transformer-benchmark branch from 913e9c0 to 47e648a Compare January 27, 2025 15:11
@sogartar sogartar marked this pull request as ready for review January 27, 2025 15:25
sogartar (Contributor Author):

I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.

Comment on lines 88 to 91
pytest -n 4 sharktank/ --durations=10
pytest \
  -n 4 \
  --durations=10 \
  -m "not expensive" \
  sharktank/
Member:

Be careful when adding new tests, particularly if they are expensive. Developers should be able to run pytest and expect a reasonable default set of tests to run. Developers will not remember opt-out marks like this.

The filtering also didn't work here? https://github.com/nod-ai/shark-ai/actions/runs/13079961026/job/36501056330?pr=870 is still running on standard github-hosted runners after 2h+.

Contributor Author:

The test failure is due to this required PR iree-org/iree-turbine#418.

Contributor Author:

I changed the default marker selection in pyproject.toml to exclude tests marked as expensive.

Comment on lines +1 to +3
[submodule "third_party/benchmark"]
path = third_party/benchmark
url = https://github.com/google/benchmark
Member:

Is this actually used? You have google-benchmark in Python test requirements already.

A source dependency on the C++ library could be added to shortfin as needed, probably via FetchContent here:

if(SHORTFIN_BUILD_TESTS)
  if (NOT SHORTFIN_BUNDLE_DEPS AND NOT SHORTFIN_IREE_SOURCE_DIR)
    # For now we use gtest shipped alongside with IREE.
    FetchContent_Declare(
      googletest
      URL https://github.com/google/googletest/archive/03597a01ee50ed33e9dfd640b249b4be3799d395.zip
    )
    # For Windows: Prevent overriding the parent project's compiler/linker settings
    set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
    FetchContent_MakeAvailable(googletest)
  endif()
  include(GoogleTest)
  enable_testing()
  add_custom_target(shortfin_testdata_deps)
endif()

Contributor Author:

I mistakenly left the pip package dependency in; it is removed now. Unfortunately, the script to compare benchmark results is not part of the pip package, and this script is used in the Python test.
See https://github.com/google/benchmark/blob/main/docs/tools.md#comparepy
That is why the submodule is used.
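
For reference, a minimal sketch (paths and details are assumptions, not necessarily how this PR wires up its helper) of invoking the submodule's compare.py from Python:

# Hypothetical sketch: shell out to compare.py from the third_party/benchmark
# submodule. The repository-root computation is an assumption about layout.
import subprocess
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]
COMPARE_PY = REPO_ROOT / "third_party" / "benchmark" / "tools" / "compare.py"


def iree_benchmark_compare(args: list[str]) -> str:
    """Run compare.py with the given arguments and return its stdout."""
    process = subprocess.run(
        [sys.executable, str(COMPARE_PY), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return process.stdout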

Member:

Ah, gotcha. Unless we standardize on the googlebenchmark source across this project, I'd hesitate to add the new dependency here since it adds complexity (these benchmarks use these python libraries, these other benchmarks use a git submodule, etc.).

Member:

I'd much prefer we consolidate existing nightly workflows (eval, sglang benchmark, llama large, etc.) before adding a new one. That being said, I like the "sharktank-nightly" name better than model-specific names...

Contributor Author:

I think the CI jobs are due for a refactoring to reorganize them and reduce code duplication. I want to do that next.

Collaborator:

The advantage of keeping them separate is the ease of tracking the various nightlies across the LLM and Flux models and the sharktank/shortfin regressions. If there are tools to do this, or a CI summary on GitHub Pages, that would be great.

--verbose \
--capture=no \
--iree-hip-target=gfx942 \
--iree-device=hip://6 \
Member:

Match what other workflows do with hip devices: #725

Contributor Author:

Do we actually allocate GPUs to CI jobs?

Contributor Author:

I changed it to use hip://0.

Member:

Thanks. Generally, CI jobs should be written to run on any compatible runner machine and be easy to reference and run locally. The runners themselves should be responsible for assigning devices.
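
As a hedged sketch of that direction (the IREE_DEVICE environment variable name here is an assumption, not an existing convention in this repo):

# Hypothetical conftest.py sketch: default to hip://0 and let the runner or
# environment override the device instead of hard-coding hip://6 in a workflow.
import os


def pytest_addoption(parser):
    parser.addoption(
        "--iree-device",
        action="store",
        default=os.environ.get("IREE_DEVICE", "hip://0"),
        help="IREE device URI; CI runners can override via IREE_DEVICE.",
    )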

Comment on lines 78 to 86
    --html=out/benchmark/index.html \
    sharktank/tests

- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@4f9cc6602d3f66b9c108549d475ec49e8ef4d45e # v4.0.0
  with:
    github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
    publish_dir: ./out/benchmark
    destination_dir: ./benchmark
Member:

"benchmark" isn't a very specific name, and it also doesn't match "sharktank nightly". Consider how you want this to slot in to

Contributor Author:

Here is a PR for the GitHub Pages setup: #899.

@sogartar sogartar requested a review from ScottTodd February 3, 2025 16:48
with:
  github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
  publish_dir: ./out/sharktank-nightly
  destination_dir: ./sharktank-nightly
Collaborator:

I believe publish_dir should be ./out/sharktank_nightly/benchmark, and the same for destination_dir.
Also, we need to update sharktank-nightly to sharktank_nightly.

sogartar (Contributor Author) commented on Feb 5, 2025:

I wanted to have one step that pushes all sharktank_nightly artifacts. This workflow may have more jobs that produce artifacts in the future. I would be surprised if the action does not support whole directory trees.

Collaborator:

Sure, were you able to test this nightly on this PR?

Contributor Author:

The workflow is not yet available and I can't trigger it manually. I will trigger it once it is on main.

@sogartar sogartar force-pushed the flux-transformer-benchmark branch 2 times, most recently from 88aad03 to f88784c Compare February 13, 2025 13:55
Comment on lines 20 to 22
iree_compile_flags = [
    "--iree-hal-target-device=hip",
    "--iree-hip-target=gfx942",
Member:

Why does export.py include flags for compiling? I would expect this file to only export from huggingface / PyTorch / etc. to MLIR and not go further.

If including flags, do not overspecialize on a single target (hip) or type of device (gfx942). We should also be reducing these lists of flags down to only the most essential (the target backend and feature set), so putting 15 extra flags this far up the pipeline will make it more difficult to scale this work across other targets and models.

Contributor Author:

I moved the args to a new compile.py file. I agree that refactoring the compilation process is needed, but I think it is outside the scope of this PR. The model is currently being tuned, and these flags are a first best guess at what will perform well.

Member:

I guess that works for now. What you have in compile.py could also be a compile_flags.txt flagfile for use with --flagfile={PATH}.
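
A minimal sketch of that flagfile idea (file names are placeholders, the flag list is abbreviated, and the --flagfile usage is as suggested above):

# Hypothetical sketch: dump the flags currently kept in compile.py to a
# flagfile and pass it to iree-compile. model.mlir / model.vmfb are placeholders.
import subprocess
from pathlib import Path

iree_compile_flags = [
    "--iree-hal-target-device=hip",
    "--iree-hip-target=gfx942",
    # ... remaining flags from compile.py
]

flagfile = Path("compile_flags.txt")
flagfile.write_text("\n".join(iree_compile_flags) + "\n")

subprocess.run(
    ["iree-compile", f"--flagfile={flagfile}", "model.mlir", "-o", "model.vmfb"],
    check=True,
)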

Comment on lines +54 to +60
benchmark_compare_result = iree_benchmark_compare(
    [
        f"--dump_to_json={benchmark_compare_result_file_path}",
        "benchmarks",
        str(baseline_benchmark_result_file_path),
        benchmark_result_file_path,
    ]
)
print(benchmark_compare_result)
Member:

Please share logs of this running somewhere prominent, like the pull request description. It's hard to judge the value to cost ratio of the third_party/benchmark dep for compare.py and the 500+ line flux_transformer_baseline_benchmark_result_mi300x.json file without that.

Contributor Author:

I added an example output to the PR description.
I think compare.py is essential for a proper comparison that determines whether two benchmark results come from the same distribution.

Member:

> I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.

@geomin12 has also started refactoring the benchmarks from https://github.com/iree-org/iree/tree/main/experimental/regression_suite/shark-test-suite-models by moving them to https://github.com/iree-org/iree-test-suites (possibly in this subfolder: https://github.com/iree-org/iree-test-suites/tree/main/sharktank_models).

Given the similarity between FLUX and SDXL, we may want some of this infrastructure to move there too. Starting directly next to the code in this nod-ai/shark-ai repository probably makes sense in the near term, then as model support matures we can move some code around.

I'm mainly concerned about there being a proliferation of different test and benchmark implementations.

  • There should be one well documented way to export, compile, run, and benchmark each model.
  • There should be one central location to view the latest test and benchmark results across all projects.
  • Any developer or user should be able to reproduce test or benchmark results on their own hardware.

I think the way we get there is to streamline our export/compile/run/etc. flows and keep our release process running smoothly.


Here are a few parts of vLLM's benchmark suite to reference:

Note that test cases only specify a model name in the standard Hugging Face repository naming style, and then the project knows how to serve that model. We should be able to do the same for SDXL, Flux, Llama, etc.

If we build a serving-focused benchmark suite like vLLM's focused on datacenter GPUs, I could see it living in nod-ai/shark-ai or nod-ai/SHARK-TestSuite. If we want to support other backends or benchmark finer grained details in the IREE compiler and runtime then maybe iree-org/iree-test-suites would be better 🤔.

Member:

To be clear, I am quite happy to see more benchmarks being worked on and added. I'm just worried about there being 7 different test suites, benchmark suites, etc. all working in slightly different ways. Each one has its own advantages and disadvantages so I think we should learn what we can from each and then work to consolidate somewhere. Just need to make those decisions and stand by them.

geomin12 commented on Feb 13, 2025:

Let me take a look at this code and see if there are any ways to combine (from a quick skim, it seems almost identical: compile -> benchmark -> compare results)! I'll start writing up the generic benchmark tests, then find a simple way to get Flux on there (this can be a test to make sure the benchmark code is generic enough to onboard other models!).

This code can go through in this PR; then I can add Flux to the benchmark tests in iree-test-suites and do some cleanup here in another PR.

Contributor Author:

@geomin12, do you suggest we migrate the generic parts of this PR to iree-test-suites?
The important element here is the use of Google Benchmark's compare.py to do a statistically sound comparison of benchmark results. I think just comparing the mean and median is not good enough. Doing this sort of comparison requires keeping the complete baseline benchmark with all repetition results, which may be a lot of data, since you need 20-30 runs to reduce variance.
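
For illustration, a hedged sketch of recording such a baseline with every repetition kept (module, function, and device names are placeholders; the benchmark flags follow Google Benchmark's conventions as exposed by iree-benchmark-module):

# Hypothetical sketch: produce a baseline result file that keeps all 30
# repetitions so compare.py can run its U test against it later.
import subprocess

subprocess.run(
    [
        "iree-benchmark-module",
        "--device=hip://0",
        "--module=flux_transformer.vmfb",
        "--function=forward_bs1",
        # plus any --input= flags the benchmarked function needs
        "--benchmark_repetitions=30",
        "--benchmark_out_format=json",
        "--benchmark_out=flux_transformer_baseline_benchmark_result_mi300x.json",
    ],
    check=True,
)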

geomin12:

Yes! I've been working on this in iree-test-suites, so it will be a lot simpler to onboard a model and start doing threshold tests.

I've got some work to do to include Hugging Face and pulling models from there, and I can also incorporate the more sophisticated compare.py tool. Right now, in that repo, we just use golden time/dispatch/size.

I think for now this can stay here (so it doesn't block this PR); then I can work on migrating the Flux benchmarks over to iree-test-suites.

sogartar (Contributor Author) commented on Feb 15, 2025:

@geomin12, roughly when do you expect to be able to port this benchmarking?
@ScottTodd, should we merge this here in the meantime?

geomin12:

will be done next week!

Member:

> @ScottTodd, should we merge this here in the meantime?

I think this new code is reasonably tightly scoped to flux... but, I do worry about test/benchmark infrastructure fragmentation during a period when we're already churning with some other in-progress refactors. I guess what I'm really looking for across https://github.com/nod-ai/shark-ai is a high level plan for each model test/benchmark and a person or group driving that plan.

Comment on lines 30 to 34
if process_result.returncode != 0:
    raise RuntimeError(f"stderr:\n{err}\nstdout:\n{out}")

if err != "":
    print(err, file=sys.stderr)
Member:

We have other test suites that log the compile/run/benchmark commands before issuing them. I don't see that done here, but maybe it should be. Otherwise, these error messages would be missing some context.

You could also use logging instead of print() and use debug/info/warning/error levels if you expect that logging to be noisy enough that a developer might want to silence some parts of it.
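
A minimal sketch of that suggestion (the helper name and log levels are illustrative):

# Log each command before running it and route subprocess output through the
# logging module instead of print().
import logging
import shlex
import subprocess

logger = logging.getLogger(__name__)


def run_tool(cmd: list[str]) -> str:
    logger.info("Running: %s", shlex.join(cmd))
    process_result = subprocess.run(cmd, capture_output=True, text=True)
    out, err = process_result.stdout, process_result.stderr
    if process_result.returncode != 0:
        raise RuntimeError(
            f"Command failed: {shlex.join(cmd)}\nstderr:\n{err}\nstdout:\n{out}"
        )
    if err:
        logger.warning("stderr from %s:\n%s", cmd[0], err)
    return out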

Contributor Author:

I improved logging and the exception message.

@@ -17,12 +17,14 @@
    export_flux_transformer_from_hugging_face,
    export_flux_transformer,
    import_flux_transformer_dataset_from_hugging_face,
    iree_compile_flags,
Member:

Keep the compile flags in this test file, or maybe add a compile_model.py alongside export.py so there is a standard way of compiling the model that the test can call into.

See also my other comment about the tools requiring less effort to use, along with issues like #691. Generally speaking, tests should be simple: https://testing.googleblog.com/2014/07/testing-on-toilet-dont-put-logic-in.html. If the test isn't simple, the system being tested should be changed first. The software we ship (sharktank / shortfin / shark-ai) should know how to compile Flux. These flags in a test show that the software still needs help with that.

Contributor Author:

I agree that refactoring is required, but I would prefer not to include it in this test.

@@ -21,6 +22,8 @@
from ..types import *
from .math import cosine_similarity

is_mi300x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx942'")
Member:

Moving this into the utils file is nice. Could pull that out into its own PR if you want it merged sooner.

Could add similar marks for other cases too.
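
For example, a hypothetical companion mark (the gfx90a / MI250-class target string is an assumption for illustration):

import pytest

is_mi300x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx942'")
# Hypothetical additional mark for MI250-class runners.
is_mi250x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx90a'")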

matrix:
version: [3.11]
fail-fast: false
runs-on: llama-mi300x-3
Member:

We're in the process of migrating workflows to a new cluster of runner machines (e.g. #971). Might want to sync this with that line of work.

Comment on lines +53 to +66
- name: Install pip deps
  run: |
    source ${VENV_DIR}/bin/activate
    python -m pip install --no-compile --upgrade pip

    # Note: We install in three steps in order to satisfy requirements
    # from non default locations first.
    pip install --no-compile -r pytorch-cpu-requirements.txt
    pip install -r requirements-iree-unpinned.txt
    pip install --no-compile \
      -r sharktank/requirements-tests.txt \
      -e sharktank/

    pip freeze
Member:

Eventually we should get to the point where nightly jobs use the nightly packages directly, rather than building packages from source like this with pip install -e.


ScottTodd pushed a commit to iree-org/iree-test-suites that referenced this pull request Feb 20, 2025
Migrating [threshold tests](https://github.com/iree-org/iree/tree/main/experimental/regression_suite/shark-test-suite-models) and [benchmark tests](https://github.com/iree-org/iree/tree/main/experimental/benchmarks/sdxl) to this repository and also making improvements

Improvements:
- Made a generalized python script for both threshold tests and benchmark tests, so for current and future models/submodels, you just need to add/update a JSON file instead of duplicating code
- Resolving [issue 19984](iree-org/iree#19984) by specifying file path [here](https://github.com/geomin12/iree-test-suites/blob/da44d58a316018c2a87da182609ca4819e2cd50a/sharktank_models/test_suite/regression_tests/sdxl/punet_int8_fp8.json#L69) and correctly getting the right path [here](https://github.com/geomin12/iree-test-suites/blob/da44d58a316018c2a87da182609ca4819e2cd50a/sharktank_models/test_suite/regression_tests/test_model_threshold.py#L93)
- Improving / creating local development script to run these tests

Things left to do in future PR(s):
- Creating github action script that runs everything (in a midnight run or PR checks) and runs specific selected models through an action workflow_dispatch
- Integrating Hugging Face model retrieval, adding Google's benchmark script and onboarding flux, according to this [PR](nod-ai/shark-ai#870 (comment))

---------

Signed-off-by: Min <[email protected]>