Add Flux transformer benchmarking #870

Open · wants to merge 7 commits into main from flux-transformer-benchmark
Conversation

sogartar (Contributor) commented on Jan 27, 2025

Also add some more general functionality to check benchmark results against baseline results. This requires the Google Benchmark compare.py tool, which is not part of the pip package; that is why I added the repo as a git submodule. The tool does a statistical comparison between benchmarks with a proper p-value calculation. I don't think we should roll our own.

Adds a new nightly CI job that should contain nightly tests and benchmarks that do not have their own category, the way Llama does now.

Example log from a benchmark comparison:

Comparing /home/bpetkant/ws/sharktank/repo/sharktank/tests/models/flux/flux_transformer_baseline_benchmark_result_mi300x.json to /home/bpetkant/ws/sharktank/experiments/flux/benchmark/benchmark_result.json
Benchmark                                                      Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------
BM_forward_bs1/process_time/real_time                       +0.0451         +0.0932           549           574           564           616
...
BM_forward_bs1/process_time/real_time                       +0.2122         +0.2508           554           672           569           711
BM_forward_bs1/process_time/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 30 vs 30
BM_forward_bs1/process_time/real_time_mean                  +0.0457         +0.0922           552           577           567           619
BM_forward_bs1/process_time/real_time_median                +0.0400         +0.0870           552           574           567           616
BM_forward_bs1/process_time/real_time_stddev               +12.3302        +11.8167             1            18             1            17
BM_forward_bs1/process_time/real_time_cv                   +11.7470        +10.7347             0             0             0             0
OVERALL_GEOMEAN                                             +0.0453         +0.0918             1             1             1             1

...

E           AssertionError: Benchmark contender mean real time 577.4150719943767 ms is worse than baseline 552.1548928634729 ms.
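
For illustration only (this is not the code from the PR), a minimal sketch of the kind of check behind the assertion above, assuming Google Benchmark's JSON output format and placeholder file names; the PR itself relies on compare.py's U test rather than a bare mean comparison:

# Illustrative sketch, not the PR's implementation: read two Google Benchmark
# JSON result files and fail when the contender's mean real time regresses.
import json


def mean_real_time_ms(result_path: str) -> float:
    with open(result_path) as f:
        result = json.load(f)
    # Google Benchmark marks aggregate rows with an "aggregate_name" field.
    for entry in result["benchmarks"]:
        if entry.get("aggregate_name") == "mean":
            return entry["real_time"]
    raise ValueError(f"No mean aggregate found in {result_path}")


baseline_mean = mean_real_time_ms("flux_transformer_baseline_benchmark_result_mi300x.json")
contender_mean = mean_real_time_ms("benchmark_result.json")
assert contender_mean <= baseline_mean, (
    f"Benchmark contender mean real time {contender_mean} ms "
    f"is worse than baseline {baseline_mean} ms."
)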

@sogartar sogartar force-pushed the flux-transformer-benchmark branch from 23e4919 to 913e9c0 Compare January 27, 2025 14:34
sogartar (Contributor Author):

This PR depends on iree-org/iree-turbine#418.

@sogartar sogartar marked this pull request as ready for review January 27, 2025 14:39
@sogartar sogartar marked this pull request as draft January 27, 2025 15:03
@sogartar sogartar force-pushed the flux-transformer-benchmark branch from 913e9c0 to 47e648a Compare January 27, 2025 15:11
@sogartar sogartar marked this pull request as ready for review January 27, 2025 15:25
sogartar (Contributor Author):

I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.

Comment on lines 88 to 91
pytest -n 4 sharktank/ --durations=10
pytest \
  -n 4 \
  --durations=10 \
  -m "not expensive" \
  sharktank/
Member:

Be careful when adding new tests, particularly if they are expensive. Developers should be able to run pytest and expect a reasonable default set of tests to run. Developers will not remember opt-out marks like this.

The filtering also didn't work here? https://github.com/nod-ai/shark-ai/actions/runs/13079961026/job/36501056330?pr=870 is still running on standard github-hosted runners after 2h+.

Contributor Author:

The test failure is due to this required PR iree-org/iree-turbine#418.

Contributor Author:

I changed the default marker selection in pyproject.toml to exclude tests marked as expensive.

Comment on lines +1 to +3
[submodule "third_party/benchmark"]
path = third_party/benchmark
url = https://github.com/google/benchmark
Member:

Is this actually used? You have google-benchmark in Python test requirements already.

A source dependency on the C++ library could be added to shortfin as needed, probably via FetchContent here:

if(SHORTFIN_BUILD_TESTS)
  if (NOT SHORTFIN_BUNDLE_DEPS AND NOT SHORTFIN_IREE_SOURCE_DIR)
    # For now we use gtest shipped alongside with IREE.
    FetchContent_Declare(
      googletest
      URL https://github.com/google/googletest/archive/03597a01ee50ed33e9dfd640b249b4be3799d395.zip
    )
    # For Windows: Prevent overriding the parent project's compiler/linker settings
    set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
    FetchContent_MakeAvailable(googletest)
  endif()
  include(GoogleTest)
  enable_testing()
  add_custom_target(shortfin_testdata_deps)
endif()

Contributor Author:

I mistakenly left the pip package dependency in; it is removed now. Unfortunately, the script to compare benchmark results is not part of the pip package, and this script is used in the Python test.
See https://github.com/google/benchmark/blob/main/docs/tools.md#comparepy
That is why the submodule is used.
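
For reference, a minimal sketch (paths and details are assumptions, not necessarily how this PR wires up its helper) of invoking the submodule's compare.py from Python:

# Hypothetical sketch: shell out to compare.py from the third_party/benchmark
# submodule. The repository-root computation is an assumption about layout.
import subprocess
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[1]
COMPARE_PY = REPO_ROOT / "third_party" / "benchmark" / "tools" / "compare.py"


def iree_benchmark_compare(args: list[str]) -> str:
    """Run compare.py with the given arguments and return its stdout."""
    process = subprocess.run(
        [sys.executable, str(COMPARE_PY), *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return process.stdout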

Member:

Ah, gotcha. Unless we standardize on the googlebenchmark source across this project, I'd hesitate to add the new dependency here since it adds complexity (these benchmarks use these python libraries, these other benchmarks use a git submodule, etc.).

Member:

I'd much prefer we consolidate existing nightly workflows (eval, sglang benchmark, llama large, etc.) before adding a new one. That being said, I like the "sharktank-nightly" name better than model-specific names...

Contributor Author:

I think the CI jobs are due for a refactoring to reorganize them and reduce code duplication. I want to do that next.

Collaborator:

The advantage of keeping them separate is the ease of tracking the various nightlies across the LLM and Flux models and the sharktank/shortfin regressions. If there are tools to do this, or a CI summary on GitHub Pages, that would be great.

--verbose \
--capture=no \
--iree-hip-target=gfx942 \
--iree-device=hip://6 \
Member:

Match what other workflows do with hip devices: #725

Contributor Author:

Do we actually allocate GPUs to CI jobs?

Contributor Author:

I changed it to use hip://0.

Member:

Thanks. Generally, CI jobs should be written to run on any compatible runner machine and be easy to reference and run locally. The runners themselves should be responsible for assigning devices.
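
As a hedged sketch of that direction (the IREE_DEVICE environment variable name here is an assumption, not an existing convention in this repo):

# Hypothetical conftest.py sketch: default to hip://0 and let the runner or
# environment override the device instead of hard-coding hip://6 in a workflow.
import os


def pytest_addoption(parser):
    parser.addoption(
        "--iree-device",
        action="store",
        default=os.environ.get("IREE_DEVICE", "hip://0"),
        help="IREE device URI; CI runners can override via IREE_DEVICE.",
    )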

Comment on lines 78 to 86
    --html=out/benchmark/index.html \
    sharktank/tests

- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@4f9cc6602d3f66b9c108549d475ec49e8ef4d45e # v4.0.0
  with:
    github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
    publish_dir: ./out/benchmark
    destination_dir: ./benchmark
Member:

"benchmark" isn't a very specific name, and it also doesn't match "sharktank nightly". Consider how you want this to slot in to

Contributor Author:

Here is a PR for the GitHub Pages setup: #899.

@sogartar sogartar requested a review from ScottTodd February 3, 2025 16:48
with:
  github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
  publish_dir: ./out/sharktank-nightly
  destination_dir: ./sharktank-nightly
Collaborator:

I believe publish_dir should be ./out/sharktank_nightly/benchmark, and the same for destination_dir.
Also, we need to update sharktank-nightly to sharktank_nightly.

sogartar (Contributor Author) commented on Feb 5, 2025:

I wanted to have one step that pushes all sharktank_nightly artifacts. This workflow may have more jobs that produce artifacts in the future. I would be surprised if the action does not support whole directory trees.

Collaborator:

Sure, were you able to test this nightly on this PR?

Contributor Author:

The workflow is not yet available and I can't trigger it manually. I will trigger it once it is on main.

@sogartar sogartar force-pushed the flux-transformer-benchmark branch 2 times, most recently from 88aad03 to f88784c Compare February 13, 2025 13:55
Comment on lines 20 to 22
iree_compile_flags = [
    "--iree-hal-target-device=hip",
    "--iree-hip-target=gfx942",
Member:

Why does export.py include flags for compiling? I would expect this file to only export from huggingface / PyTorch / etc. to MLIR and not go further.

If including flags, do not overspecialize on a single target (hip) or type of device (gfx942). We should also be reducing these lists of flags down to only the most essential (the target backend and feature set), so putting 15 extra flags this far up the pipeline will make it more difficult to scale this work across other targets and models.

Contributor Author:

I moved the args to a new compile.py file. I agree that refactoring the compilation process is needed, but I think it is outside the scope of this PR. The model is currently being tuned, and these flags are a first best guess at what will perform well.

Member:

I guess that works for now. What you have in compile.py could also be a compile_flags.txt flagfile for use with --flagfile={PATH}.
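
A minimal sketch of that flagfile idea (file names are placeholders, the flag list is abbreviated, and the --flagfile usage is as suggested above):

# Hypothetical sketch: dump the flags currently kept in compile.py to a
# flagfile and pass it to iree-compile. model.mlir / model.vmfb are placeholders.
import subprocess
from pathlib import Path

iree_compile_flags = [
    "--iree-hal-target-device=hip",
    "--iree-hip-target=gfx942",
    # ... remaining flags from compile.py
]

flagfile = Path("compile_flags.txt")
flagfile.write_text("\n".join(iree_compile_flags) + "\n")

subprocess.run(
    ["iree-compile", f"--flagfile={flagfile}", "model.mlir", "-o", "model.vmfb"],
    check=True,
)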

Comment on lines +54 to +60
benchmark_compare_result = iree_benchmark_compare(
    [
        f"--dump_to_json={benchmark_compare_result_file_path}",
        "benchmarks",
        str(baseline_benchmark_result_file_path),
        benchmark_result_file_path,
    ]
)
print(benchmark_compare_result)
Member:

Please share logs of this running somewhere prominent, like the pull request description. It's hard to judge the value to cost ratio of the third_party/benchmark dep for compare.py and the 500+ line flux_transformer_baseline_benchmark_result_mi300x.json file without that.

Contributor Author:

I added an example output to the PR description.
I think compare.py is essential for a proper comparison that determines whether two benchmark results come from the same distribution.

Member:

> I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.

@geomin12 has also started refactoring the benchmarks from https://github.com/iree-org/iree/tree/main/experimental/regression_suite/shark-test-suite-models by moving them to https://github.com/iree-org/iree-test-suites (possibly in this subfolder: https://github.com/iree-org/iree-test-suites/tree/main/sharktank_models).

Given the similarity between FLUX and SDXL, we may want some of this infrastructure to move there too. Starting directly next to the code in this nod-ai/shark-ai repository probably makes sense in the near term, then as model support matures we can move some code around.

I'm mainly concerned about there being a proliferation of different test and benchmark implementations.

  • There should be one well documented way to export, compile, run, and benchmark each model.
  • There should be one central location to view the latest test and benchmark results across all projects.
  • Any developer or user should be able to reproduce test or benchmark results on their own hardware.

I think the way we get there is to streamline our export/compile/run/etc. flows and keep our release process running smoothly.


Here are a few parts of vLLM's benchmark suite to reference:

Note that test cases only specify a model name in the standard Hugging Face repository naming style, and then the project knows how to serve that model. We should be able to do the same for SDXL, Flux, Llama, etc.

If we build a serving-focused benchmark suite like vLLM's focused on datacenter GPUs, I could see it living in nod-ai/shark-ai or nod-ai/SHARK-TestSuite. If we want to support other backends or benchmark finer grained details in the IREE compiler and runtime then maybe iree-org/iree-test-suites would be better 🤔.

Member:

To be clear, I am quite happy to see more benchmarks being worked on and added. I'm just worried about there being 7 different test suites, benchmark suites, etc. all working in slightly different ways. Each one has its own advantages and disadvantages so I think we should learn what we can from each and then work to consolidate somewhere. Just need to make those decisions and stand by them.

geomin12 commented on Feb 13, 2025:

Let me take a look at this code and see if there are any ways to combine (from a quick skim, it seems almost identical: compile -> benchmark -> compare results)! I'll start writing up the generic benchmark tests, then find a simple way to get Flux on there (this can be a test to make sure the benchmark code is generic enough to onboard other models!).

This code can go through in this PR; then I can add Flux to the benchmark tests in iree-test-suites and do some cleanup here in another PR.

Contributor Author:

@geomin12, do you suggest we migrate the generic parts of this PR to iree-test-suites?
The important element here is the use of Google Benchmark's compare.py to do a statistically sound comparison of benchmark results. I think just comparing the mean and median is not good enough. Doing this sort of comparison requires keeping the complete baseline benchmark with all repetition results, which may be a lot of data, since you need 20-30 runs to reduce variance.
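
For illustration, a hedged sketch of recording such a baseline with every repetition kept (module, function, and device names are placeholders; the benchmark flags follow Google Benchmark's conventions as exposed by iree-benchmark-module):

# Hypothetical sketch: produce a baseline result file that keeps all 30
# repetitions so compare.py can run its U test against it later.
import subprocess

subprocess.run(
    [
        "iree-benchmark-module",
        "--device=hip://0",
        "--module=flux_transformer.vmfb",
        "--function=forward_bs1",
        # plus any --input= flags the benchmarked function needs
        "--benchmark_repetitions=30",
        "--benchmark_out_format=json",
        "--benchmark_out=flux_transformer_baseline_benchmark_result_mi300x.json",
    ],
    check=True,
)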

geomin12:

Yes! I've been working on this in iree-test-suites, so it will be a lot simpler to onboard a model and start doing threshold tests.

I've got some work to do to include Hugging Face and pulling models from there, and I can also incorporate the more sophisticated compare.py tool. Right now, in that repo, we just use golden time/dispatch/size.

I think for now this can stay here (so it doesn't block this PR); then I can work on migrating the Flux benchmarks over to iree-test-suites.

sogartar (Contributor Author) commented on Feb 15, 2025:

@geomin12, roughly when do you expect to be able to port this benchmarking?
@ScottTodd, should we merge this here in the meantime?

geomin12:

will be done next week!

Member:

> @ScottTodd, should we merge this here in the meantime?

I think this new code is reasonably tightly scoped to flux... but, I do worry about test/benchmark infrastructure fragmentation during a period when we're already churning with some other in-progress refactors. I guess what I'm really looking for across https://github.com/nod-ai/shark-ai is a high level plan for each model test/benchmark and a person or group driving that plan.

Comment on lines 30 to 34
if process_result.returncode != 0:
    raise RuntimeError(f"stderr:\n{err}\nstdout:\n{out}")

if err != "":
    print(err, file=sys.stderr)
Member:

We have other test suites that log the compile/run/benchmark commands before issuing them. I don't see that done here, but maybe it should be. Otherwise, these error messages would be missing some context.

You could also use logging instead of print() and use debug/info/warning/error levels if you expect that logging to be noisy enough that a developer might want to silence some parts of it.
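
A minimal sketch of that suggestion (the helper name and log levels are illustrative):

# Log each command before running it and route subprocess output through the
# logging module instead of print().
import logging
import shlex
import subprocess

logger = logging.getLogger(__name__)


def run_tool(cmd: list[str]) -> str:
    logger.info("Running: %s", shlex.join(cmd))
    process_result = subprocess.run(cmd, capture_output=True, text=True)
    out, err = process_result.stdout, process_result.stderr
    if process_result.returncode != 0:
        raise RuntimeError(
            f"Command failed: {shlex.join(cmd)}\nstderr:\n{err}\nstdout:\n{out}"
        )
    if err:
        logger.warning("stderr from %s:\n%s", cmd[0], err)
    return out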

Contributor Author:

I improved logging and the exception message.

@@ -17,12 +17,14 @@
    export_flux_transformer_from_hugging_face,
    export_flux_transformer,
    import_flux_transformer_dataset_from_hugging_face,
    iree_compile_flags,
Member:

Keep the compile flags in this test file, or maybe add a compile_model.py alongside export.py so there is a standard way of compiling the model that the test can call into.

See also my other comment about the tools requiring less effort to use, along with issues like #691. Generally speaking, tests should be simple: https://testing.googleblog.com/2014/07/testing-on-toilet-dont-put-logic-in.html. If the test isn't simple, the system being tested should be changed first. The software we ship (sharktank / shortfin / shark-ai) should know how to compile Flux. These flags in a test show that the software still needs help with that.

Contributor Author:

I agree that refactoring is required, but I would prefer not to include it in this test.

@@ -21,6 +22,8 @@
from ..types import *
from .math import cosine_similarity

is_mi300x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx942'")
Member:

Moving this into the utils file is nice. Could pull that out into its own PR if you want it merged sooner.

Could add similar marks for other cases too.
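
For example, a hypothetical companion mark (the gfx90a / MI250-class target string is an assumption for illustration):

import pytest

is_mi300x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx942'")
# Hypothetical additional mark for MI250-class runners.
is_mi250x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx90a'")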

matrix:
version: [3.11]
fail-fast: false
runs-on: llama-mi300x-3
Member:

We're in the process of migrating workflows to a new cluster of runner machines (e.g. #971). Might want to sync this with that line of work.

Comment on lines +53 to +66
- name: Install pip deps
  run: |
    source ${VENV_DIR}/bin/activate
    python -m pip install --no-compile --upgrade pip

    # Note: We install in three steps in order to satisfy requirements
    # from non default locations first.
    pip install --no-compile -r pytorch-cpu-requirements.txt
    pip install -r requirements-iree-unpinned.txt
    pip install --no-compile \
      -r sharktank/requirements-tests.txt \
      -e sharktank/

    pip freeze
Member:

Eventually we should get to the point where nightly jobs use the nightly packages directly, rather than building packages from source like this with pip install -e.


ScottTodd pushed a commit to iree-org/iree-test-suites that referenced this pull request Feb 20, 2025
Migrating [threshold tests](https://github.com/iree-org/iree/tree/main/experimental/regression_suite/shark-test-suite-models) and [benchmark tests](https://github.com/iree-org/iree/tree/main/experimental/benchmarks/sdxl) to this repository and also making improvements

Improvements:
- Made a generalized python script for both threshold tests and benchmark tests, so for current and future models/submodels, you just need to add/update a JSON file instead of duplicating code
- Resolving [issue 19984](iree-org/iree#19984) by specifying file path [here](https://github.com/geomin12/iree-test-suites/blob/da44d58a316018c2a87da182609ca4819e2cd50a/sharktank_models/test_suite/regression_tests/sdxl/punet_int8_fp8.json#L69) and correctly getting the right path [here](https://github.com/geomin12/iree-test-suites/blob/da44d58a316018c2a87da182609ca4819e2cd50a/sharktank_models/test_suite/regression_tests/test_model_threshold.py#L93)
- Improving / creating local development script to run these tests

Things left to do in future PR(s):
- Creating github action script that runs everything (in a midnight run or PR checks) and runs specific selected models through an action workflow_dispatch
- Integrating Hugging Face model retrieval, adding Google's benchmark script and onboarding flux, according to this [PR](nod-ai/shark-ai#870 (comment))

---------

Signed-off-by: Min <[email protected]>