Add Flux transformer benchmarking #870
base: main
Conversation
Force-pushed 23e4919 to 913e9c0
This PR depends on iree-org/iree-turbine#418.
Force-pushed 913e9c0 to 47e648a
I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.
.github/workflows/ci-sharktank.yml
Outdated
pytest -n 4 sharktank/ --durations=10
pytest \
  -n 4 \
  --durations=10 \
  -m "not expensive" \
  sharktank/
Be careful when adding new tests, particularly if they are expensive. Developers should be able to run pytest
and expect a reasonable default set of tests to run. Developers will not remember opt-out marks like this.
The filtering also didn't work here? https://github.com/nod-ai/shark-ai/actions/runs/13079961026/job/36501056330?pr=870 is still running on standard github-hosted runners after 2h+.
The test failure is due to this required PR iree-org/iree-turbine#418.
I changed the default marker selection in pyproject.toml to exclude expensive.
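A minimal sketch of the resulting opt-out pattern (the exact pyproject.toml contents here are assumed, not copied from this PR):

# Assumed [tool.pytest.ini_options] configuration in pyproject.toml:
#   markers = ["expensive: long-running tests and benchmarks, excluded by default"]
#   addopts = "-m 'not expensive'"
import pytest

@pytest.mark.expensive
def test_flux_transformer_benchmark():
    # Only runs when selected explicitly, e.g. `pytest -m expensive`.
    ...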
[submodule "third_party/benchmark"]
    path = third_party/benchmark
    url = https://github.com/google/benchmark
Is this actually used? You have google-benchmark in Python test requirements already.
A source dependency on the C++ library could be added to shortfin as needed, probably via FetchContent here:
shark-ai/shortfin/CMakeLists.txt
Lines 326 to 340 in 4eac34e
if(SHORTFIN_BUILD_TESTS)
  if (NOT SHORTFIN_BUNDLE_DEPS AND NOT SHORTFIN_IREE_SOURCE_DIR)
    # For now we use gtest shipped alongside with IREE.
    FetchContent_Declare(
      googletest
      URL https://github.com/google/googletest/archive/03597a01ee50ed33e9dfd640b249b4be3799d395.zip
    )
    # For Windows: Prevent overriding the parent project's compiler/linker settings
    set(gtest_force_shared_crt ON CACHE BOOL "" FORCE)
    FetchContent_MakeAvailable(googletest)
  endif()
  include(GoogleTest)
  enable_testing()
  add_custom_target(shortfin_testdata_deps)
endif()
I mistakenly left the pip package dependency in; it is removed now. Unfortunately, the script that compares benchmark results is not part of the pip package, and that script is used in the Python test. See https://github.com/google/benchmark/blob/main/docs/tools.md#comparepy. That is why the submodule is used.
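As a rough sketch of how the submodule's compare.py gets driven (the contender file name is an assumption; the baseline name matches the checked-in JSON mentioned later in this thread):

# Invoke Google Benchmark's comparison tool from the git submodule.
import subprocess
import sys

subprocess.run(
    [
        sys.executable,
        "third_party/benchmark/tools/compare.py",
        "benchmarks",
        "flux_transformer_baseline_benchmark_result_mi300x.json",
        "flux_transformer_benchmark_result.json",  # assumed contender file name
    ],
    check=True,
)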
Ah, gotcha. Unless we standardize on the googlebenchmark source across this project, I'd hesitate to add the new dependency here since it adds complexity (these benchmarks use these python libraries, these other benchmarks use a git submodule, etc.).
I'd much prefer we consolidate existing nightly workflows (eval, sglang benchmark, llama large, etc.) before adding a new one. That being said, I like the "sharktank-nightly" name better than model-specific names...
I think the CI jobs are due for a refactoring to reorganize them and reduce code duplication. I want to do that next.
The advantage of keeping them separate is the ease of tracking the various nightlies across LLM and Flux models and sharktank/shortfin regressions. If there are tools to do this, or a CI summary on GitHub Pages, that would be great.
--verbose \
--capture=no \
--iree-hip-target=gfx942 \
--iree-device=hip://6 \
Match what other workflows do with hip devices: #725
Do we actually allocate GPUs to CI jobs?
I changed it to use hip://0.
Thanks. Generally, CI jobs should be written to run on any compatible runner machine and be easy to reference and run locally. The runners themselves should be responsible for assigning devices.
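One way to express that (option and environment variable names here are assumptions for illustration) is to give the test a sensible default device that the runner can override:

# conftest.py-style sketch: default to hip://0, let the runner override it.
import os

def pytest_addoption(parser):
    parser.addoption(
        "--iree-device",
        default=os.environ.get("IREE_DEVICE", "hip://0"),
        help="IREE device URI to run benchmarks on; assigned by the runner.",
    )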
--html=out/benchmark/index.html \
sharktank/tests

- name: Deploy to GitHub Pages
  uses: peaceiris/actions-gh-pages@4f9cc6602d3f66b9c108549d475ec49e8ef4d45e # v4.0.0
  with:
    github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
    publish_dir: ./out/benchmark
    destination_dir: ./benchmark
"benchmark" isn't a very specific name, and it also doesn't match "sharktank nightly". Consider how you want this to slot in to
Here is a PR for the GH pages #899.
with:
  github_token: ${{ secrets.SHARK_PLATFORM_GH_TOKEN }}
  publish_dir: ./out/sharktank-nightly
  destination_dir: ./sharktank-nightly
I believe publish_dir should be ./out/sharktank_nightly/benchmark, and the same for destination_dir. Also, we need to update sharktank-nightly to sharktank_nightly.
I wanted to have one step that pushes all sharktank_nightly artifacts. In the future this workflow may have more jobs that produce artifacts. I would be surprised if the action does not support whole directory trees.
Sure, were you able to test this nightly on this PR?
The workflow is not yet available and I can't trigger it manually. I will trigger it once it is on main.
Force-pushed 88aad03 to f88784c
iree_compile_flags = [
    "--iree-hal-target-device=hip",
    "--iree-hip-target=gfx942",
Why does export.py include flags for compiling? I would expect this file to only export from huggingface / PyTorch / etc. to MLIR and not go further.
If including flags, do not overspecialize on a single target (hip) or type of device (gfx942). We should also be reducing these lists of flags down to only the most essential (the target backend and feature set), so putting 15 extra flags this far up the pipeline will make it more difficult to scale this work across other targets and models.
I moved the args to a new compile.py file. I agree that refactoring the compilation process is needed, but I think it is outside the scope of this PR. The model is currently being tuned and these flags are a first best guess as to what would be good.
I guess that works for now. What you have in compile.py could also be a compile_flags.txt flagfile for use with --flagfile={PATH}.
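To illustrate the flagfile idea (the file path and model file names here are assumptions, not part of this PR):

# Keep target-specific flags in a flagfile and pass it to iree-compile,
# following the --flagfile suggestion above.
import subprocess
from pathlib import Path

flagfile = Path("sharktank/models/flux/compile_flags.txt")  # assumed location
subprocess.run(
    [
        "iree-compile",
        f"--flagfile={flagfile}",
        "flux_transformer.mlir",
        "-o",
        "flux_transformer.vmfb",
    ],
    check=True,
)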
benchmark_compare_result = iree_benchmark_compare(
    [
        f"--dump_to_json={benchmark_compare_result_file_path}",
        "benchmarks",
        str(baseline_benchmark_result_file_path),
        benchmark_result_file_path,
    ]
)
print(benchmark_compare_result)
Please share logs of this running somewhere prominent, like the pull request description. It's hard to judge the value-to-cost ratio of the third_party/benchmark dep for compare.py and the 500+ line flux_transformer_baseline_benchmark_result_mi300x.json file without that.
I added an example output to the PR description.
I think compare.py is essential to do a proper comparison that determines whether two benchmarks come from the same distribution.
I am not sure if we should put the benchmarks in this repo or at https://github.com/nod-ai/SHARK-TestSuite.
@geomin12 has also started refactoring the benchmarks from https://github.com/iree-org/iree/tree/main/experimental/regression_suite/shark-test-suite-models by moving them to https://github.com/iree-org/iree-test-suites (possibly in this subfolder: https://github.com/iree-org/iree-test-suites/tree/main/sharktank_models).
Given the similarity between FLUX and SDXL, we may want some of this infrastructure to move there too. Starting directly next to the code in this nod-ai/shark-ai repository probably makes sense in the near term, then as model support matures we can move some code around.
I'm mainly concerned about there being a proliferation of different test and benchmark implementations.
- There should be one well documented way to export, compile, run, and benchmark each model.
- There should be one central location to view the latest test and benchmark results across all projects.
- Any developer or user should be able to reproduce test or benchmark results on their own hardware.
I think the way we get there is to streamline our export/compile/run/etc. flows and keep our release process running smoothly.
Here are a few parts of vLLM's benchmark suite to reference:
- https://github.com/vllm-project/vllm/tree/main/.buildkite/nightly-benchmarks
- https://github.com/vllm-project/vllm/blob/main/.buildkite/nightly-benchmarks/tests/latency-tests.json
- https://github.com/vllm-project/vllm/blob/main/.buildkite/run-benchmarks.sh
Note that test cases only specify a model name in the standard Hugging Face repository naming style, and the project then knows how to serve that model. We should be able to do the same for SDXL, Flux, Llama, etc.
If we build a serving-focused benchmark suite like vLLM's focused on datacenter GPUs, I could see it living in nod-ai/shark-ai or nod-ai/SHARK-TestSuite. If we want to support other backends or benchmark finer grained details in the IREE compiler and runtime then maybe iree-org/iree-test-suites would be better 🤔.
To be clear, I am quite happy to see more benchmarks being worked on and added. I'm just worried about there being 7 different test suites, benchmark suites, etc. all working in slightly different ways. Each one has its own advantages and disadvantages so I think we should learn what we can from each and then work to consolidate somewhere. Just need to make those decisions and stand by them.
Let me take a look at this code and see if there are any ways to combine (from a quick skim, it seems almost identical: compile -> benchmark -> compare results)! I'll start writing up the generic benchmark tests and then find a simple way to get Flux on there (this can be a test to make sure the benchmark code is generic enough to onboard other models!).
This code can land through this PR, then I can add Flux to the benchmark tests in iree-test-suites and do some cleanup here in another PR.
@geomin12, do you suggest we migrate the generic parts of this PR to iree-test-suites?
The important element here is the use of Google Benchmark's compare.py to do a statistically sound comparison of benchmark results; I think just comparing mean and median is not good enough. Doing this sort of comparison requires keeping the complete baseline benchmark with all repetition results, which may be a lot of data since you need 20-30 runs to reduce variance.
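For context, a rough sketch (flag values, function name, and output file name are illustrative assumptions) of collecting a baseline with enough repetitions for that kind of comparison:

# Run the benchmark many times and keep the full per-repetition results
# so compare.py can later run its statistical test against them.
# The --input flags required by the function signature are omitted here.
import subprocess

subprocess.run(
    [
        "iree-benchmark-module",
        "--module=flux_transformer.vmfb",
        "--function=forward",
        "--device=hip://0",
        "--benchmark_repetitions=30",
        "--benchmark_out_format=json",
        "--benchmark_out=flux_transformer_baseline_benchmark_result_mi300x.json",
    ],
    check=True,
)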
Yes! I've been working on this in iree-test-suites, so it will be a lot simpler to onboard a model and start doing threshold tests.
I've got some work to do to include Hugging Face and pull models from there, and I can also incorporate the more sophisticated compare.py tool. Right now, in that repo, we just use golden time/dispatch/size.
I think for now this can stay here (so it doesn't block this PR), then I can work on migrating the Flux benchmarks over to iree-test-suites.
@geomin12, roughly when do you expect to be able to port this benchmarking?
@ScottTodd, should we merge this here in the mean time?
will be done next week!
@ScottTodd, should we merge this here in the mean time?
I think this new code is reasonably tightly scoped to flux... but, I do worry about test/benchmark infrastructure fragmentation during a period when we're already churning with some other in-progress refactors. I guess what I'm really looking for across https://github.com/nod-ai/shark-ai is a high level plan for each model test/benchmark and a person or group driving that plan.
if process_result.returncode != 0:
    raise RuntimeError(f"stderr:\n{err}\nstdout:\n{out}")

if err != "":
    print(err, file=sys.stderr)
We have other test suites that log the compile/run/benchmark commands before issuing them. I don't see that done here, but maybe it should be. Otherwise, these error messages would be missing some context.
You could also use logging instead of print() and use debug/info/warning/error levels if you expect the logging to be noisy enough that a developer might want to silence some parts of it.
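A small sketch of both suggestions combined (function and variable names are illustrative, not from this PR):

# Log each command before running it and route stderr through logging.
import logging
import shlex
import subprocess

logger = logging.getLogger(__name__)

def run_logged(args: list[str]) -> subprocess.CompletedProcess:
    logger.info("Running: %s", shlex.join(args))
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"Command failed: {shlex.join(args)}\n"
            f"stderr:\n{result.stderr}\nstdout:\n{result.stdout}"
        )
    if result.stderr:
        logger.warning(result.stderr)
    return result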
I improved logging and the exception message.
@@ -17,12 +17,14 @@
    export_flux_transformer_from_hugging_face,
    export_flux_transformer,
    import_flux_transformer_dataset_from_hugging_face,
    iree_compile_flags,
Keep the compile flags in this test file, or maybe add a compile_model.py alongside export.py so there is a standard way of compiling the model that the test can call into.
See also my other comment about the tools requiring less effort to use, along with issues like #691. Generally speaking, tests should be simple: https://testing.googleblog.com/2014/07/testing-on-toilet-dont-put-logic-in.html. If the test isn't simple, the system being tested should be changed first. The software we ship (sharktank / shortfin / shark-ai) should know how to compile Flux. These flags in a test show that the software still needs help with that.
I agree that refactoring is required, but I would prefer not to include it in this test.
@@ -21,6 +22,8 @@
from ..types import *
from .math import cosine_similarity

is_mi300x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx942'")
Moving this into the utils file is nice. Could pull that out into its own PR if you want it merged sooner.
Could add similar marks for other cases too.
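For example, similar marks for other targets might look like this (the gfx90a target string for MI250X is an assumption, not part of this PR):

import pytest

# Skip unless the selected HIP target matches the hardware the test expects.
is_mi300x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx942'")
is_mi250x = pytest.mark.skipif("config.getoption('iree_hip_target') != 'gfx90a'")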
Force-pushed f88784c to 1c5c28c
matrix:
  version: [3.11]
fail-fast: false
runs-on: llama-mi300x-3
We're in the process of migrating workflows to a new cluster of runner machines (e.g. #971). Might want to sync this with that line of work.
- name: Install pip deps
  run: |
    source ${VENV_DIR}/bin/activate
    python -m pip install --no-compile --upgrade pip

    # Note: We install in three steps in order to satisfy requirements
    # from non default locations first.
    pip install --no-compile -r pytorch-cpu-requirements.txt
    pip install -r requirements-iree-unpinned.txt
    pip install --no-compile \
      -r sharktank/requirements-tests.txt \
      -e sharktank/

    pip freeze
Eventually we should get to the point where nightly jobs use nightly packages directly, rather than building packages from source like this with pip install -e.
Migrating [threshold tests](https://github.com/iree-org/iree/tree/main/experimental/regression_suite/shark-test-suite-models) and [benchmark tests](https://github.com/iree-org/iree/tree/main/experimental/benchmarks/sdxl) to this repository and also making improvements.

Improvements:
- Made a generalized python script for both threshold tests and benchmark tests, so for current and future models/submodels, you just need to add/update a JSON file instead of duplicating code
- Resolving [issue 19984](iree-org/iree#19984) by specifying the file path [here](https://github.com/geomin12/iree-test-suites/blob/da44d58a316018c2a87da182609ca4819e2cd50a/sharktank_models/test_suite/regression_tests/sdxl/punet_int8_fp8.json#L69) and correctly getting the right path [here](https://github.com/geomin12/iree-test-suites/blob/da44d58a316018c2a87da182609ca4819e2cd50a/sharktank_models/test_suite/regression_tests/test_model_threshold.py#L93)
- Improving / creating local development script to run these tests

Things left to do in future PR(s):
- Creating github action script that runs everything (in a midnight run or PR checks) and runs specific selected models through an action workflow_dispatch
- Integrating Hugging Face model retrieval, adding Google's benchmark script and onboarding flux, according to this [PR](nod-ai/shark-ai#870 (comment))

Signed-off-by: Min <[email protected]>
Also add some more general functionality to check benchmark results against baseline results. This requires the Google Benchmark compare.py tool, which is not part of the pip package; that is why I added the repo as a git submodule. The tool does a statistical comparison between benchmarks with a proper p-value calculation, and I don't think we should roll our own.
Add a new nightly CI job that should contain nightly tests and benchmarks that do not have their own category the way Llama does now.
Example log from a benchmark comparison: