Continuing from the previous issue, IREE Issue 19115.
The model quality tests and benchmark tests have been migrated from the IREE repo to this repo and improved over the previous iteration. However, there is always room for more improvement!
The following tasks will help improve the model testing and benchmarking:
Compiler Tasks:
Remove the need for complex compiler flag configurations. Maintaining a long list of flags that varies with the submodel and data type of the model makes the testing hard to scale and hard for developers to debug or reproduce (a sketch of this flag problem is shown after the compiler tasks).
Remove the need for a spec file for tuning (in progress here: IREE Issue 20044). Currently we require a ~300-line spec file from the codegen side to get performance on our halo models. We should find a way to enable these optimizations by default in the compiler.
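As an illustration of the flag-configuration problem above, the sketch below shows the kind of per-model, per-dtype table the suite ends up maintaining. The model names and most flag names here are placeholders, not the suite's actual configuration.

```python
# Illustrative only: the entries below are placeholders, not the exact flags
# used by this suite. The point is the shape of the problem: every
# (submodel, dtype) pair carries its own hand-maintained flag list.
COMPILE_FLAGS = {
    ("sdxl-unet", "fp16"): [
        "--iree-hal-target-backends=rocm",           # real IREE flag
        "--iree-example-unet-specific-option=true",  # hypothetical
    ],
    ("sdxl-unet", "fp8"): [
        "--iree-hal-target-backends=rocm",
        "--iree-example-fp8-workaround=true",        # hypothetical
    ],
    # ...one entry per submodel/dtype, which is what makes this hard to scale
}

def compile_cmd(model: str, dtype: str, mlir_path: str) -> list[str]:
    """Build an iree-compile invocation from the per-model flag table."""
    return ["iree-compile", mlir_path, *COMPILE_FLAGS[(model, dtype)]]
```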
Testing Tasks:
Currently, there is boilerplate code to download each file that we need from Azure. Switch to an array of some sort for each halo model (example).
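A minimal sketch of what that could look like, assuming the files live under a single public Azure blob container; the base URL and file names below are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical container path for one halo model.
AZURE_BASE_URL = "https://example.blob.core.windows.net/halo-models/llama3.1-8b"

# One list per halo model instead of one download block per file.
ARTIFACTS = [
    "model.mlirbc",
    "real_weights.irpa",
    "inputs.npy",
    "expected_outputs.npy",
]

def fetch_artifacts(dest_dir: Path) -> list[Path]:
    """Download every listed artifact into dest_dir, skipping files already present."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in ARTIFACTS:
        target = dest_dir / name
        if not target.exists():  # simple cache so reruns skip the download
            urlretrieve(f"{AZURE_BASE_URL}/{name}", str(target))
        paths.append(target)
    return paths
```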
Figure out the intake path for PyTorch models and set it up similarly to the ONNX path.
Figure out how we want to store all the model artifacts in this test suite. The artifacts we intake (MLIR, ONNX files, etc.) should all be stored on Hugging Face, with a directory for each model. Each Hugging Face directory should have a short README describing where the model is from and what it entails (a layout sketch is shown below).
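A sketch of the proposed layout and retrieval path follows; the repository id, directory names, and file names are hypothetical, and huggingface_hub is used only as one possible way to fetch the artifacts.

```python
# Proposed per-model layout (hypothetical names):
#
#   <hf-repo>/
#     llama3.1-8b/
#       README.md    # where the model is from and what the artifacts entail
#       model.mlir
#       model.onnx
#     sdxl-unet/
#       README.md
#       model.mlir
from huggingface_hub import hf_hub_download

def fetch_model_artifact(model_dir: str, filename: str) -> str:
    """Download one artifact from the (hypothetical) test-suite repo and
    return the local cache path that huggingface_hub gives back."""
    return hf_hub_download(
        repo_id="example-org/iree-test-suite-artifacts",  # hypothetical repo
        filename=f"{model_dir}/{filename}",
    )
```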
The tests have become more generalized, moving away from ROCm/CPU-specific setup, but there are still traces of ROCm/CPU dependencies. The tests can be generalized further; ideally, we would be able to onboard any type of model and run quality and benchmark tests regardless of backend or setup.
Move the inputs and outputs of quality tests to Git LFS, and source the real weights from Hugging Face, making file retrieval more reliable.
For output values, add a non-exact comparison metric, like the cross entropy comparison in test_llama.py or the perplexity comparison in the IREE perplexity test.
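A minimal sketch of such a check, in the spirit of the cross entropy comparison in test_llama.py; the function and the 0.1 threshold here are illustrative, not the suite's actual metric.

```python
import numpy as np

def cross_entropy(expected_logits: np.ndarray, actual_logits: np.ndarray) -> float:
    """Mean cross entropy of the actual distribution against the expected
    argmax tokens, for [seq_len, vocab]-shaped logits arrays."""
    ref_tokens = expected_logits.argmax(axis=-1)
    # Numerically stable log-softmax of the actual logits.
    shifted = actual_logits - actual_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(ref_tokens)), ref_tokens].mean())

def check_outputs(expected: np.ndarray, actual: np.ndarray, threshold: float = 0.1):
    """Fail the test only if the non-exact metric exceeds the (made-up) threshold."""
    score = cross_entropy(expected, actual)
    assert score < threshold, f"cross entropy {score:.4f} exceeded {threshold}"
```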
Miscellaneous Tasks:
Currently, we store benchmark results in a Markdown file in the GitHub Actions summary. It would be best to store this data in a database and display it in a dashboard, with the ability to see both the benchmark results of a specific run and historical benchmark results in order to spot trends. As of 2/26/25, the dashboard is in progress, so this task will have to be done in conjunction with the dashboard work.