Continuing from the previous issue, IREE Issue 19115.
The model quality tests and benchmark tests have been migrated from the IREE repo to this repo and improved over the previous iteration. However, there is always room for more improvement!
The following tasks will help improve the model testing and benchmarking:
Compiler Tasks:
Remove the need for complex compiler flag configurations. Maintaining a long list of flags that varies with the submodel and data type of the model makes the testing hard to scale and hard for developers to debug or reproduce (a sketch of this flag problem is shown after the compiler tasks).
Remove the need for a spec file for tuning (in progress here: IREE Issue 20044). Currently we require a ~300-line spec file from the codegen side to get performance on our halo models. We should find a way to enable these optimizations by default in the compiler.
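As an illustration of the flag-configuration problem above, the sketch below shows the kind of per-model, per-dtype table the suite ends up maintaining. The model names and most flag names here are placeholders, not the suite's actual configuration.

```python
# Illustrative only: the entries below are placeholders, not the exact flags
# used by this suite. The point is the shape of the problem: every
# (submodel, dtype) pair carries its own hand-maintained flag list.
COMPILE_FLAGS = {
    ("sdxl-unet", "fp16"): [
        "--iree-hal-target-backends=rocm",           # real IREE flag
        "--iree-example-unet-specific-option=true",  # hypothetical
    ],
    ("sdxl-unet", "fp8"): [
        "--iree-hal-target-backends=rocm",
        "--iree-example-fp8-workaround=true",        # hypothetical
    ],
    # ...one entry per submodel/dtype, which is what makes this hard to scale
}

def compile_cmd(model: str, dtype: str, mlir_path: str) -> list[str]:
    """Build an iree-compile invocation from the per-model flag table."""
    return ["iree-compile", mlir_path, *COMPILE_FLAGS[(model, dtype)]]
```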
Testing Tasks:
Currently, there is boilerplate code to download each file that we need from Azure. Switch to an array of some sort for each halo model (example).
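A minimal sketch of what that could look like, assuming the files live under a single public Azure blob container; the base URL and file names below are hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical container path for one halo model.
AZURE_BASE_URL = "https://example.blob.core.windows.net/halo-models/llama3.1-8b"

# One list per halo model instead of one download block per file.
ARTIFACTS = [
    "model.mlirbc",
    "real_weights.irpa",
    "inputs.npy",
    "expected_outputs.npy",
]

def fetch_artifacts(dest_dir: Path) -> list[Path]:
    """Download every listed artifact into dest_dir, skipping files already present."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in ARTIFACTS:
        target = dest_dir / name
        if not target.exists():  # simple cache so reruns skip the download
            urlretrieve(f"{AZURE_BASE_URL}/{name}", str(target))
        paths.append(target)
    return paths
```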
Figure out the intake path for PyTorch models and set it up similarly to the ONNX path.
Figure out how we want to store all the model artifacts in this test suite. The artifacts we intake (MLIR, ONNX files, etc.) should all be stored on Hugging Face, with a directory for each model. Each Hugging Face directory should have a short README describing where the model is from and what it entails (a layout sketch is shown below).
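A sketch of the proposed layout and retrieval path follows; the repository id, directory names, and file names are hypothetical, and huggingface_hub is used only as one possible way to fetch the artifacts.

```python
# Proposed per-model layout (hypothetical names):
#
#   <hf-repo>/
#     llama3.1-8b/
#       README.md    # where the model is from and what the artifacts entail
#       model.mlir
#       model.onnx
#     sdxl-unet/
#       README.md
#       model.mlir
from huggingface_hub import hf_hub_download

def fetch_model_artifact(model_dir: str, filename: str) -> str:
    """Download one artifact from the (hypothetical) test-suite repo and
    return the local cache path that huggingface_hub gives back."""
    return hf_hub_download(
        repo_id="example-org/iree-test-suite-artifacts",  # hypothetical repo
        filename=f"{model_dir}/{filename}",
    )
```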
The tests have become more generalized, moving away from ROCm/CPU-specific setup, but there are still traces of ROCm/CPU dependencies. The tests can be generalized further; ideally, we would be able to onboard any type of model and run quality and benchmark tests regardless of backend or setup.
Move the inputs and outputs of quality tests to Git LFS, and source the real weights from Hugging Face, making file retrieval more reliable.
For output values, add a non-exact comparison metric, like the cross entropy comparison in test_llama.py or the perplexity comparison in the IREE perplexity test.
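A minimal sketch of such a check, in the spirit of the cross entropy comparison in test_llama.py; the function and the 0.1 threshold here are illustrative, not the suite's actual metric.

```python
import numpy as np

def cross_entropy(expected_logits: np.ndarray, actual_logits: np.ndarray) -> float:
    """Mean cross entropy of the actual distribution against the expected
    argmax tokens, for [seq_len, vocab]-shaped logits arrays."""
    ref_tokens = expected_logits.argmax(axis=-1)
    # Numerically stable log-softmax of the actual logits.
    shifted = actual_logits - actual_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(ref_tokens)), ref_tokens].mean())

def check_outputs(expected: np.ndarray, actual: np.ndarray, threshold: float = 0.1):
    """Fail the test only if the non-exact metric exceeds the (made-up) threshold."""
    score = cross_entropy(expected, actual)
    assert score < threshold, f"cross entropy {score:.4f} exceeded {threshold}"
```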
Miscellaneous Tasks:
Currently, we store benchmark results in a Markdown file in the GitHub Actions summary. It would be best to store this data in a database and display it in a dashboard, with the ability to see both the benchmark results of a specific run and historical benchmark results in order to spot trends. As of 2/26/25, the dashboard is in progress, so this task will have to be done in conjunction with the dashboard work.