Continued Improvements for Model Testing/Benchmarking #81

Open

geomin12 opened this issue Feb 26, 2025 · 0 comments
Labels: enhancement (New feature or request)


geomin12 commented Feb 26, 2025

Continuing from the previous issue, IREE 19115.

The model quality tests and benchmark tests have been migrated from IREE to this repo and improved over the previous iteration. However, there is still room for improvement!

The following tasks will further improve the model testing and benchmarking:

Compiler Tasks:

  • Remove the need for complex compiler flag configurations. Maintaining a long list of flags that varies by submodel and data type makes the testing hard to scale and hard for developers to debug or reproduce; see the sketch after this list.
  • Remove the need for a spec file for tuning (in progress: IREE Issue 20044). Currently we require a ~300-line spec file from the codegen side to get performance on our halo models. We should find a way to enable these optimizations by default in the compiler.
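To make the scaling problem concrete, here is a sketch of the kind of per-(submodel, dtype) flag table that has to be maintained today. The model names and flag sets are illustrative placeholders, not the exact configurations used in this repo.

```python
# Illustrative only: the keys and flag lists are hypothetical, but they show the shape
# of the per-(submodel, dtype) configuration that becomes hard to maintain and debug.
COMPILE_FLAGS = {
    ("sdxl-unet", "fp16"): [
        "--iree-hal-target-backends=rocm",
        "--iree-hip-target=gfx942",
        # ...plus a long tail of model-specific tuning flags...
    ],
    ("llama-8b", "fp8"): [
        "--iree-hal-target-backends=rocm",
        "--iree-hip-target=gfx942",
        # ...a different long list of flags...
    ],
}

# The goal is for a plain invocation that relies on compiler defaults to be enough, e.g.:
#   iree-compile model.mlir --iree-hal-target-backends=rocm -o model.vmfb
```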

Testing Tasks:

  • Currently, there is boilerplate code to download each file we need from Azure (example). Switch to an array of some sort per halo model; see the first sketch after this list.
  • Figure out the intake path for PyTorch models and set it up similarly to the ONNX path.
  • Figure out how we want to store all the model artifacts in this test suite. The artifacts we intake (MLIR, ONNX files, etc.) should all be stored on Hugging Face, with a directory for each model. Add a short README to each Hugging Face directory describing where the model is from and what it entails; see the Hugging Face sketch after this list.
  • The tests have become more generalized, moving away from ROCm/CPU-specific setup, but there are still hints of ROCm/CPU dependency. The tests can be generalized further; ideally, we would be able to onboard any type of model and run quality and benchmark tests regardless of backend/setup.
  • Move the inputs and outputs of quality tests to Git LFS, and source the real weights from Hugging Face, making retrieval of these files more reliable.
  • For output values, add a non-exact comparison metric, such as the cross-entropy comparison in test_llama.py or the perplexity comparison in the IREE perplexity test; see the cross-entropy sketch after this list.
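A minimal sketch of the array-based download mentioned in the Azure item above, assuming the artifacts are publicly readable blobs. The base URL, model name, and file names are placeholders, not the repo's real values.

```python
# Replace per-file boilerplate with one table of artifacts per halo model.
from pathlib import Path
from urllib.request import urlretrieve

AZURE_BASE_URL = "https://example.blob.core.windows.net/halo-models"  # hypothetical

HALO_MODEL_FILES = {
    "sdxl-unet": [
        "model.mlir",
        "inference_input.0.bin",
        "inference_output.0.bin",
        "real_weights.irpa",
    ],
}

def download_model_artifacts(model: str, dest_dir: Path) -> list[Path]:
    """Download every artifact for `model` instead of one hand-written call per file."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in HALO_MODEL_FILES[model]:
        target = dest_dir / name
        if not target.exists():  # simple cache to avoid re-downloading
            urlretrieve(f"{AZURE_BASE_URL}/{model}/{name}", str(target))
        paths.append(target)
    return paths
```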
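For the Hugging Face layout item, a sketch of how per-model directories (artifacts plus a README.md) could be fetched with huggingface_hub; the repo id and paths are hypothetical.

```python
# One Hugging Face repo, one subdirectory per model, each with its artifacts and README.
from huggingface_hub import hf_hub_download

def fetch_artifact(model_dir: str, filename: str) -> str:
    # e.g. fetch_artifact("sdxl-unet", "model.mlir") resolves to
    #   <repo>/sdxl-unet/model.mlir, cached locally by huggingface_hub.
    return hf_hub_download(
        repo_id="nod-ai/model-test-artifacts",  # hypothetical repo id
        filename=f"{model_dir}/{filename}",
        repo_type="model",
    )
```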
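For the non-exact comparison item, a sketch of a cross-entropy check in the spirit of test_llama.py; the implementation and tolerance below are illustrative, not the existing code.

```python
import numpy as np

def _log_softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(expected_logits: np.ndarray, actual_logits: np.ndarray) -> float:
    """Mean cross-entropy of the actual logits against the expected distribution."""
    expected_probs = np.exp(_log_softmax(expected_logits))
    return float(-(expected_probs * _log_softmax(actual_logits)).sum(axis=-1).mean())

def assert_outputs_close(expected_logits, actual_logits, threshold=0.1):
    # The threshold is a placeholder; in practice it would be calibrated per model.
    ce = cross_entropy(expected_logits, actual_logits)
    baseline = cross_entropy(expected_logits, expected_logits)  # entropy of the reference
    assert ce <= baseline + threshold, (
        f"cross-entropy {ce:.4f} exceeds allowed drift of {threshold} over {baseline:.4f}"
    )
```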

Miscellaneous Tasks:

  • Currently, we store benchmark results in a markdown file in the GitHub Actions summary. It would be better to store this data in a database and display it in a dashboard, with the ability to view both the benchmark results of a specific run and historical results in order to see trends. As of 2/26/25, this dashboard is in progress, so this task will have to be done in conjunction with the dashboard work; a sketch of a possible results store follows.
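A minimal sketch of what persisting benchmark results could look like, assuming a simple SQLite store; the table layout and column names are illustrative only, and the real implementation would depend on the dashboard work.

```python
import sqlite3

# One row per (run, model, backend, metric); the dashboard can then query a single run
# or the full history of a metric to chart trends.
SCHEMA = """
CREATE TABLE IF NOT EXISTS benchmark_results (
    run_id      TEXT NOT NULL,   -- e.g. the GitHub Actions run id
    recorded_at TEXT NOT NULL,   -- ISO-8601 timestamp
    model       TEXT NOT NULL,
    backend     TEXT NOT NULL,   -- e.g. rocm, cpu
    metric      TEXT NOT NULL,   -- e.g. latency_ms, dispatch_count
    value       REAL NOT NULL
)
"""

def record_result(db_path, run_id, recorded_at, model, backend, metric, value):
    with sqlite3.connect(db_path) as conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT INTO benchmark_results VALUES (?, ?, ?, ?, ?, ?)",
            (run_id, recorded_at, model, backend, metric, value),
        )
```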