Benchmark Llama 3.1 f16 and fp8 with CI #284
Conversation
artifacts_dir = "/data/extra/models/llama3.1_8B/"
self.irpa_path = artifacts_dir + "llama8b_f16.irpa"
self.irpa_path_fp8 = artifacts_dir + "llama8b_fp8.irpa"
self.output_mlir = self.repo_root + "llama8b_f16.mlir"
output files should be temporary. We don't want them lasting or being picked up accidentally on subsequent runs
Added a function to remove the output files at the end of each test
export_args = [
    "python3",
    "-m",
    "sharktank.examples.export_paged_llm_v1",
    "--irpa-file",
    irpa_path,
    "--output-mlir",
    output_mlir_path,
    "--output-config",
    output_json_path,
]
if attention_kernel in ["decomposed", "torch_sdpa"]:
    export_args.append("--attention-kernel")
    export_args.append(attention_kernel)

cmd = subprocess.list2cmdline(export_args)
return cmd
This can be separate from this PR, but we should use an in-memory API for all model exporting/importing workflows, rather than fork into a subprocess. Users shouldn't need to chain together python -m sharktank.examples. and iree-compile ... commands. We can aim for something like https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference:
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
(that's as minimal as it gets - we'll want to pass options like the compilation target though)
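For concreteness, a minimal sketch of the compile half of such an in-memory flow, using the existing iree.compiler Python API; the export half (producing MLIR text without a subprocess) is the piece sharktank would still need to expose, and the function name below is just illustrative:

import iree.compiler as ireec

def compile_exported_llm(mlir_text: str) -> bytes:
    # Compile exported MLIR text straight to a VM flatbuffer in memory,
    # replacing the separate iree-compile CLI invocation.
    # The backend name depends on the IREE release in use (e.g. "rocm"/"hip").
    return ireec.compile_str(mlir_text, target_backends=["rocm"])

Combined with a Python-level export entry point, a benchmark test could go from .irpa to .vmfb with one function call instead of chaining two CLI commands.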
Makes sense, I'll try to update it in the next PR
Updating the branch with changes from
088bde7 to 23e23ee
if return_code != 0:
    raise Exception(f"{cmd} failed to run")

def cleanup_output_files(
probably better to use tempfile like done here https://github.com/nod-ai/SHARK-Platform/blob/0c2e965c3ffe723db2fe2be9193c6d45fe558dbe/sharktank/tests/types/tensors_test.py#L71. That should better handle multiple concurrent runs potentially corrupting the data in these effectively hardcoded paths.
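A minimal sketch of what that could look like in this test class, assuming unittest-style setUp/tearDown (the class name and file names below are illustrative):

import tempfile
import unittest
from pathlib import Path

class BenchmarkLlama3_1_8B(unittest.TestCase):  # illustrative name
    def setUp(self):
        # Each test gets its own scratch directory, so concurrent runs
        # cannot clobber each other's exported artifacts.
        self._tmp = tempfile.TemporaryDirectory(prefix="llama8b_benchmark_")
        tmp_dir = Path(self._tmp.name)
        self.output_mlir = str(tmp_dir / "llama8b_f16.mlir")
        self.output_json = str(tmp_dir / "llama8b_f16.json")
        self.output_vmfb = str(tmp_dir / "llama8b_f16.vmfb")

    def tearDown(self):
        # Upload anything worth keeping before this point; cleanup() deletes the dir.
        self._tmp.cleanup()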
We want to save the output files to track any regressions, right? So, do we still need tempfile? I think os.mkdir() would make more sense to use in this case, especially when specifying the directory path of the artifacts to upload.
You can upload the output artifacts before tempfile gets rid of them. I don't think we want to keep the files on the CI system permanently. os.mkdir could still have issues with multiple simultaneous instances interfering with each other unless you are very careful with it.
How can I upload the output artifacts before tempfile gets rid of them?
See https://github.com/nod-ai/SHARK-TestSuite/blob/95b35cc076618ef79b7f84dc4117f4528451e5e6/e2eshark/_run_helper.py#L61 for an example of uploading files. You should just need to do that before closing the tempfile or directory that is created.
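A sketch of that ordering, copying (or uploading) the outputs somewhere persistent before the temporary directory is torn down; the persist_dir destination here stands in for whatever uploader ends up being used:

import shutil
import tempfile
from pathlib import Path

def run_benchmark_and_keep_artifacts(persist_dir: str) -> None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_mlir = Path(tmp_dir) / "llama8b_f16.mlir"
        output_vmfb = Path(tmp_dir) / "llama8b_f16.vmfb"
        # ... export, compile, and benchmark here, writing into tmp_dir ...

        # Preserve artifacts inside the `with` block, i.e. before
        # TemporaryDirectory deletes everything on exit.
        Path(persist_dir).mkdir(parents=True, exist_ok=True)
        for artifact in (output_mlir, output_vmfb):
            if artifact.exists():
                shutil.copy(artifact, persist_dir)
        # An Azure/sharkpublic upload call could go here instead of the copy.
    # tmp_dir is gone at this point; persist_dir can be uploaded by the CI job.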
Do we prefer sharkpublic? We will be uploading a bunch of artifacts there every night, and it might be hard to tell which run corresponds to which files. With GitHub Actions, we can just click on a run and see its respective output artifacts without cluttering the storage.
Does github actions work well for downloading things like vmfbs? If so, then that would work well.
I haven't tried uploading vmfbs before as artifacts, but we can check before merging
How much data are we talking about here? Cloud storage isn't cheap. Using GitHub artifacts with GitHub workflows is nice since it doesn't require separate privileges to maintain, but then you're limited by GitHub's retention policies.
If you want to keep historical data in an organized way, please design a database early. I've seen too many benchmarking systems without that structure and it makes analysis tricky.
For this PR itself, I'd try to keep the scope small and checkpoint the work, rather than get hung up on all these discussions. Fine to land and iterate a bit so things don't pile up.
Yeah, let's start with GitHub artifacts without a temp dir (it won't persist between runs anyway) and iterate from there.
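A small sketch of that starting point: the tests write into a fixed directory under the workspace, and a later actions/upload-artifact step in the workflow uploads that directory (the directory name below is arbitrary):

import os
from pathlib import Path

# GITHUB_WORKSPACE is set by GitHub Actions; fall back to the CWD for local runs.
workspace = os.getenv("GITHUB_WORKSPACE", os.getcwd())
output_dir = Path(workspace) / "benchmark_artifacts"
output_dir.mkdir(parents=True, exist_ok=True)

output_mlir = output_dir / "llama8b_f16.mlir"
output_json = output_dir / "llama8b_f16.json"
output_vmfb = output_dir / "llama8b_f16.vmfb"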
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

@pytest.mark.skip(reason="TODO: Need to plumb through attention_kernel")
why skip instead of xfail?
Because the test still runs successfully: it can export an IR, compile, and benchmark all the way through; however, it is not exporting the correct IR yet. The attention_kernel support needs to be added for it to execute as expected. If I mark the test as XFAIL, it will still output benchmark numbers for 8B f16 decomposed even though we want the non-decomposed numbers in this test.
We should raise an error when the attention_kernel flag is called with torch_sdpa until that is plumbed through, and then we can xfail here.
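A sketch of that combination, with illustrative function and test names (the real exporter wrapper in this PR builds the command shown earlier in the diff):

import pytest

def build_export_cmd(irpa_path, output_mlir_path, output_json_path, attention_kernel):
    # Fail loudly until non-decomposed attention is plumbed through, so an
    # xfail-ed test cannot silently benchmark the decomposed IR instead.
    if attention_kernel == "torch_sdpa":
        raise NotImplementedError(
            "attention_kernel='torch_sdpa' is not supported by the exporter yet"
        )
    ...

@pytest.mark.xfail(
    raises=NotImplementedError,
    reason="torch_sdpa attention is not plumbed through the exporter yet",
)
def test_benchmark8B_f16_non_decomposed():
    build_export_cmd("llama8b_f16.irpa", "out.mlir", "out.json", "torch_sdpa")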
    self.iree_run_decode_args,
    self.repo_root,
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)
To more easily handle regressions and such I think we really want to upload and store the model IR, vmfb, and dispatch dump so we can triage exactly what this benchmark used.
Makes sense, I will save the temporary files into the github actions file then.
I was kind of thinking of just having it upload to sharkpublic since that would be easy to access for those who want them. Not sure if github actions file would be better or not
def setUp(self):
    # TODO: add numpy files to Azure and download from it
    self.repo_root = os.getenv("SHARK_PLATFORM_REPO_ROOT")
    artifacts_dir = "/data/extra/models/llama3.1_8B/"
likely needs at least a comment that this is system-specific to the A30F runner. Anybody trying to replicate locally will need to know how to recreate it. Better would probably be adding an arg to pass this path, defaulting to this value for the runner.
General workflow tip for tests / CI: start by making tests runnable by a developer, then teach a CI machine how to follow those steps. We should never have a hardcoded path like this (and this will only work on systems that support such a file path, so no Windows).
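One way to do that while keeping the runner's default is to read the path from an environment variable (the variable and class names below are just suggestions):

import os
import unittest

class BaseBenchmarkTest(unittest.TestCase):  # illustrative class name
    def setUp(self):
        # Let local runs point at their own model directory; default to the
        # path that exists on the shared CI runner.
        artifacts_dir = os.getenv(
            "LLAMA_BENCHMARK_ARTIFACTS_DIR", "/data/extra/models/llama3.1_8B/"
        )
        self.irpa_path = os.path.join(artifacts_dir, "llama8b_f16.irpa")
        self.irpa_path_fp8 = os.path.join(artifacts_dir, "llama8b_fp8.irpa")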
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

@pytest.mark.xfail
nit: would be good to comment why we have the xfails for any newcomers not familiar with what isn't functional yet
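For example, the marker itself can carry the explanation (the reason text and test name here are illustrative):

import pytest

@pytest.mark.xfail(reason="fp8 export/compile is not functional yet")
def test_benchmark8B_fp8_decomposed():
    ...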
c287409 to 680fb1e
680fb1e to 3d13825
bb744f4 to 4cd38aa
f"--device=hip://{hip_device_id}", | ||
"--hip_use_streams=true", | ||
"--hip_allow_inline_execution=true", | ||
"--device_allocator=caching", |
Let's keep an eye on these flags to make sure we're measuring something representative of what codegen is working on and what model serving will do. These flags change behavior fairly significantly.
Looks like artifacts are not being uploaded as part of the latest run? Where are we dumping to
Signed-off-by: aviator19941 <[email protected]>
161eaa5 to c9fb66b
Signed-off-by: aviator19941 <[email protected]>
Nice, everything works!
Signed-off-by: aviator19941 <[email protected]>
Adds pytest benchmarks for Llama 3.1 f16 and fp8 to the CI. Currently, Llama 3.1 8B f16 is the only test that benchmarks fully through. The Llama 3.1 8B fp8, Llama 3.1 70B f16/fp8, and Llama 3.1 405B f16/fp8 tests are marked as XFAIL for now.