Benchmark Llama 3.1 f16 and fp8 with CI #284

Merged: 25 commits merged into main on Oct 22, 2024

Conversation

@aviator19941 (Collaborator) commented Oct 16, 2024:

Adds pytest benchmarks for f16 and fp8 to the CI. Currently, Llama 3.1 8B f16 is the only test that benchmarks all the way through; the Llama 3.1 8B fp8, Llama 3.1 70B f16/fp8, and Llama 3.1 405B f16/fp8 tests are marked XFAIL for now.

artifacts_dir = "/data/extra/models/llama3.1_8B/"
self.irpa_path = artifacts_dir + "llama8b_f16.irpa"
self.irpa_path_fp8 = artifacts_dir + "llama8b_fp8.irpa"
self.output_mlir = self.repo_root + "llama8b_f16.mlir"

Contributor:
Output files should be temporary. We don't want them persisting, or being picked up accidentally on subsequent runs.

Collaborator (author):
Added a function to remove the output files at the end of each test
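
For reference, a minimal sketch of what such a cleanup helper could look like (the body here is an illustrative assumption, not the exact implementation from the PR):

```python
import os

def cleanup_output_files(self, *paths):
    # Delete whatever the test produced so later runs start from a clean slate.
    for path in paths:
        if os.path.exists(path):
            os.remove(path)
```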

Comment on lines 27 to 43
export_args = [
"python3",
"-m",
"sharktank.examples.export_paged_llm_v1",
"--irpa-file",
irpa_path,
"--output-mlir",
output_mlir_path,
"--output-config",
output_json_path,
]
if attention_kernel in ["decomposed", "torch_sdpa"]:
export_args.append("--attention-kernel")
export_args.append(attention_kernel)

cmd = subprocess.list2cmdline(export_args)
return cmd

Member:
This can be separate from this PR, but we should use an in-memory API for all model exporting/importing workflows, rather than forking into a subprocess.

Users shouldn't need to chain together python -m sharktank.examples. and iree-compile ... commands. We can aim for something like https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

(that's as minimal as it gets - we'll want to pass options like the compilation target though)
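
For reference, the full offline-inference flow from the linked vLLM quickstart looks roughly like this (prompts and sampling parameters are placeholder values):

```python
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model once, then run batched generation in-process.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```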

Collaborator (author):
Makes sense, I'll try to update it in the next PR

@archana-ramalingam (Collaborator):

Updating the branch with changes from main will fix the second CI test.

if return_code != 0:
raise Exception(f"{cmd} failed to run")

def cleanup_output_files(

Contributor:
Probably better to use tempfile, as done here: https://github.com/nod-ai/SHARK-Platform/blob/0c2e965c3ffe723db2fe2be9193c6d45fe558dbe/sharktank/tests/types/tensors_test.py#L71. That would avoid multiple concurrent runs corrupting the data at these effectively hardcoded paths.
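
A minimal sketch of that suggestion, assuming the export/compile/benchmark steps are parameterized on the output paths; each test gets its own unique directory, so concurrent runs cannot clobber one another and cleanup happens automatically:

```python
import tempfile
from pathlib import Path

def test_benchmark_llama8b_f16():
    # TemporaryDirectory creates a unique, per-invocation directory and
    # removes it (and everything inside) when the with-block exits.
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp = Path(tmp_dir)
        output_mlir = tmp / "llama8b_f16.mlir"
        output_json = tmp / "llama8b_f16.json"
        output_vmfb = tmp / "llama8b_f16.vmfb"
        # ... export, compile, and benchmark using these paths ...
        output_mlir.write_text("placeholder")  # stands in for the real export step
        assert output_mlir.exists()
    # No cleanup_output_files call needed: the directory is already gone here.
```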

Collaborator (author):
We want to save the output files to track any regressions, right? So do we still need tempfile? I think os.mkdir() would make more sense to use in this case, especially when specifying the directory path of the artifacts to upload.

Contributor:
You can upload the output artifacts before tempfile gets rid of them. I don't think we want to keep the files on the CI system permanently. os.mkdir could still have issues with multiple simultaneous instances interfering with each other unless you are very careful with it.

Collaborator (author):
How can I upload the output artifacts before tempfile gets rid of them?

Contributor:
See https://github.com/nod-ai/SHARK-TestSuite/blob/95b35cc076618ef79b7f84dc4117f4528451e5e6/e2eshark/_run_helper.py#L61 for an example of uploading files. You should just need to do that before the tempfile or temporary directory that is created gets closed.

Contributor:
Do we prefer sharkpublic? We will be uploading a bunch of artifacts there every night, and it might be hard to tell which run corresponds to which files. With GitHub Actions, we can just click on a run and see its respective output artifacts without cluttering the storage.

Contributor:
Does GitHub Actions work well for downloading things like vmfbs? If so, that would work.

Contributor:
I haven't tried uploading vmfbs as artifacts before, but we can check before merging.

Member:
How much data are we talking about here? Cloud storage isn't cheap. Using GitHub artifacts with GitHub workflows is nice since it doesn't require separate privileges to maintain, but then you're limited by GitHub's retention policies.

If you want to keep historical data in an organized way, please design a database early. I've seen too many benchmarking systems without that structure and it makes analysis tricky.

For this PR itself, I'd try to keep the scope small and checkpoint the work, rather than get hung up on all these discussions. Fine to land and iterate a bit so things don't pile up.

Contributor:
Yeah, let's start with GitHub artifacts without a temp dir (it won't persist between runs anyway) and iterate from there.

)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

@pytest.mark.skip(reason="TODO: Need to plumb through attention_kernel")

Contributor:
why skip instead of xfail?

@aviator19941 (Collaborator, author) commented Oct 17, 2024:

Because the test still runs successfully: it exports an IR, compiles, and benchmarks all the way through, but it is not exporting the correct IR yet. attention_kernel support needs to be added for it to execute as expected. If I mark the test as XFAIL, it will still produce benchmark numbers for 8B f16 decomposed, even though we want the non-decomposed numbers in this test.

Contributor:
We should raise an error when the attention_kernel flag is set to torch_sdpa until that is plumbed through; then we can xfail here.
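
A sketch of that approach, assuming the test has a small helper that builds the export command (build_export_args and the test name below are hypothetical):

```python
import pytest

def build_export_args(irpa_path, output_mlir_path, output_json_path, attention_kernel):
    if attention_kernel == "torch_sdpa":
        # Fail loudly until torch_sdpa is actually plumbed through the exporter.
        raise NotImplementedError("torch_sdpa attention is not plumbed through yet")
    return [
        "python3", "-m", "sharktank.examples.export_paged_llm_v1",
        "--irpa-file", irpa_path,
        "--output-mlir", output_mlir_path,
        "--output-config", output_json_path,
        "--attention-kernel", attention_kernel,
    ]

@pytest.mark.xfail(raises=NotImplementedError, reason="torch_sdpa not plumbed through yet")
def test_benchmark_llama8b_f16_non_decomposed():
    build_export_args("llama8b_f16.irpa", "out.mlir", "out.json", "torch_sdpa")
```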

self.iree_run_decode_args,
self.repo_root,
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

Contributor:
To more easily handle regressions and such I think we really want to upload and store the model IR, vmfb, and dispatch dump so we can triage exactly what this benchmark used.

Collaborator (author):
Makes sense, I will save the temporary files as GitHub Actions artifacts then.

Contributor:
I was kind of thinking of just having it upload to sharkpublic, since that would be easy to access for those who want them. Not sure whether GitHub Actions artifacts would be better or not.

def setUp(self):
# TODO: add numpy files to Azure and download from it
self.repo_root = os.getenv("SHARK_PLATFORM_REPO_ROOT")
artifacts_dir = "/data/extra/models/llama3.1_8B/"

Contributor:
We likely need at least a comment that this path is specific to the A30F system; anybody trying to replicate locally will need to know how to recreate it. Better would probably be adding an argument to pass this path that defaults to this location for the runner (see the sketch below).
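
One way to do that with pytest, as a sketch (the option name --llama3-8b-artifacts-dir and the conftest wiring are assumptions, not what the PR implements):

```python
# conftest.py
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--llama3-8b-artifacts-dir",
        default="/data/extra/models/llama3.1_8B/",  # default used on the CI runner
        help="Directory containing the Llama 3.1 8B irpa files",
    )

@pytest.fixture
def llama3_8b_artifacts_dir(request):
    return request.config.getoption("--llama3-8b-artifacts-dir")
```

A developer replicating locally could then run pytest --llama3-8b-artifacts-dir=/path/to/models and the tests would pick up their own copies instead of the hardcoded runner path.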

Member:
General workflow tip for tests / CI: start by making tests runnable by a developer, then teach a CI machine how to follow those steps. We should never have a hardcoded path like this (and this will only work on systems that support such a file path, so no Windows).

)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

@pytest.mark.xfail

Contributor:
nit: would be good to comment why we have the xfails for any newcomers not familiar with what isn't functional yet

@aviator19941 force-pushed the benchmark-llama-fp8-ci branch 2 times, most recently from c287409 to 680fb1e on October 21, 2024 17:58
@aviator19941 force-pushed the benchmark-llama-fp8-ci branch from 680fb1e to 3d13825 on October 21, 2024 21:00
@aviator19941 force-pushed the benchmark-llama-fp8-ci branch from bb744f4 to 4cd38aa on October 22, 2024 00:13
Comment on lines 164 to 167
f"--device=hip://{hip_device_id}",
"--hip_use_streams=true",
"--hip_allow_inline_execution=true",
"--device_allocator=caching",

Member:
Let's keep an eye on these flags to make sure we're measuring something representative of what codegen is working on and what model serving will do. These flags change behavior fairly significantly.
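
For reference, these flags end up on an iree-benchmark-module invocation along the lines of the following sketch; the device id, vmfb path, function name, input file, and repetition count are placeholders, while the HIP and allocator flags are the ones quoted from the diff above:

```python
import subprocess

benchmark_args = [
    "iree-benchmark-module",
    "--device=hip://0",                    # placeholder for hip_device_id
    "--hip_use_streams=true",
    "--hip_allow_inline_execution=true",
    "--device_allocator=caching",
    "--module=llama8b_f16.vmfb",           # placeholder compiled module
    "--function=prefill_bs4",              # placeholder exported entry point
    "--input=@prefill_args.npy",           # placeholder input file
    "--benchmark_repetitions=10",
]
subprocess.run(benchmark_args, check=True)
```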

@saienduri (Contributor) commented Oct 22, 2024:

Looks like artifacts are not being uploaded as part of the latest run? Where are we dumping to ${{ github.workspace }}/${{ steps.date.outputs.date }}?

@aviator19941 force-pushed the benchmark-llama-fp8-ci branch from 161eaa5 to c9fb66b on October 22, 2024 21:54

@saienduri (Contributor) left a review comment:

Nice, everything works!

@aviator19941 merged commit 16be365 into main on Oct 22, 2024
3 checks passed
@aviator19941 deleted the benchmark-llama-fp8-ci branch on October 22, 2024 22:10