Benchmark Llama 3.1 f16 and fp8 with CI #284
Conversation
artifacts_dir = "/data/extra/models/llama3.1_8B/"
self.irpa_path = artifacts_dir + "llama8b_f16.irpa"
self.irpa_path_fp8 = artifacts_dir + "llama8b_fp8.irpa"
self.output_mlir = self.repo_root + "llama8b_f16.mlir"
output files should be temporary. We don't want them lasting or being picked up accidentally on subsequent runs
Added a function to remove the output files at the end of each test
export_args = [
    "python3",
    "-m",
    "sharktank.examples.export_paged_llm_v1",
    "--irpa-file",
    irpa_path,
    "--output-mlir",
    output_mlir_path,
    "--output-config",
    output_json_path,
]
if attention_kernel in ["decomposed", "torch_sdpa"]:
    export_args.append("--attention-kernel")
    export_args.append(attention_kernel)

cmd = subprocess.list2cmdline(export_args)
return cmd
This can be separate from this PR, but we should use an in-memory API for all model exporting/importing workflows, rather than fork into a subprocess. Users shouldn't need to chain together python -m sharktank.examples. and iree-compile ... commands. We can aim for something like https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference:
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
(that's as minimal as it gets - we'll want to pass options like the compilation target though)
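For concreteness, a minimal sketch of the compile half of such an in-memory flow, using the existing iree.compiler Python API; the export half (producing MLIR text without a subprocess) is the piece sharktank would still need to expose, and the function name below is just illustrative:

import iree.compiler as ireec

def compile_exported_llm(mlir_text: str) -> bytes:
    # Compile exported MLIR text straight to a VM flatbuffer in memory,
    # replacing the separate iree-compile CLI invocation.
    # The backend name depends on the IREE release in use (e.g. "rocm"/"hip").
    return ireec.compile_str(mlir_text, target_backends=["rocm"])

Combined with a Python-level export entry point, a benchmark test could go from .irpa to .vmfb with one function call instead of chaining two CLI commands.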
Makes sense, I'll try to update it in the next PR
Updating the branch with changes from
088bde7 to 23e23ee
if return_code != 0:
    raise Exception(f"{cmd} failed to run")

def cleanup_output_files(
probably better to use tempfile like done here https://github.com/nod-ai/SHARK-Platform/blob/0c2e965c3ffe723db2fe2be9193c6d45fe558dbe/sharktank/tests/types/tensors_test.py#L71. That should better handle multiple concurrent runs potentially corrupting the data in these effectively hardcoded paths.
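A minimal sketch of what that could look like in this test class, assuming unittest-style setUp/tearDown (the class name and file names below are illustrative):

import tempfile
import unittest
from pathlib import Path

class BenchmarkLlama3_1_8B(unittest.TestCase):  # illustrative name
    def setUp(self):
        # Each test gets its own scratch directory, so concurrent runs
        # cannot clobber each other's exported artifacts.
        self._tmp = tempfile.TemporaryDirectory(prefix="llama8b_benchmark_")
        tmp_dir = Path(self._tmp.name)
        self.output_mlir = str(tmp_dir / "llama8b_f16.mlir")
        self.output_json = str(tmp_dir / "llama8b_f16.json")
        self.output_vmfb = str(tmp_dir / "llama8b_f16.vmfb")

    def tearDown(self):
        # Upload anything worth keeping before this point; cleanup() deletes the dir.
        self._tmp.cleanup()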
We want to save the output files to track any regressions, right? So, do we still need tempfile? I think os.mkdir() would make more sense to use in this case, especially when specifying the directory path of the artifacts to upload.
You can upload the output artifacts before tempfile gets rid of them. I don't think we want to keep the files on the CI system permanently. os.mkdir could still have issues with multiple simultaneous instances interfering with each other unless you are very careful with it.
How can I upload the output artifacts before tempfile gets rid of them?
See https://github.com/nod-ai/SHARK-TestSuite/blob/95b35cc076618ef79b7f84dc4117f4528451e5e6/e2eshark/_run_helper.py#L61 for an example of uploading files. You should just need to do that before closing the tempfile or directory that is created.
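A sketch of that ordering, copying (or uploading) the outputs somewhere persistent before the temporary directory is torn down; the persist_dir destination here stands in for whatever uploader ends up being used:

import shutil
import tempfile
from pathlib import Path

def run_benchmark_and_keep_artifacts(persist_dir: str) -> None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        output_mlir = Path(tmp_dir) / "llama8b_f16.mlir"
        output_vmfb = Path(tmp_dir) / "llama8b_f16.vmfb"
        # ... export, compile, and benchmark here, writing into tmp_dir ...

        # Preserve artifacts inside the `with` block, i.e. before
        # TemporaryDirectory deletes everything on exit.
        Path(persist_dir).mkdir(parents=True, exist_ok=True)
        for artifact in (output_mlir, output_vmfb):
            if artifact.exists():
                shutil.copy(artifact, persist_dir)
        # An Azure/sharkpublic upload call could go here instead of the copy.
    # tmp_dir is gone at this point; persist_dir can be uploaded by the CI job.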
Do we prefer sharkpublic? We will be uploading a bunch of artifacts there every night, and it might be hard to tell which run corresponds to which files. With GitHub Actions, we can just click on a run and see its respective output artifacts without cluttering the storage.
Does github actions work well for downloading things like vmfbs? If so, then that would work well.
I haven't tried uploading vmfbs before as artifacts, but we can check before merging
How much data are we talking about here? Cloud storage isn't cheap. Using GitHub artifacts with GitHub workflows is nice since it doesn't require separate privileges to maintain, but then you're limited by GitHub's retention policies.
If you want to keep historical data in an organized way, please design a database early. I've seen too many benchmarking systems without that structure and it makes analysis tricky.
For this PR itself, I'd try to keep the scope small and checkpoint the work, rather than get hung up on all these discussions. Fine to land and iterate a bit so things don't pile up.
Yeah, let's start with GitHub artifacts without a temp dir (it won't persist between runs anyway) and iterate from there.
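A small sketch of that starting point: the tests write into a fixed directory under the workspace, and a later actions/upload-artifact step in the workflow uploads that directory (the directory name below is arbitrary):

import os
from pathlib import Path

# GITHUB_WORKSPACE is set by GitHub Actions; fall back to the CWD for local runs.
workspace = os.getenv("GITHUB_WORKSPACE", os.getcwd())
output_dir = Path(workspace) / "benchmark_artifacts"
output_dir.mkdir(parents=True, exist_ok=True)

output_mlir = output_dir / "llama8b_f16.mlir"
output_json = output_dir / "llama8b_f16.json"
output_vmfb = output_dir / "llama8b_f16.vmfb"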
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

@pytest.mark.skip(reason="TODO: Need to plumb through attention_kernel")
why skip instead of xfail?
Because the test still runs successfully: it can export an IR, compile, and benchmark all the way through; however, it is not exporting the correct IR yet. The attention_kernel support needs to be added for it to execute as expected. If I mark the test as XFAIL, it will still output benchmark numbers for 8B f16 decomposed even though we want the non-decomposed numbers in this test.
We should raise an error when the attention_kernel flag is called with torch_sdpa until that is plumbed through, and then we can xfail here.
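A sketch of that combination, with illustrative function and test names (the real exporter wrapper in this PR builds the command shown earlier in the diff):

import pytest

def build_export_cmd(irpa_path, output_mlir_path, output_json_path, attention_kernel):
    # Fail loudly until non-decomposed attention is plumbed through, so an
    # xfail-ed test cannot silently benchmark the decomposed IR instead.
    if attention_kernel == "torch_sdpa":
        raise NotImplementedError(
            "attention_kernel='torch_sdpa' is not supported by the exporter yet"
        )
    ...

@pytest.mark.xfail(
    raises=NotImplementedError,
    reason="torch_sdpa attention is not plumbed through the exporter yet",
)
def test_benchmark8B_f16_non_decomposed():
    build_export_cmd("llama8b_f16.irpa", "out.mlir", "out.json", "torch_sdpa")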
    self.iree_run_decode_args,
    self.repo_root,
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)
To more easily handle regressions and such I think we really want to upload and store the model IR, vmfb, and dispatch dump so we can triage exactly what this benchmark used.
Makes sense, I will save the temporary files into the github actions file then.
I was kind of thinking of just having it upload to sharkpublic since that would be easy to access for those who want them. Not sure if github actions file would be better or not
def setUp(self):
    # TODO: add numpy files to Azure and download from it
    self.repo_root = os.getenv("SHARK_PLATFORM_REPO_ROOT")
    artifacts_dir = "/data/extra/models/llama3.1_8B/"
likely needs at least a comment that this is system-specific to the A30F runner. Anybody trying to replicate locally will need to know how to recreate it. Better would probably be adding an arg to pass this path, defaulting to this value for the runner.
General workflow tip for tests / CI: start by making tests runnable by a developer, then teach a CI machine how to follow those steps. We should never have a hardcoded path like this (and this will only work on systems that support such a file path, so no Windows).
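One way to do that while keeping the runner's default is to read the path from an environment variable (the variable and class names below are just suggestions):

import os
import unittest

class BaseBenchmarkTest(unittest.TestCase):  # illustrative class name
    def setUp(self):
        # Let local runs point at their own model directory; default to the
        # path that exists on the shared CI runner.
        artifacts_dir = os.getenv(
            "LLAMA_BENCHMARK_ARTIFACTS_DIR", "/data/extra/models/llama3.1_8B/"
        )
        self.irpa_path = os.path.join(artifacts_dir, "llama8b_f16.irpa")
        self.irpa_path_fp8 = os.path.join(artifacts_dir, "llama8b_fp8.irpa")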
)
self.cleanup_output_files(output_mlir, output_json, output_vmfb)

@pytest.mark.xfail
nit: would be good to comment why we have the xfails for any newcomers not familiar with what isn't functional yet
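For example, the marker itself can carry the explanation (the reason text and test name here are illustrative):

import pytest

@pytest.mark.xfail(reason="fp8 export/compile is not functional yet")
def test_benchmark8B_fp8_decomposed():
    ...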
c287409 to 680fb1e
680fb1e to 3d13825
bb744f4 to 4cd38aa
f"--device=hip://{hip_device_id}", | ||
"--hip_use_streams=true", | ||
"--hip_allow_inline_execution=true", | ||
"--device_allocator=caching", |
Let's keep an eye on these flags to make sure we're measuring something representative of what codegen is working on and what model serving will do. These flags change behavior fairly significantly.
Looks like artifacts are not being uploaded as part of the latest run? Where are we dumping to
Signed-off-by: aviator19941 <[email protected]>
161eaa5 to c9fb66b
Signed-off-by: aviator19941 <[email protected]>
Nice, everything works!
Signed-off-by: aviator19941 <[email protected]>
Adds pytest benchmarks for Llama 3.1 f16 and fp8 to the CI. Currently, Llama 3.1 8B f16 is the only test that benchmarks fully through. The Llama 3.1 8B fp8, Llama 3.1 70B f16/fp8, and Llama 3.1 405B f16/fp8 tests are marked as XFAIL for now.