
Introduce Graph Profiler #9659

Open. Wants to merge 1 commit into master.
Conversation

max-krasnyansky (Collaborator)

Here is an attempt at reintroducing the original whole-graph profiler (LLAMA_PERF) with some additional features.
Not ready to merge into master, but useful for profiling different models (on CPU).

Features:

  • Profiling data is collected per op / per thread and aggregated at report time
    • We collect three per-op events: START, SYNC, END
    • This allows for measuring sync overhead per op
    • All timing data is in nanoseconds
  • The output is a markdown table
    • See example below (nice and readable output both in the terminal and rendered)
  • The output includes a compact dump of all tensor dims and data-types of the Op (see Flash Attention in the sample below)
    • This might be handy in other places.
  • Changes to the graph_compute are minimal / clean
  • If GGML_GRAPH_PROFILER is not defined, there is zero overhead (well, besides the extra pointer in the graph)
  • If the profiler is compiled in but not enabled, the overhead is relatively small (3 extra if checks per graph node)
    • We can probably optimize that out further (i.e. a separate compute loop without profiling calls)
  • Tested on a couple of platforms (M2 Max and Snapdragon Gen 3) with and without TSAN
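The timing model behind the proc/sync/total columns can be sketched as follows. This is a minimal illustration, assuming proc is the span between the START and SYNC events and sync is the span between SYNC and END; the struct and function names here are illustrative, not the actual ggml_profile_timing API.

```c
#include <stdint.h>

/* Three per-op events, as described above. */
enum prof_event { PROF_START, PROF_SYNC, PROF_END, PROF_NEVENTS };

typedef struct {
    uint64_t t[PROF_NEVENTS]; /* one nanosecond timestamp per event */
} prof_timing;

/* proc  = time computing the op            (SYNC - START)
 * sync  = overhead waiting at the barrier  (END  - SYNC)
 * total = proc + sync                      (END  - START) */
static uint64_t prof_proc (const prof_timing *p) { return p->t[PROF_SYNC] - p->t[PROF_START]; }
static uint64_t prof_sync (const prof_timing *p) { return p->t[PROF_END]  - p->t[PROF_SYNC]; }
static uint64_t prof_total(const prof_timing *p) { return p->t[PROF_END]  - p->t[PROF_START]; }

/* Report-time aggregation across threads: the slowest thread
 * bounds the op's wall-clock cost. */
static uint64_t prof_agg_total(const prof_timing *per_thread, int n_threads) {
    uint64_t max = 0;
    for (int i = 0; i < n_threads; i++) {
        const uint64_t t = prof_total(&per_thread[i]);
        if (t > max) {
            max = t;
        }
    }
    return max;
}
```

Measuring sync separately from proc is what makes the barrier overhead per op visible in the table above.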

Known issues:

  • Using an env var is just a hack for now
    • I'm thinking of adding ggml_init_param.graph_profile, or it'll be moved into the backend params
    • If set, it'll enable profiling for all the graphs created from that context
  • The params will probably be a string (filename:format or something like that)
  • The profile_data allocation is leaky
  • Currently it only works with the CPU backend
    • If there is interest, it should be easy to extend to other backends, which could update the per-node/per-thread
      ggml_profile_timing data (they'd have to collect it on the accelerator and then export it into this common format)
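A filename:format parameter as floated above might be parsed roughly like this. This is a hypothetical sketch: the function, struct, field names, the "last colon separates the format" rule, and the "md" default are all assumptions, not the actual implementation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parsed form of a "filename:format" profile parameter. */
typedef struct {
    char fname[256]; /* "stderr", "stdout", or a file path */
    char format[16]; /* e.g. "md" or "csv" (assumed formats) */
} prof_param;

static int prof_parse_param(const char *s, prof_param *out) {
    if (s == NULL || s[0] == '\0') {
        return -1; /* unset: profiling disabled */
    }
    const char *colon = strrchr(s, ':');
    if (colon == NULL) {
        /* no explicit format: assume markdown, matching the sample output */
        snprintf(out->fname, sizeof(out->fname), "%s", s);
        snprintf(out->format, sizeof(out->format), "md");
    } else {
        snprintf(out->fname, sizeof(out->fname), "%.*s", (int)(colon - s), s);
        snprintf(out->format, sizeof(out->format), "%s", colon + 1);
    }
    return 0;
}
```

Splitting on the last colon keeps paths containing colons ambiguous, so the real separator choice would need care; the sketch just shows the shape of the idea.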

See original PR #9647 for additional details.

Example of the terminal output

| node idx | op name | proc (nsec) | sync (nsec) | total (nsec) | op dims | op types | tensor name |
| -------: | :------ | ----------: | ----------: | -----------: | ------: | -------: | ----------: |
| 0000 |   GET_ROWS |      22863 |   47566042 |   47588905 |      4096:128256:1:1 x 2:1:1:1 ->   4096:2:1:1 |      q4_0 x i32 -> f32 |             inp_embd |
| 0001 |   RMS_NORM |      96926 |   47859323 |   47956249 |                     4096:2:1:1 ->   4096:2:1:1 |             f32 -> f32 |               norm-0 |
| 0002 |        MUL |      11875 |   36215730 |   36227605 |        4096:2:1:1 x 4096:1:1:1 ->   4096:2:1:1 |       f32 x f32 -> f32 |          attn_norm-0 |
| 0003 |    MUL_MAT |   27633698 |   21271875 |   48905573 |     4096:4096:1:1 x 4096:2:1:1 ->   4096:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Qcur-0 |
| 0004 |    RESHAPE |       9167 |   22963435 |   22972602 |                     4096:2:1:1 ->   128:32:1:1 |             f32 -> f32 |    Qcur-0 (reshaped) |
| 0005 |       ROPE |    1787239 |   22215678 |   24002917 |           128:32:1:1 x 2:1:1:1 ->   128:32:1:1 |       f32 x i32 -> f32 |               Qcur-0 |
| 0006 |    MUL_MAT |   20612237 |     698388 |   21310625 |     4096:1024:1:1 x 4096:2:1:1 ->   1024:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Kcur-0 |
| 0007 |    RESHAPE |      10989 |       7033 |      18022 |                     1024:2:1:1 ->    128:8:1:1 |             f32 -> f32 |    Kcur-0 (reshaped) |
| 0008 |       ROPE |    2566925 |     332969 |    2899894 |            128:8:1:1 x 2:1:1:1 ->    128:8:1:1 |       f32 x i32 -> f32 |               Kcur-0 |
| 0009 |    MUL_MAT |     489845 |      98229 |     588074 |     4096:1024:1:1 x 4096:2:1:1 ->   1024:2:1:1 |  q4_0_4x8 x f32 -> f32 |               Vcur-0 |
| 0010 |       VIEW |       7396 |       3074 |      10470 |                  4194304:1:1:1 ->   2048:1:1:1 |           q4_0 -> q4_0 |       k_cache_view-0 |
| 0011 |        CPY |      15782 |       7186 |      22968 |         128:8:1:1 x 2048:1:1:1 ->   2048:1:1:1 |     f32 x q4_0 -> q4_0 | k_cache_view-0 (copy of Kcur-0) |
| 0012 |       VIEW |       5836 |       2444 |       8280 |                  4194304:1:1:1 ->   2048:1:1:1 |           q4_0 -> q4_0 |       v_cache_view-0 |
| 0013 |        CPY |      14324 |      12865 |      27189 |        1024:2:1:1 x 2048:1:1:1 ->   2048:1:1:1 |     f32 x q4_0 -> q4_0 | v_cache_view-0 (copy of Vcur-0) |
| 0014 |    PERMUTE |       5572 |       2240 |       7812 |                     128:32:1:1 ->    128:2:1:1 |             f32 -> f32 |                  q-0 |
| 0015 |       VIEW |      10469 |       3543 |      14012 |                  4194304:1:1:1 ->  128:256:1:1 |           q4_0 -> q4_0 |                  k-0 |
| 0016 |       VIEW |       9375 |       3071 |      12446 |                  4194304:1:1:1 ->  128:256:1:1 |           q4_0 -> q4_0 |                  v-0 |
| 0017 |        CPY |      10053 |       4635 |      14688 |        256:32:1:1 x 256:32:1:1 ->   256:32:1:1 |       f32 x f16 -> f16 |       KQ_mask (copy) |
| 0018 | FLASH_ATTN_EXT |     113753 |      17809 |     131562 | 128:2:1:1 x 128:256:1:1 x 128:256:1:1 x 256:32:1:1 ->   128:32:1:1 | f32 x q4_0 x q4_0 x f16 -> f32 |              node_18 |


github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Sep 26, 2024
@shwu-nyunai

@max-krasnyansky I am using the graph-profiler branch but I'm unsure how to trigger and get the profiling details. Any docs, commands or references would be appreciated. Thanks.

@max-krasnyansky (Collaborator, Author)

> @max-krasnyansky I am using the graph-profiler branch but I'm unsure how to trigger and get the profiling details. Any docs, commands or references would be appreciated. Thanks.

Sorry for the delay.
I thought the example was included in the PR description, but it looks like I forgot to add it.

Here is how to build (arm64-ubuntu):

CC=clang-18 CXX=clang++-18 CFLAGS="-march=armv8.7a" CXXFLAGS="-march=armv8.7a" cmake -G Ninja -D GGML_GRAPH_PROFILER=ON -B build-arm64

cmake --build build-arm64

And here is how to run:

GGML_GRAPH_PROFILE=stderr ./build-arm64/bin/llama-cli -m ../gguf/llama-v3.1-8b-instruct.q4_0_4_8.gguf -p Hi -n 1

This will get you the output I included in the PR:

ggml-profile:| node idx | op name | proc (nsec) | sync (nsec) | total (nsec) | op dims | op types | tensor name |
ggml-profile:| -------: | :------ | ----------: | ----------: | -----------: | ------: | -------: | ----------: |
ggml-profile:| 0000 |   GET_ROWS |    6269425 |  135449736 |  141719161 |      4096:128256:1:1 x 2:1:1:1 ->   4096:2:1:1 |      q4_0 x i32 -> f32 |             inp_embd |
ggml-profile:| 0001 |   RMS_NORM |    1038728 |  154384695 |  155423423 |                     4096:2:1:1 ->   4096:2:1:1 |             f32 -> f32 |               norm-0 |
