[CPU][ARM]Snippets MatMul via brgemm emitter and executor #28304

chenhu-wang · 2025-01-08T06:59:48Z

Details:

Snippets MatMul via brgemm emitter and executor

Tickets:

CVS-151344

chenhu-wang · 2025-01-15T08:17:21Z

@a-sidorova, could you please review this as well, since you are reviewing #28229? The Snippets MatMul test cases pass on ARM. Thank you!

a-sidorova · 2025-01-15T08:55:24Z

src/plugins/intel_cpu/src/nodes/subgraph.cpp

+#    define SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64(PASS_PLACE, PASS, ...) \
+        backend_passes.emplace_back(PassPosition(PASS_PLACE), std::make_shared<PASS>(__VA_ARGS__))


Just to improve readability, and since we don't use SNIPPETS_REGISTER_PASS_ABSOLUTE_ARM64 yet, we can leave only the SNIPPETS_REGISTER_PASS_RELATIVE_ARM64 definition here.

a-sidorova · 2025-01-15T09:02:10Z

src/plugins/intel_cpu/src/transformations/tpp/aarch64/pass/lowered/brgemm_tpp_blocking.cpp

Could you please elaborate on why we need a separate aarch64-specific BrgemmTPPBlocking pass? This pass already exists and is used on x64. Can we use one pass for both archs?

a-sidorova · 2025-01-15T09:08:09Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/jit_brgemm_emitter.cpp

+void jit_brgemm_emitter::emit_impl(const std::vector<size_t>& in, const std::vector<size_t>& out) const {
+    validate_arguments(in, out);
+    std::unordered_set<size_t> exclude = {};
+    store_context(exclude);


Please note that we will merge #27391 soon. That PR makes register spilling more efficient: we will be able to spill only the needed (live) registers.

Just for information and to align with our other activities 😊

a-sidorova · 2025-01-15T09:18:35Z

src/plugins/intel_cpu/thirdparty/CMakeLists.txt

 if (ENABLE_SNIPPETS_LIBXSMM_TPP)
+    ov_add_compiler_flags(-Wno-missing-declarations)


Could you elaborate why you need to add this flag? Can we avoid it?

a-sidorova · 2025-01-15T09:20:07Z

cmake/features.cmake

@@ -52,7 +52,7 @@ ov_dependent_option (ENABLE_GPU_DEBUG_CAPS "enable GPU debug capabilities at run
 ov_dependent_option (ENABLE_CPU_DEBUG_CAPS "enable CPU debug capabilities at runtime" ON "ENABLE_DEBUG_CAPS;ENABLE_INTEL_CPU" OFF)
 ov_dependent_option (ENABLE_SNIPPETS_DEBUG_CAPS "enable Snippets debug capabilities at runtime" ON "ENABLE_DEBUG_CAPS" OFF)

-ov_dependent_option (ENABLE_SNIPPETS_LIBXSMM_TPP "allow Snippets to use LIBXSMM Tensor Processing Primitives" OFF "ENABLE_INTEL_CPU AND X86_64" OFF)


There are also RISCV64, AArch32, etc. Can we list only the supported archs in the condition?

a-sidorova · 2025-01-15T09:26:43Z

src/plugins/intel_cpu/src/emitters/snippets/brgemm_base.cpp

+#ifndef OPENVINO_ARCH_X86_64
+    config.update(DIM_CAST(M), DIM_CAST(N), DIM_CAST(K), 0, 0, 0, beta);
+    return;
+#endif
+


I'd say there should be a common cross-arch base class with a method init_runtime_params(M,N,K,LDA,LDB,LDC).
Then the x64 dnnl executors update LDB and LDA if needed, and aarch64 (tpp) calls update_config(...) with these parameters.

If we use an ifdef in common code, I think it is a sign of problematic code, and we should resolve it.
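A minimal sketch of that split, to make the suggestion concrete. All class names here (`BrgemmConfigBase`, `X64Config`, `Aarch64TppConfig`, `GemmRuntimeParams`) are hypothetical and are not the actual OpenVINO classes, and the LDB rounding rule in `X64Config` is invented purely for illustration. The point is only that the common entry point needs no `#ifdef`, because each architecture overrides a hook.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle; not an OpenVINO type.
struct GemmRuntimeParams {
    int64_t M, N, K;
    int64_t LDA, LDB, LDC;
};

// Hypothetical cross-arch base: the shared code path contains no #ifdef.
class BrgemmConfigBase {
public:
    virtual ~BrgemmConfigBase() = default;

    void init_runtime_params(int64_t M, int64_t N, int64_t K,
                             int64_t LDA, int64_t LDB, int64_t LDC) {
        assert(M >= 0 && N >= 0 && K >= 0);
        GemmRuntimeParams p{M, N, K, LDA, LDB, LDC};
        adjust(p);  // arch-specific hook
        m_params = p;
    }

    const GemmRuntimeParams& params() const { return m_params; }

protected:
    // Default: take the leading dimensions exactly as given.
    virtual void adjust(GemmRuntimeParams&) const {}

private:
    GemmRuntimeParams m_params{};
};

// Stand-in for an x64 executor that overrides LDB.
// The rounding rule below is an assumption made up for this example.
class X64Config : public BrgemmConfigBase {
public:
    explicit X64Config(int64_t n_block) : m_n_block(n_block) {}

protected:
    void adjust(GemmRuntimeParams& p) const override {
        // Round N up to a multiple of the block size.
        p.LDB = (p.N + m_n_block - 1) / m_n_block * m_n_block;
    }

private:
    int64_t m_n_block;
};

// Stand-in for the aarch64 (TPP) executor: no adjustment needed.
class Aarch64TppConfig : public BrgemmConfigBase {};
```

With this shape, the aarch64 path simply inherits the default hook, so the early `return` guarded by `#ifndef OPENVINO_ARCH_X86_64` would no longer be needed in the shared file.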

a-sidorova · 2025-01-15T09:27:49Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

+size_t BrgemmKernelConfig::StaticParams::compute_hash(dnnl::impl::cpu::aarch64::cpu_isa_t aarch_isa) {
+    return hash_combine(0, aarch_isa);
+}


We don't use aarch_isa in this config/kernel/executor, so this attribute should be removed from the params.

a-sidorova · 2025-01-15T09:37:59Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

+    const auto num_ins = expr->get_node()->get_input_size();
+    const auto num_outs = expr->get_node()->get_output_size();
+
+    size_t io_strides[num_ins + num_outs];


Is this a variable-length array? If it is, as far as I know, ISO C++ forbids it. Can we use std::vector here at least, or just three size_t variables?
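For reference, a minimal sketch of the `std::vector` variant. The helper name `collect_io_strides` and its two parameters are hypothetical; in the real code the strides would come from the expression's port descriptors, which are not shown in this hunk.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: gathers one stride per input port, then one per output port.
// The size is only known at run time, so a std::vector replaces the
// non-standard `size_t io_strides[num_ins + num_outs];` array.
std::vector<std::size_t> collect_io_strides(const std::vector<std::size_t>& in_strides,
                                            const std::vector<std::size_t>& out_strides) {
    std::vector<std::size_t> io_strides;
    io_strides.reserve(in_strides.size() + out_strides.size());
    io_strides.insert(io_strides.end(), in_strides.begin(), in_strides.end());
    io_strides.insert(io_strides.end(), out_strides.begin(), out_strides.end());
    assert(io_strides.size() == in_strides.size() + out_strides.size());
    return io_strides;
}
```

If Brgemm always has exactly two inputs and one output, three named `size_t` variables would avoid the heap allocation entirely.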

a-sidorova · 2025-01-15T09:43:49Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.cpp

+    auto refreshed_compile_flag =
+        config.get_beta() == 0 ? config.get_compile_flags() | LIBXSMM_GEMM_FLAG_BETA_0 : compile_flag;


You already update the compile flags on L121 in update_config, so config.get_compile_flags() will return the already updated flags.
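A small sketch of why the second OR is redundant, assuming the config derives its flags from beta inside its update method. `KernelConfig` is a hypothetical class, and `FLAG_BETA_0` is a placeholder bit, not the real value of `LIBXSMM_GEMM_FLAG_BETA_0`.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bit; NOT the real libxsmm constant.
constexpr uint32_t FLAG_BETA_0 = 1u << 2;

// Hypothetical config that keeps its flags in sync with beta.
class KernelConfig {
public:
    explicit KernelConfig(uint32_t base_flags) : m_base_flags(base_flags), m_compile_flags(base_flags) {
        // The beta bit is owned by update(), so it must not be part of the base flags.
        assert((base_flags & FLAG_BETA_0) == 0);
    }

    // Single place where the beta-dependent bit is set or cleared.
    void update(float beta) {
        m_beta = beta;
        m_compile_flags = (beta == 0.f) ? (m_base_flags | FLAG_BETA_0) : m_base_flags;
    }

    float get_beta() const { return m_beta; }
    uint32_t get_compile_flags() const { return m_compile_flags; }

private:
    uint32_t m_base_flags;
    uint32_t m_compile_flags;
    float m_beta = 1.f;
};
```

Under that assumption, a caller can pass `config.get_compile_flags()` straight to the kernel dispatch, since recomputing the flag from `get_beta()` yields the same value.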

a-sidorova · 2025-01-15T09:47:40Z

src/plugins/intel_cpu/src/emitters/tpp/aarch64/kernel_executors/brgemm.hpp

+    std::shared_ptr<libxsmm_gemmfunction> brgemm_kernel = nullptr;
+};
+
+class BrgemmKernelExecutor : public BrgemmBaseKernelExecutor,


By the way, are there any differences in config/executor/kernel between x64 and aarch64?
Is there any chance to have only one kernel executor for x64 and aarch64? As far as I know, all arch-dependent params are hidden in the TPP functions/kernels (this is an advantage of libxsmm).

chenhu-wang requested review from a team as code owners January 8, 2025 06:59

github-actions bot added the category: CPU OpenVINO CPU plugin label Jan 8, 2025

chenhu-wang marked this pull request as draft January 8, 2025 07:28

github-actions bot added the category: build OpenVINO cmake script / infra label Jan 9, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 3 times, most recently from a5b829d to 6ca4f1b Compare January 9, 2025 08:09

chenhu-wang marked this pull request as ready for review January 13, 2025 06:37

chenhu-wang requested a review from a team as a code owner January 13, 2025 06:37

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch 15 times, most recently from 982e2c2 to 6e05cb1 Compare January 15, 2025 05:33

a-sidorova reviewed Jan 15, 2025

a-sidorova self-assigned this Jan 15, 2025

v-Golubev self-assigned this Jan 16, 2025

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 6e05cb1 to 96e274c Compare January 22, 2025 06:36

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 96e274c to 6b55e68 Compare January 22, 2025 08:07

chenhu-wang added 5 commits January 23, 2025 13:30

brgemm emitter and executor

eef7dab

executor cache

bc9cb7e

update arm passes, test enable

1a9e7f9

update cmake

eb9af9c

refactor tpp on x64 and aarch64

f336769

chenhu-wang force-pushed the chenhu/snipppets_matmul_via_executor_on_arm branch from 6b55e68 to f336769 Compare January 23, 2025 05:34
