[Snippets][CPU] Added external repacking via BrgemmCopyB #28179

a-sidorova · 2024-12-23T06:55:47Z

Details:

Added two separate implementations of external repacking: one runs in the same parallel section as the kernel, the other runs in a separate parallel section before kernel execution.
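
To make the difference between the two variants concrete, here is a toy sketch. It is not the PR's code: `repack`, `kernel`, `run_separate` and `run_fused` are hypothetical names, a plain transpose stands in for the real BrgemmCopyB repacking, and the loops that would be parallel sections are written sequentially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // row-major storage

// Toy stand-in for repacking: transpose a K x N matrix into N x K.
// The real BrgemmCopyB layout is more involved; this is an assumption.
Matrix repack(const Matrix& b, std::size_t K, std::size_t N) {
    Matrix out(b.size());
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t n = 0; n < N; ++n)
            out[n * K + k] = b[k * N + n];
    return out;
}

// Toy kernel: C[m][n] = sum_k A[m][k] * Bt[n][k], where Bt is the repacked B.
Matrix kernel(const Matrix& a, const Matrix& bt, std::size_t M, std::size_t K, std::size_t N) {
    Matrix c(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t k = 0; k < K; ++k)
                c[m * N + n] += a[m * K + k] * bt[n * K + k];
    return c;
}

// Variant 1: a separate pass repacks every batch before any kernel runs.
std::vector<Matrix> run_separate(const Matrix& a, const std::vector<Matrix>& bs,
                                 std::size_t M, std::size_t K, std::size_t N) {
    std::vector<Matrix> repacked;
    for (const auto& b : bs)  // would be its own parallel section
        repacked.push_back(repack(b, K, N));
    std::vector<Matrix> out;
    for (const auto& bt : repacked)  // kernel parallel section
        out.push_back(kernel(a, bt, M, K, N));
    return out;
}

// Variant 2: each iteration repacks its own batch and then runs the kernel.
std::vector<Matrix> run_fused(const Matrix& a, const std::vector<Matrix>& bs,
                              std::size_t M, std::size_t K, std::size_t N) {
    std::vector<Matrix> out;
    for (const auto& b : bs) {  // single parallel section
        const Matrix bt = repack(b, K, N);
        out.push_back(kernel(a, bt, M, K, N));
    }
    return out;
}
```

Both variants compute the same results; they differ only in when the repacked buffers are produced and how long they have to live, which is presumably what the heuristic in the TODO list has to weigh.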

Tickets:

159886

TODO:

Adjust the heuristic that chooses the implementation
Add layout support
Merge "[Snippets] Disabled non-inplace ops tokenization on 2nd inputs of MatMuls in MHA" (a-sidorova/openvino#266) into this branch
Merge "[Snippets] SplitDimensionM: heuristic update" (a-sidorova/openvino#267) into this branch

IvanNovoselov

The first part

IvanNovoselov · 2024-12-24T15:27:24Z

src/plugins/intel_cpu/src/nodes/subgraph.h

@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor {
    inline void segfault_detector();
 #endif

-private:
-    std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs);
+#ifdef OPENVINO_ARCH_X86_64


Long #ifdef blocks significantly reduce readability. Do you think we can create different executors for x86-64 and ARM?
Note that we can use variadic templates + perfect forwarding to avoid repeating these long argument lists in constructors.

I considered separate executors during development.
First, I believe the current implementation is a temporary solution and that nothing arch-specific should live in the executors. If we implement separate executors now, we will have to merge them again at some point (probably not soon).
Second, we already have two separate executors, for dynamic and static shapes. If we also split executors by architecture, will there be four of them (2 architectures x 2 shape types)?

I'd say this is an open question, and I'm glad to discuss it offline 😃

I have an alternative proposal to avoid #ifdefs: maybe we can introduce this code for all platforms? For non-x86-64 ones we could place a single assert which checks that m_repacking_impl_type == CPURuntimeConfig::RepackingImplType::NONE.
If we decide to implement this feature for other platforms, almost all of the current functionality would be reused anyway.

Split into separate executors.
As for CPURuntimeConfig, that will be implemented after the holidays.
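
For reference, the variadic-template suggestion from the review could look roughly like the sketch below. The class names and constructor parameters are hypothetical and are not the executors added in this PR; the point is only that a derived executor does not have to repeat the base constructor's argument list.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical base executor with a long-ish constructor argument list.
class BaseExecutor {
public:
    BaseExecutor(std::string name, std::size_t threads, std::size_t buffer_size)
        : m_name(std::move(name)), m_threads(threads), m_buffer_size(buffer_size) {}
    virtual ~BaseExecutor() = default;

    virtual std::string arch() const = 0;
    const std::string& name() const { return m_name; }
    std::size_t threads() const { return m_threads; }
    std::size_t buffer_size() const { return m_buffer_size; }

private:
    std::string m_name;
    std::size_t m_threads;
    std::size_t m_buffer_size;
};

// Arch-specific executors forward whatever they receive to the base class.
class X64Executor : public BaseExecutor {
public:
    template <typename... Args>
    explicit X64Executor(Args&&... args) : BaseExecutor(std::forward<Args>(args)...) {}
    std::string arch() const override { return "x64"; }
};

class Aarch64Executor : public BaseExecutor {
public:
    template <typename... Args>
    explicit Aarch64Executor(Args&&... args) : BaseExecutor(std::forward<Args>(args)...) {}
    std::string arch() const override { return "aarch64"; }
};
```

When the derived class adds no constructor parameters of its own, an inheriting constructor (`using BaseExecutor::BaseExecutor;`) gives the same effect with less code; the forwarding form is mainly useful when the derived constructor takes extra arguments before passing the rest on.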

src/plugins/intel_cpu/src/nodes/subgraph.cpp

src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp

src/common/snippets/include/snippets/op/reshape.hpp

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp

src/plugins/intel_cpu/src/nodes/subgraph.cpp

v-Golubev · 2024-12-29T15:33:46Z

src/plugins/intel_cpu/src/nodes/subgraph.h

@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor {
    inline void segfault_detector();
 #endif

-private:
-    std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs);
+#ifdef OPENVINO_ARCH_X86_64


I have an alternative proposal to avoid #ifdefs: maybe we can introduce this code for all platforms? For non-x86-64 ones we could place a single assert which checks that m_repacking_impl_type == CPURuntimeConfig::RepackingImplType::NONE.
If we decide to implement this feature for other platforms, almost all of the current functionality would be reused anyway.
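
A minimal sketch of this proposal, under stated assumptions: only the `NONE` enumerator comes from the comment above, while the other enumerators, the `arch_supports_repacking` flag and both function names are hypothetical. The idea is that the repacking code is compiled everywhere and a single runtime check guards the unsupported platforms.

```cpp
#include <cassert>
#include <stdexcept>

// Only NONE is named in the review; the other enumerators are placeholders.
enum class RepackingImplType { NONE, IN_PARALLEL, SEPARATE };

// True when the chosen implementation is allowed on the current platform.
bool is_repacking_config_valid(RepackingImplType type, bool arch_supports_repacking) {
    return arch_supports_repacking || type == RepackingImplType::NONE;
}

// Single guard that replaces the #ifdef blocks around the shared code.
void check_repacking_config(RepackingImplType type, bool arch_supports_repacking) {
    if (!is_repacking_config_valid(type, arch_supports_repacking))
        throw std::logic_error("external repacking is not supported on this platform");
}
```

In the plugin itself the platform flag would presumably be derived once from the build target, so the check stays in one place instead of being spread over several #ifdef regions.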

IvanNovoselov

Second part

src/common/snippets/include/snippets/op/reshape.hpp

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

IvanNovoselov · 2024-12-30T12:36:15Z

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

 };

 class CPURuntimeConfigurator : public ov::snippets::RuntimeConfigurator {
 public:
-    CPURuntimeConfigurator();
+    CPURuntimeConfigurator(ov::intel_cpu::MultiCacheWeakPtr cache = {});


Minor: do we need a default argument here? Wouldn't it be safer to force the user to always provide a cache pointer?

I did it this way because of ARM: unlike x64, ARM doesn't have a cache in CPUTargetMachine.
As for safety, I agree that the default argument should not be here: it's not safe, at least on x64, where we use this cache.
So I removed the default argument and added a cache constructor argument on aarch64.
Thanks!

Default argument is still there, please take a look

Looks like I updated the corresponding code on aarch64 but forgot to remove the default argument 😄
Thanks for the reminder!
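
The safety concern in this thread can be illustrated with a small sketch. `Cache`, `CacheWeakPtr` and `Configurator` are simplified, hypothetical stand-ins for `MultiCache`, `MultiCacheWeakPtr` and `CPURuntimeConfigurator`; the real classes have more members.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

struct Cache {};  // hypothetical stand-in for the plugin's cache type
using CacheWeakPtr = std::weak_ptr<Cache>;

class Configurator {
public:
    // No default argument: every caller has to decide which cache to pass.
    explicit Configurator(CacheWeakPtr cache) : m_cache(std::move(cache)) {}

    bool has_cache() const { return !m_cache.expired(); }

private:
    CacheWeakPtr m_cache;
};

// With `CacheWeakPtr cache = {}` this would be true, and a forgotten cache
// would only show up at run time as an expired pointer.
static_assert(!std::is_default_constructible<Configurator>::value,
              "a cache argument is required");
```

A caller that really has no cache can still pass an empty `CacheWeakPtr{}` explicitly, which at least makes the choice visible at the call site.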

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

IvanNovoselov · 2024-12-30T13:00:00Z

src/common/snippets/src/runtime_configurator.cpp

+            const auto& reordered_reshape_it = std::find_if(shape_infer_seq.cbegin(), shape_infer_seq.cend(),
+                                                            [](const ExpressionPtr& expr) {
+                                                               return ov::is_type<op::ReshapeWithOrder>(expr->get_node());
+                                                            });
+            if (reordered_reshape_it != shape_infer_seq.cend()) {
+                const auto& reshape = *reordered_reshape_it;
+                const auto& etype = reshape->get_node()->get_output_element_type(0);
+                update_io_parameters(reshape->get_input_port_descriptor(0), etype);
+                continue;
+            }


Why do we check only ReshapeWithOrder here? Ideally, we should process all shape infer ops uniformly. Can we process all input port descriptors in the shape infer sequence? We can also skip the ones with a planar layout, if needed.
This would be much more generic and would automatically work for future shape-infer ops.
It would also work if there is more than one ReshapeWithOrder in a sequence.

I'd say that Reorder (ReshapeWithOrder) is not just reshaping. It is also a fake op that was extracted. We must not take the descriptor from the consumer of the shape infer op (Brgemm, for example); we should take the descriptor of Reorder to correctly collect the layout and shape for the subsequent data offset calculation.

If we take a look at an Eltwise Subgraph with blocked shapes, there will be a RankNormalization with a 1 inserted into the last dim.
We cannot take the descriptor of RankNormalization there: the data offset calculation would be incorrect.

I think we need to discuss it offline 🤔

Discussed offline: we decided to create a ticket for further discussion and investigation.
I left a comment with the ticket number.
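
To make the two positions in this thread easier to compare, here is a simplified sketch. `ShapeInferOp` is a hypothetical stand-in for the real expression and port descriptor types, and neither function is the code that was merged; the thread ended with a ticket, not with a decision.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for an expression in the shape infer sequence.
struct ShapeInferOp {
    std::string type;                 // e.g. "ReshapeWithOrder", "RankNormalization"
    std::vector<std::size_t> layout;  // layout of input port 0
};

// A layout is planar when it is the identity permutation 0, 1, 2, ...
bool is_planar(const std::vector<std::size_t>& layout) {
    for (std::size_t i = 0; i < layout.size(); ++i)
        if (layout[i] != i)
            return false;
    return true;
}

// Approach in the diff: use the first ReshapeWithOrder, if any.
// Returns its index, or seq.size() when there is none.
std::size_t first_reorder(const std::vector<ShapeInferOp>& seq) {
    const auto it = std::find_if(seq.cbegin(), seq.cend(), [](const ShapeInferOp& op) {
        return op.type == "ReshapeWithOrder";
    });
    return static_cast<std::size_t>(it - seq.cbegin());
}

// Reviewer's suggestion: treat all shape infer ops uniformly and
// skip only the ones whose layout is planar.
std::vector<std::size_t> all_non_planar(const std::vector<ShapeInferOp>& seq) {
    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < seq.size(); ++i)
        if (!is_planar(seq[i].layout))
            result.push_back(i);
    return result;
}
```

The sketch deliberately ignores the author's objection about RankNormalization descriptors; whether a uniform pass can compute correct data offsets is exactly what was left for the follow-up ticket.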

src/plugins/intel_cpu/src/nodes/executors/x64/subgraph.cpp

src/plugins/intel_cpu/src/nodes/executors/aarch64/subgraph.hpp

...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp

src/common/snippets/src/op/reorder.cpp

src/common/snippets/src/runtime_configurator.cpp

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp

v-Golubev · 2025-01-02T14:21:42Z

src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp

 };

 class CPURuntimeConfigurator : public ov::snippets::RuntimeConfigurator {
 public:
-    CPURuntimeConfigurator();
+    CPURuntimeConfigurator(ov::intel_cpu::MultiCacheWeakPtr cache = {});


Default argument is still there, please take a look

src/plugins/intel_cpu/src/nodes/executors/subgraph.hpp

...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp

src/plugins/intel_cpu/src/nodes/subgraph.h

src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp

src/plugins/intel_cpu/src/emitters/snippets/repacked_input.hpp

...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp

v-Golubev

LGTM 👍

[Snippets][CPU] Fixed build on non-x64 platforms
[Snippets][CPU] Updated heuristic
[Snippets][CPU] Added inplace-Transpose support
[Snippets][CPU] Applied Ivan comments
[Snippets][CPU] Fixed code style
[Snippets][CPU] Fixed prim isa
[Snippets][CPU] Small optimizations
[Snippets][CPU] Fixed the build
[Snippets][CPU] Applied Vladislav comments
[Snippes][CPU] Created new Executors
[Snippets][CPU] Applied Ivan comments 2
[Snippets][CPU] Applied Vladislav & Ivan comments 3

github-actions bot added the category: CPU OpenVINO CPU plugin label Dec 23, 2024

a-sidorova force-pushed the feature/snippets/external_repacking branch 6 times, most recently from f2523ef to b665659 Compare December 23, 2024 12:05

a-sidorova added this to the 2025.0 milestone Dec 23, 2024

a-sidorova marked this pull request as ready for review December 23, 2024 13:29

a-sidorova requested review from a team as code owners December 23, 2024 13:29

IvanNovoselov reviewed Dec 24, 2024

v-Golubev reviewed Dec 29, 2024

a-sidorova force-pushed the feature/snippets/external_repacking branch 5 times, most recently from 2ee36e1 to e1951cc Compare December 30, 2024 17:30

IvanNovoselov reviewed Dec 30, 2024

IvanNovoselov reviewed Dec 31, 2024

a-sidorova force-pushed the feature/snippets/external_repacking branch 5 times, most recently from b008e3d to 1908740 Compare January 2, 2025 11:22

a-sidorova requested review from v-Golubev and IvanNovoselov January 2, 2025 11:30

v-Golubev reviewed Jan 2, 2025

IvanNovoselov reviewed Jan 2, 2025

a-sidorova requested a review from IvanNovoselov January 3, 2025 12:23

a-sidorova requested a review from v-Golubev January 3, 2025 12:23

a-sidorova assigned v-Golubev and IvanNovoselov Jan 3, 2025

v-Golubev approved these changes Jan 3, 2025

a-sidorova mentioned this pull request Jan 6, 2025

[Snippets] Disabled non-inplace ops tokenization on 2nd inputs of MatMuls in MHA a-sidorova/openvino#266

Open

a-sidorova force-pushed the feature/snippets/external_repacking branch from f104183 to 4bfdc7f Compare January 6, 2025 09:30

This was referenced Jan 6, 2025

[Snippets][CPU] Enabled dynamic INT8,BF16,FP16 MHA tokenization #28276

Draft

[Snippets] SplitDimensionM: heuristic update a-sidorova/openvino#267

Open

IvanNovoselov approved these changes Jan 6, 2025
