Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Snippets][CPU] Added external repacking via BrgemmCopyB #28179

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

a-sidorova
Copy link
Contributor

@a-sidorova a-sidorova commented Dec 23, 2024

Details:

  • Added two separate implementation for external repacking: in parallel section with kernel and in the separate parallel section before kernel execution

Tickets:

  • 159886

TODO:

@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Dec 23, 2024
@a-sidorova a-sidorova force-pushed the feature/snippets/external_repacking branch 6 times, most recently from f2523ef to b665659 Compare December 23, 2024 12:05
@a-sidorova a-sidorova added this to the 2025.0 milestone Dec 23, 2024
@a-sidorova a-sidorova marked this pull request as ready for review December 23, 2024 13:29
@a-sidorova a-sidorova requested review from a team as code owners December 23, 2024 13:29
Copy link
Contributor

@IvanNovoselov IvanNovoselov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first part

@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor {
inline void segfault_detector();
#endif

private:
std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs);
#ifdef OPENVINO_ARCH_X86_64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long #ifdef blocks significantly reduce readability. Do you think we can create different executors of x86-64 and arm?
Note that we can use variadic templates + perfect forwarding to avoid repeating these long argument lists in constructors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about the separate executors during developing.
Firstly, I believe that the current impl is temporary solution and nothing arch-specific should be in executors. So if we implement the separate executors, to after some time (probably, not so soon) we will should to unite them again.
Secondly, we have two separate executors for dynamic and static shapes. If we want to split executors by arch, will be there 4 executors (2 arch x 2 shape types)?

I'd say that this is open question. And I'm glad to discuss it offline 😃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an alternative proposal to avoid #ifdefs: maybe we can introduce this code for all platforms? But for non-x86-64 ones we could place one assert which checks that m_repacking_impl_type == CPURuntimeConfig::RepackingImplType::NONE.
If we decided to implement this feature for other platforms, almost all the current functional would be reused anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splited into separate executors.
As for CPURuntimeConfig - will be implemented after holidays

src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
src/common/snippets/include/snippets/op/reshape.hpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.cpp Outdated Show resolved Hide resolved
@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor {
inline void segfault_detector();
#endif

private:
std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs);
#ifdef OPENVINO_ARCH_X86_64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an alternative proposal to avoid #ifdefs: maybe we can introduce this code for all platforms? But for non-x86-64 ones we could place one assert which checks that m_repacking_impl_type == CPURuntimeConfig::RepackingImplType::NONE.
If we decided to implement this feature for other platforms, almost all the current functional would be reused anyway.

@a-sidorova a-sidorova force-pushed the feature/snippets/external_repacking branch 5 times, most recently from 2ee36e1 to e1951cc Compare December 30, 2024 17:30
Copy link
Contributor

@IvanNovoselov IvanNovoselov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second part

src/common/snippets/include/snippets/op/reshape.hpp Outdated Show resolved Hide resolved
};

class CPURuntimeConfigurator : public ov::snippets::RuntimeConfigurator {
public:
CPURuntimeConfigurator();
CPURuntimeConfigurator(ov::intel_cpu::MultiCacheWeakPtr cache = {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: do we need a default argument here? Wouldn't it be safer to force user to always provide a cache pointer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made so because of ARM- Arm doesn't have cache in CPUTargetMachine as x64 has it.
As for safety, I think that default argument should not be here - it's not safe for x64 at least where we use this cache.
So I removed default arg and added ctor arg cache to aarch64.
Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default argument is still there, please take a look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like I updated the corresponding code on aarch64 but forgot to remove default arg 😄
Thanks for the reminder!

Comment on lines 124 to 136
const auto& reordered_reshape_it = std::find_if(shape_infer_seq.cbegin(), shape_infer_seq.cend(),
[](const ExpressionPtr& expr) {
return ov::is_type<op::ReshapeWithOrder>(expr->get_node());
});
if (reordered_reshape_it != shape_infer_seq.cend()) {
const auto& reshape = *reordered_reshape_it;
const auto& etype = reshape->get_node()->get_output_element_type(0);
update_io_parameters(reshape->get_input_port_descriptor(0), etype);
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check only ReshapeWithOrder here? Ideally, we should process all shape infer ops uniformly here. Can we process all input port descriptors in shape infer sequence? Can also skip the ones with planar layout, if needed.
This would be much more generic and will automatically work for future shape-infer ops.
It will also work is there will be more than one ReshapeWithOrder in a sequence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that Reorder (ReshapeWithOrder) is not just reshaping. Also, this is fake op which were extracted. But we need to take descriptor not from shape infer op consumer (brgemm for example), we should take desc of Reorder to correctly collect layout and shape for further data offset calculation.

If we take a look at Eltwise Subgraph with blocked shapes, there will be RankNormalization with 1 insertion to last dim.
We cannot take in desc of RankNormalization - data offset calculation will be incorrect.

I think we need to discuss it offline 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline: decided to create the ticket for the further discussion and investigation.
Left the comment with the ticket number

src/plugins/intel_cpu/src/nodes/executors/x64/subgraph.cpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/executors/x64/subgraph.cpp Outdated Show resolved Hide resolved
@a-sidorova a-sidorova force-pushed the feature/snippets/external_repacking branch 5 times, most recently from b008e3d to 1908740 Compare January 2, 2025 11:22
src/common/snippets/src/op/reorder.cpp Outdated Show resolved Hide resolved
src/common/snippets/src/runtime_configurator.cpp Outdated Show resolved Hide resolved
};

class CPURuntimeConfigurator : public ov::snippets::RuntimeConfigurator {
public:
CPURuntimeConfigurator();
CPURuntimeConfigurator(ov::intel_cpu::MultiCacheWeakPtr cache = {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default argument is still there, please take a look

src/plugins/intel_cpu/src/nodes/executors/subgraph.hpp Outdated Show resolved Hide resolved
src/plugins/intel_cpu/src/nodes/subgraph.h Outdated Show resolved Hide resolved
Copy link
Contributor

@v-Golubev v-Golubev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

[Snippets][CPU] Fixed build on non-x64 platforms

[Snippets][CPU] Updated heuristic

[Snippets][CPU] Added inplace-Transpose support

[Snippets][CPU] Applied Ivan comments

[Snippets][CPU] Fixed code style

[Snippets][CPU] Fixed prim isa

[Snippets][CPU] Small optimizations

[Snippets][CPU] Fixed the build

[Snippets][CPU] Applied Vladislav comments

[Snippes][CPU] Created new Executors

[Snippets][CPU] Applied Ivan comments 2

[Snippets][CPU] Applied Vladislav & Ivan comments 3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: CPU OpenVINO CPU plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants