-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Snippets][CPU] Added external repacking via BrgemmCopyB #28179
base: master
Are you sure you want to change the base?
[Snippets][CPU] Added external repacking via BrgemmCopyB #28179
Conversation
f2523ef
to
b665659
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first part
@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor { | |||
inline void segfault_detector(); | |||
#endif | |||
|
|||
private: | |||
std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs); | |||
#ifdef OPENVINO_ARCH_X86_64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Long #ifdef
blocks significantly reduce readability. Do you think we can create different executors of x86-64 and arm?
Note that we can use variadic templates + perfect forwarding to avoid repeating these long argument lists in constructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about the separate executors during developing.
Firstly, I believe that the current impl is temporary solution and nothing arch-specific should be in executors. So if we implement the separate executors, to after some time (probably, not so soon) we will should to unite them again.
Secondly, we have two separate executors for dynamic and static shapes. If we want to split executors by arch, will be there 4 executors (2 arch x 2 shape types)?
I'd say that this is open question. And I'm glad to discuss it offline 😃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an alternative proposal to avoid #ifdef
s: maybe we can introduce this code for all platforms? But for non-x86-64 ones we could place one assert which checks that m_repacking_impl_type == CPURuntimeConfig::RepackingImplType::NONE
.
If we decided to implement this feature for other platforms, almost all the current functional would be reused anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splited into separate executors.
As for CPURuntimeConfig
- will be implemented after holidays
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp
Show resolved
Hide resolved
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp
Outdated
Show resolved
Hide resolved
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Outdated
Show resolved
Hide resolved
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Show resolved
Hide resolved
@@ -172,10 +173,48 @@ class Subgraph::SubgraphExecutor { | |||
inline void segfault_detector(); | |||
#endif | |||
|
|||
private: | |||
std::vector<MemoryPtr> reorder_inputs(const dnnl::stream& strm, const std::vector<MemoryPtr>& inMemPtrs); | |||
#ifdef OPENVINO_ARCH_X86_64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an alternative proposal to avoid #ifdef
s: maybe we can introduce this code for all platforms? But for non-x86-64 ones we could place one assert which checks that m_repacking_impl_type == CPURuntimeConfig::RepackingImplType::NONE
.
If we decided to implement this feature for other platforms, almost all the current functional would be reused anyway.
2ee36e1
to
e1951cc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second part
src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp
Outdated
Show resolved
Hide resolved
}; | ||
|
||
class CPURuntimeConfigurator : public ov::snippets::RuntimeConfigurator { | ||
public: | ||
CPURuntimeConfigurator(); | ||
CPURuntimeConfigurator(ov::intel_cpu::MultiCacheWeakPtr cache = {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: do we need a default argument here? Wouldn't it be safer to force user to always provide a cache pointer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made so because of ARM- Arm doesn't have cache
in CPUTargetMachine
as x64 has it.
As for safety, I think that default argument should not be here - it's not safe for x64 at least where we use this cache.
So I removed default arg and added ctor arg cache
to aarch64.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default argument is still there, please take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like I updated the corresponding code on aarch64 but forgot to remove default arg 😄
Thanks for the reminder!
src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.hpp
Show resolved
Hide resolved
const auto& reordered_reshape_it = std::find_if(shape_infer_seq.cbegin(), shape_infer_seq.cend(), | ||
[](const ExpressionPtr& expr) { | ||
return ov::is_type<op::ReshapeWithOrder>(expr->get_node()); | ||
}); | ||
if (reordered_reshape_it != shape_infer_seq.cend()) { | ||
const auto& reshape = *reordered_reshape_it; | ||
const auto& etype = reshape->get_node()->get_output_element_type(0); | ||
update_io_parameters(reshape->get_input_port_descriptor(0), etype); | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we check only ReshapeWithOrder
here? Ideally, we should process all shape infer ops uniformly here. Can we process all input port descriptors in shape infer sequence? Can also skip the ones with planar layout, if needed.
This would be much more generic and will automatically work for future shape-infer ops.
It will also work is there will be more than one ReshapeWithOrder in a sequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say that Reorder
(ReshapeWithOrder
) is not just reshaping. Also, this is fake op which were extracted. But we need to take descriptor not from shape infer op
consumer (brgemm for example), we should take desc of Reorder
to correctly collect layout and shape for further data offset calculation.
If we take a look at Eltwise Subgraph with blocked shapes, there will be RankNormalization
with 1
insertion to last dim.
We cannot take in desc of RankNormalization
- data offset calculation will be incorrect.
I think we need to discuss it offline 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed offline: decided to create the ticket for the further discussion and investigation.
Left the comment with the ticket number
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Outdated
Show resolved
Hide resolved
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Outdated
Show resolved
Hide resolved
b008e3d
to
1908740
Compare
src/plugins/intel_cpu/src/emitters/snippets/cpu_runtime_configurator.cpp
Show resolved
Hide resolved
}; | ||
|
||
class CPURuntimeConfigurator : public ov::snippets::RuntimeConfigurator { | ||
public: | ||
CPURuntimeConfigurator(); | ||
CPURuntimeConfigurator(ov::intel_cpu::MultiCacheWeakPtr cache = {}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default argument is still there, please take a look
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Outdated
Show resolved
Hide resolved
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Show resolved
Hide resolved
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp
Outdated
Show resolved
Hide resolved
src/plugins/intel_cpu/src/transformations/snippets/x64/pass/eliminate_brgemm_copy_b.cpp
Outdated
Show resolved
Hide resolved
...gins/intel_cpu/src/transformations/snippets/x64/pass/lowered/external_repacking_adjuster.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
[Snippets][CPU] Fixed build on non-x64 platforms [Snippets][CPU] Updated heuristic [Snippets][CPU] Added inplace-Transpose support [Snippets][CPU] Applied Ivan comments [Snippets][CPU] Fixed code style [Snippets][CPU] Fixed prim isa [Snippets][CPU] Small optimizations [Snippets][CPU] Fixed the build [Snippets][CPU] Applied Vladislav comments [Snippes][CPU] Created new Executors [Snippets][CPU] Applied Ivan comments 2 [Snippets][CPU] Applied Vladislav & Ivan comments 3
f104183
to
4bfdc7f
Compare
Details:
Tickets:
TODO: