This is a spin-off from #3458 (comment).
For a multi-GPU fusion to take a sharded input tensor, the allocation domain has to be a split of logical. For example,
logical: iM, iN
allocation: iDIDx{D}, iM/D, iN
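As a rough, hedged sketch of what that layout could look like in the C++ API (mirroring the test below): the concrete sizes, the outer-split flag, and parallelizing the outer IterDomain with ParallelType::DIDx are my assumptions for illustration, not code from this issue.

// Sketch only (assumed sizes/flags): shard the M axis of an [M, N] input
// across a mesh of D devices, inside a fusion definition like the test below.
constexpr int64_t M = 8, N = 16, D = 2;
TensorView* in = makeContigConcreteTensor({M, N});   // logical: iM, iN
auto [outer, inner] = IterDomain::split(
    in->axis(0),
    IrBuilder::create<Val>(D, DataType::Index),
    /*inner_split=*/false);                 // outer split: factor D becomes the outer extent
outer->parallelize(ParallelType::DIDx);     // outer prints as iDIDx{D}
in->setAllocationDomain({outer, inner, in->axis(1)}, true);  // iDIDx{D}, iM/D, iN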
The loop domain, ideally, shouldn't be set or used because a fusion/segment input comes from outside and is not generated by a loop. Below is an ideal usage, which I also committed to https://github.com/NVIDIA/Fuser/tree/bug3479.
TEST_F(AllocationDomainTest, InputAllocationIsSplit_Concrete) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  TensorView* in = makeContigConcreteTensor({6});
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  // Make the input's allocation domain a split of its logical domain:
  // {6} -> {3, 2}. The loop domain is intentionally left untouched.
  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(2, DataType::Index), true);
  in->setAllocationDomain({outer, inner}, true);

  FusionExecutorCache executor_cache(std::move(fusion));
  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
  at::Tensor in_tensor = at::randn({6}, options);
  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});
  testValidate(
      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
}
This repro currently fails with the following error:
C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/ir/utils.cpp":965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. dom0 has unreachable IDs. dom0: iS7{6}. dom1:
Code around Fuser/csrc/transform_replay.cpp lines 760 to 763 (at commit 9c9c34c) tries to propagate allocation from the producer to the consumer. However, the BestEffortReplay created there fails to map the producer's allocation domain. As a result, the check at line 776,

if (auto it = p2c_map.find(id); it != p2c_map.end()) {

fails to add anything to the consumer's allocation domain, leaving an empty allocation domain that doesn't post-dominate logical.
While I've worked around this problem by setting the loop domain to be the same as the allocation domain, @naoyam and I discussed some potential solutions in the original thread. One line of thought is improving replayCasP; another is propagating the allocation domain within a kernel through a different mechanism, since this is currently only needed for matmul. cc @zasdfgbnm
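For reference, a minimal sketch of that workaround applied to the test above; setLoopDomain is my assumption for the API that overwrites the loop domain, not code from this issue.

// Workaround sketch (assumed API): after setAllocationDomain in the test above,
// also make the loop domain match the allocation domain so the replay has
// something it can map.
in->setAllocationDomain({outer, inner}, true);
in->setLoopDomain({outer, inner});  // loop == allocation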