
Allow allocation to be a split and different from loop. #3479

Open
wujingyue opened this issue Nov 26, 2024 · 0 comments
wujingyue commented Nov 26, 2024

This is a spin-off from #3458 (comment).

For a multi-GPU fusion to take a sharded input tensor, the allocation domain has to be a split of the logical domain. For example,

logical: iM, iN
allocation: iDIDx{D}, iM/D, iN
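
A minimal sketch of how such a sharded input could be set up, assuming the same nvFuser APIs used in the repro below (IterDomain::split, setAllocationDomain) plus DIDx parallelization; the device-mesh setup is elided and D is a placeholder for the number of devices:

  // Sketch only: split the logical iM by the device count D and shard
  // the outer extent across devices. D and the mesh setup are placeholders.
  TensorView* in = makeContigTensor(2);  // logical: iM, iN
  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(D, DataType::Index), true);
  outer->parallelize(ParallelType::DIDx);  // outer becomes iDIDx{D}
  in->setAllocationDomain({outer, inner, in->axis(1)}, true);  // iDIDx{D}, iM/D, iN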

Ideally, the loop domain shouldn't be set or used, because a fusion/segment input comes from outside and isn't generated by a loop. Below is the ideal usage, which I've also committed to https://github.com/NVIDIA/Fuser/tree/bug3479.

TEST_F(AllocationDomainTest, InputAllocationIsSplit_Concrete) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  TensorView* in = makeContigConcreteTensor({6});
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  // Split the logical domain and use the result as the allocation domain,
  // leaving the loop domain untouched.
  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(2, DataType::Index), true);
  in->setAllocationDomain({outer, inner}, true);

  FusionExecutorCache executor_cache(std::move(fusion));
  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
  at::Tensor in_tensor = at::randn({6}, options);
  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});

  testValidate(
      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
}

This repro currently fails with the following error:

C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/ir/utils.cpp":965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. dom0 has unreachable IDs. dom0: iS7{6}. dom1:

The code around

  auto replay_CasP = BestEffortReplay(
      new_IDs,
      producer->getLoopDomain(),
      logical_map.mapProducerToConsumer(producer->domain(), replayed));

tries to propagate the allocation domain from the producer to the consumer. However, the BestEffortReplay it creates fails to map the producer's allocation domain. As a result,

  if (auto it = p2c_map.find(id); it != p2c_map.end()) {

adds nothing to the consumer's allocation domain, leaving an empty allocation domain that doesn't post-dominate the logical domain.

While I've worked around this problem by setting the loop domain to be the same as the allocation domain (see the sketch below), @naoyam and I discussed some potential solutions in the original thread. One line of thought is to improve replayCasP; another is to propagate the allocation domain within a kernel through a different mechanism, since this is currently only needed for matmul. cc @zasdfgbnm
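
For reference, a minimal sketch of that workaround on the same repro, assuming TensorView::split and getLoopDomain behave as in the current API; the split factor 2 matches the test above:

  // Workaround sketch: split the loop domain first, then alias the
  // allocation domain to it, so loop == allocation.
  TensorView* in = makeContigConcreteTensor({6});
  in->split(0, 2);  // loop domain: iS{6} -> iS{3}, iS{2}
  in->setAllocationDomain(in->getLoopDomain(), true);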

wujingyue added a commit that referenced this issue on Nov 26, 2024

wujingyue changed the title from "Allocation can't be a split unless it's same as loop." to "Allow allocation to be a split and different from loop." on Nov 27, 2024