
Allow allocation to be a split and different from loop. #3479

Open
wujingyue opened this issue Nov 26, 2024 · 0 comments
wujingyue commented Nov 26, 2024

This is a spin-off from #3458 (comment).

For a multi-GPU fusion to take a sharded input tensor, the allocation domain has to be a split of the logical domain. For example,

logical: iM, iN
allocation: iDIDx{D}, iM/D, iN
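
A minimal sketch of how such a sharded input could be set up, assuming the same nvFuser APIs used in the repro below (IterDomain::split, setAllocationDomain) plus DIDx parallelization; the device-mesh setup is elided and D is a placeholder for the number of devices:

  // Sketch only: split the logical iM by the device count D and shard
  // the outer extent across devices. D and the mesh setup are placeholders.
  TensorView* in = makeContigTensor(2);  // logical: iM, iN
  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(D, DataType::Index), true);
  outer->parallelize(ParallelType::DIDx);  // outer becomes iDIDx{D}
  in->setAllocationDomain({outer, inner, in->axis(1)}, true);  // iDIDx{D}, iM/D, iN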

Ideally, the loop domain shouldn't be set or used, because a fusion/segment input comes from outside and isn't generated by a loop. Below is the ideal usage, which I've also committed to https://github.com/NVIDIA/Fuser/tree/bug3479.

TEST_F(AllocationDomainTest, InputAllocationIsSplit_Concrete) {
  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  TensorView* in = makeContigConcreteTensor({6});
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  // Split the logical domain and use the result as the allocation domain,
  // leaving the loop domain untouched.
  auto [outer, inner] = IterDomain::split(
      in->axis(0), IrBuilder::create<Val>(2, DataType::Index), true);
  in->setAllocationDomain({outer, inner}, true);

  FusionExecutorCache executor_cache(std::move(fusion));
  auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
  at::Tensor in_tensor = at::randn({6}, options);
  auto out_tensors = executor_cache.runFusionWithInputs({in_tensor});

  testValidate(
      executor_cache.fusion(), out_tensors, {in_tensor}, __LINE__, __FILE__);
}

This repro currently fails with the following error:

C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/ir/utils.cpp":965, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. dom0 has unreachable IDs. dom0: iS7{6}. dom1:

The code around

  auto replay_CasP = BestEffortReplay(
      new_IDs,
      producer->getLoopDomain(),
      logical_map.mapProducerToConsumer(producer->domain(), replayed));

tries to propagate the allocation domain from the producer to the consumer. However, the BestEffortReplay it creates fails to map the producer's allocation domain. As a result,

  if (auto it = p2c_map.find(id); it != p2c_map.end()) {

adds nothing to the consumer's allocation domain, leaving an empty allocation domain that doesn't post-dominate the logical domain.

While I've worked around this problem by setting the loop domain to be the same as the allocation domain (see the sketch below), @naoyam and I discussed some potential solutions in the original thread. One line of thought is to improve replayCasP; another is to propagate the allocation domain within a kernel through a different mechanism, since this is currently only needed for matmul. cc @zasdfgbnm
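
For reference, a minimal sketch of that workaround on the same repro, assuming TensorView::split and getLoopDomain behave as in the current API; the split factor 2 matches the test above:

  // Workaround sketch: split the loop domain first, then alias the
  // allocation domain to it, so loop == allocation.
  TensorView* in = makeContigConcreteTensor({6});
  in->split(0, 2);  // loop domain: iS{6} -> iS{3}, iS{2}
  in->setAllocationDomain(in->getLoopDomain(), true);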

wujingyue added a commit that referenced this issue on Nov 26, 2024

wujingyue changed the title from "Allocation can't be a split unless it's same as loop." to "Allow allocation to be a split and different from loop." on Nov 27, 2024