The problem
Context: adding the HostIr overlap algorithm to the transformer's forward MLP layer.
Let us consider the inputs of a linear layer:
```
x [O, DID{D}, B * S / (D * O), E]
w0[DID{D}, 4 * E / D, E]
b0[DID{D}, 4 * E / D]
```
Compared to before, we added to x a new axis "O", a tile of x's batch*sequence axis, corresponding to the "Stream" parallelization.
We would like to define the output as follows:
```
linear0 = linear(x, w0, b0) [Stream{O}, DID{D}, D, B * S / (D * O), 4 * E / D]
```
where, on linear0:
- axis(1) "DID{D}" comes from w0 and b0's axis(0). This axis doesn't require any resharding to be produced.
- axis(2) "D" comes from x's axis(1), after being allgathered.
However, the shape of linear0 we currently obtain is [D, O, D, B * S / (D * O), 4 * E / D], i.e., the axes are not ordered the way we want. More precisely, the problem is that "linear" only accepts a 2D w0 and a 1D b0. Currently, this limitation is bypassed by manually handling the outermost sharded axis, but that workaround doesn't cover the case we are considering here. This is probably also related to this comment: https://github.com/NVIDIA/Fuser/blob/main/csrc/ops/composite.cpp#L174.
What is the best approach to achieve this goal? Possible solutions I can think of:
- Patch the linear op to accept more general shapes, with some restrictions, maybe requiring the user to properly broadcast the necessary dimensions (as with at::matmul).
- Reshape the tensor, collapsing all outer dimensions to make it 2D, then re-expand it afterwards (see the sketch after this list). But if I do that, I'm afraid of losing the DID parallelization.
- Manually add a "set" operation to represent allgather-ing x (also sketched below). But I'd rather not, since the distributed matmul is treated as a single resharding op by the host IR lowering that creates the pipeline algorithm (we could also change this behavior, of course...).
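To make the last two options a bit more concrete, here is a rough sketch continuing the code above (reusing x, w0, b0 and mesh; the reshape/set signatures are my assumptions about the ops API, and none of this has been verified):

```cpp
// Placeholder sizes, just so the reshape arguments are concrete.
const int64_t E = 64, B = 8, S = 32, O = 4;

// Option 2 sketch: collapse w0/b0 so that linear() sees a 2D weight and a
// 1D bias, then split the output-feature axis back afterwards. The worry is
// that merging the DID{D} axis into 4*E loses the DID parallelization.
TensorView* w0_2d = reshape(w0, {D, 4 * E / D, E}, {4 * E, E});
TensorView* b0_1d = reshape(b0, {D, 4 * E / D}, {4 * E});
TensorView* out = linear(x, w0_2d, b0_1d); // [O, D, B*S/(D*O), 4*E]
// Split the last axis back into [D, 4*E/D]; the axis order would still need
// to be fixed up to match the desired [Stream{O}, DID{D}, D, ..., 4*E/D].
TensorView* out5d = reshape(
    out,
    {O, D, B * S / (D * O), 4 * E},
    {O, D, B * S / (D * O), D, 4 * E / D});

// Option 3 sketch: make the allgather of x explicit as a set() whose output
// is no longer DID-parallelized on that axis, so the resharding becomes a
// separate op instead of being folded into the distributed matmul.
TensorView* x_allgathered = set(x);
x_allgathered->setDeviceMesh(mesh);
x_allgathered->axis(1)->parallelize(ParallelType::Serial);
```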
Wdyt?
With a Matmul
I am not sure this is completely related, but IMO it can help the discussion. I tried to emulate this problem by replacing the LinearOp with a MatmulOp. This op is more convenient since at::matmul is more flexible and accepts inputs of any dimensionality, as long as the user manually broadcasts the necessary dimensions so that the inputs match. I wrote a test along these lines.
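The exact test isn't reproduced here; roughly, it follows this kind of setup (a sketch only: the broadcast positions, the transpose, and the helper names are assumptions, not the failing test itself):

```cpp
// Sketch of the MatmulOp-based emulation: broadcast both operands so their
// leading "batch" axes line up, then let matmul contract x's last axis with
// the weight's second-to-last axis, as at::matmul would.
Fusion fusion;
FusionGuard fg(&fusion);

const int64_t D = 2; // number of devices (placeholder)

TensorView* x = makeContigTensor(3); // [DID{D}, B*S/D, E]
TensorView* w = makeContigTensor(3); // [DID{D}, 4*E/D, E]
fusion.addInput(x);
fusion.addInput(w);

const DeviceMesh mesh = DeviceMesh::createForNumDevices(D);
x->setDeviceMesh(mesh);
w->setDeviceMesh(mesh);
x->axis(0)->parallelize(ParallelType::DIDx);
w->axis(0)->parallelize(ParallelType::DIDx);

TensorView* xb = broadcast(x, {false, true, false, false}); // [D, 1, B*S/D, E]
TensorView* wb = broadcast(w, {true, false, false, false}); // [1, D, 4*E/D, E]
TensorView* wt = transpose(wb, 2, 3);                       // [1, D, E, 4*E/D]
TensorView* out = matmul(xb, wt); // expected: [D, D, B*S/D, 4*E/D]
fusion.addOutput(out);
```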
But I am getting an error:
```
C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/Fuser/csrc/expr_evaluator.cpp":438, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. When trying to propagate constant tensor sizes through the graph a conflict was found with 2 different sizes across dimensions that are expected to match.
For Producer TV: T3_l_float[bS10{1}, ideviceIdx.x11{i4}, iS12{i5}, iS13{i6}] (DeviceMesh{0 1}) id: iS12{i5} found size: 3
For Consumer TV: T4_g_float[iS14{i0}, ideviceIdx.x15{i4}, iS16{i2}, iS17{i6}, rS18{i3}] id: rS18{i3} found size: 5
With producer-consumer relationship through the expression: T4_g_float[iS14{i0}, ideviceIdx.x15{i4}, iS16{i2}, iS17{i6}, rS18{i3}]
   = matmul(T2_l_float[ideviceIdx.x6{i0}, bS7{1}, iS8{i2}, iS9{i3}] (DeviceMesh{0 1}),
            T3_l_float[bS10{1}, ideviceIdx.x11{i4}, iS12{i5}, iS13{i6}] (DeviceMesh{0 1}))
```
I feel this indicates something is broken in the logic, but I'd be happy to hear your thoughts.
@wujingyue @naoyam @cowanmeg @Priya2698
Anyhow, extending LinearOp to support more shapes makes sense. #3073 was along those lines. I'm unsure why it failed for you.
Fixing #2563 will be the ultimate fix. This way, we don't need to extend the "logical" definition of LinearOp. #3650 from @Priya2698 may have already fixed some aspects, but I wouldn't be surprised if it doesn't work out of the box for the new Stream parallel type.