Lowering Issue with Scalar
One of the issues is that scf.forall only takes tensors as operands.
In the fused GNN kernel, when we use alphabetical order for the index labels, the two loop nests are fused along indices h and i, so the intermediate tensor T degenerates to a scalar t.
/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];
void no_fusion_index_tree_alphabet_order()
{
  for (h = 0 to NH) {
    for (i = 0 to NI) {
      for (k = 0 to NK) {
        T[i, h] += B[i, k] * C[k, h];
      }
    }
  }
  for (h = 0 to NH) {
    for (i = 0 to NI) {
      for (j = 0 to NJ) {
        A[i, j] += T[i, h] * D[h, j];
      }
    }
  }
}
void fusion_index_tree_scalar()
{
  for (h = 0 to NH) {
    for (i = 0 to NI) {
      for (k = 0 to NK) {
        t += B[i, k] * C[k, h];
      }
      for (j = 0 to NJ) {
        A[i, j] += t * D[h, j];
      }
      t = 0;
    }
  }
}
The current implementation lowers both indices h and i to parallel scf.forall loops. The intermediate t is a scalar and also an output of the parallel loop, which causes a problem during lowering.
An scf.forall example using tensor operands looks like this.
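Here is a minimal sketch, assuming 8x16 dense tensors %inp and %init; the shapes, value names, and loop body are illustrative and not the actual generated IR. Each iteration writes its own slice of the shared tensor output through the scf.forall.in_parallel terminator.

%c8 = arith.constant 8 : index
%res = scf.forall (%i) in (%c8) shared_outs(%out = %init) -> (tensor<8x16xf64>) {
  // Read one row of an input tensor and compute a new row from it.
  %row = tensor.extract_slice %inp[%i, 0] [1, 16] [1, 1]
      : tensor<8x16xf64> to tensor<1x16xf64>
  %new = arith.addf %row, %row : tensor<1x16xf64>
  // The terminator writes each iteration's result into a disjoint slice of %out.
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %new into %out[%i, 0] [1, 16] [1, 1]
        : tensor<1x16xf64> into tensor<8x16xf64>
  }
}

The shared_outs operands are exactly the "variadic of ranked tensor" operands that the error below refers to, so a scalar f64 such as t cannot be passed there.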
In the fused case, %arg4 would have to be an f64, but scf.forall only takes tensors as operands. This is because scf.forall is designed to let iterations write disjoint parts of a tensor in parallel.
error: 'scf.forall' op operand #2 must be variadic of ranked tensor of any type values, but got 'f64'
A possible solution is to not parallelize index h, but to parallelize k and j instead; then we can use scf.parallel with a reduce operation for the scalar t.
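As a rough illustration of that direction, the k reduction for t could be expressed with scf.parallel plus scf.reduce. This is only a sketch: it assumes dense memrefs %B and %C, pre-defined index constants %c0, %c1 and bound %nk, and the current values of %i and %h from enclosing sequential loops; B's sparsity is ignored for brevity, and the exact scf.reduce syntax varies across MLIR versions (the older scf.reduce.return form is used here).

// t = sum over k of B[i, k] * C[k, h], with k parallel and a scalar reduction.
%zero = arith.constant 0.0 : f64
%t = scf.parallel (%k) = (%c0) to (%nk) step (%c1) init (%zero) -> f64 {
  %b = memref.load %B[%i, %k] : memref<?x?xf64>
  %c = memref.load %C[%k, %h] : memref<?x?xf64>
  %prod = arith.mulf %b, %c : f64
  scf.reduce(%prod) : f64 {
  ^bb0(%lhs: f64, %rhs: f64):
    %sum = arith.addf %lhs, %rhs : f64
    scf.reduce.return %sum : f64
  }
}

The j loop that updates A[i, j] can then be a separate parallel loop, since each j writes a distinct element of A.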
Performance Issue in Parallel GNN
The other issue is that allocating private tensors may introduce high overhead, because each thread allocates a new private tensor in every iteration.
/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];
void no_fusion_index_tree_canonical_order()
{
  for (i = 0 to NI) {
    for (k = 0 to NK) {
      for (h = 0 to NH) {
        T[i, h] += B[i, k] * C[k, h];
      }
    }
  }
  for (i = 0 to NI) {
    for (h = 0 to NH) {
      for (j = 0 to NJ) {
        A[i, j] += T[i, h] * D[h, j];
      }
    }
  }
}
void fusion_index_tree_tensor()
{
  for (i = 0 to NI) {
    for (k = 0 to NK) {
      for (h = 0 to NH) {
        T[h] += B[i, k] * C[k, h];
      }
    }
    for (h = 0 to NH) {
      for (j = 0 to NJ) {
        A[i, j] += T[h] * D[h, j];
      }
    }
    for (h = 0 to NH) {
      T[h] = 0;
    }
  }
}
In the current implementation of fusion_index_tree_tensor(), index i is parallelized, and each iteration's fresh allocation of T adds overhead.
This allocation overhead could also happen for other parallel kernels with sparse output (e.g., SpGEMM and Triangle Counting).
In this specific GNN case, one solution could be to parallelize h and j but keep i sequential, so that T is allocated once and reused across iterations of i (see the sketch below).
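A minimal sketch of that alternative, again assuming dense memrefs and ignoring B's sparsity (the value names and exact loop structure are illustrative, not the actual lowering): T is allocated once outside the sequential i loop, the h and j loops are parallel, and no cross-thread reduction is needed because each h writes a distinct T[h] and each (i, j) writes a distinct A[i, j].

// Assumes %A, %B, %C, %D : memref<?x?xf64> and %c0, %c1, %ni, %nk, %nh, %nj : index.
%zero = arith.constant 0.0 : f64
%T = memref.alloc(%nh) : memref<?xf64>        // allocated once, reused for every i
scf.for %i = %c0 to %ni step %c1 {
  // Parallel over h: T[h] = sum over k of B[i, k] * C[k, h].
  scf.parallel (%h) = (%c0) to (%nh) step (%c1) {
    %t = scf.for %k = %c0 to %nk step %c1 iter_args(%acc = %zero) -> (f64) {
      %b = memref.load %B[%i, %k] : memref<?x?xf64>
      %c = memref.load %C[%k, %h] : memref<?x?xf64>
      %m = arith.mulf %b, %c : f64
      %s = arith.addf %acc, %m : f64
      scf.yield %s : f64
    }
    memref.store %t, %T[%h] : memref<?xf64>
  }
  // Parallel over j: A[i, j] += sum over h of T[h] * D[h, j].
  scf.parallel (%j) = (%c0) to (%nj) step (%c1) {
    %a = scf.for %hh = %c0 to %nh step %c1 iter_args(%acc2 = %zero) -> (f64) {
      %tv = memref.load %T[%hh] : memref<?xf64>
      %d = memref.load %D[%hh, %j] : memref<?x?xf64>
      %m2 = arith.mulf %tv, %d : f64
      %s2 = arith.addf %acc2, %m2 : f64
      scf.yield %s2 : f64
    }
    %a0 = memref.load %A[%i, %j] : memref<?x?xf64>
    %a1 = arith.addf %a0, %a : f64
    memref.store %a1, %A[%i, %j] : memref<?x?xf64>
  }
}
memref.dealloc %T : memref<?xf64>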
This is related to issue #84.