Parallel GNN After Fusion has Problems in Lowering and Performance #91

Open · johnpzh opened this issue Mar 31, 2025 · 0 comments
Labels: bug, enhancement, performance

johnpzh (Collaborator) commented Mar 31, 2025

This issue is related to #84.

scf.forall Lowering Issue with a Scalar

One of the issues is that scf.forall only takes tensors as its shared_outs operands.

In the fused GNN kernel, when we use alphabetical order for the index labels, the loops over h and i are fused across the two statements, and the intermediate tensor T[i, h] collapses to a scalar t.

/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];

void no_fusion_index_tree_alphabet_order()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}

void fusion_index_tree_scalar()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                t += B[i, k] * C[k, h];
            }
            for (j = 0 to NJ) {
                A[i, j] += t * D[h, j];
            }
            t = 0;
        }
    }
}

The current implementation lowers both indices h and i to parallel scf.forall loops. The intermediate t is a scalar that is also an output of the parallel loop, which causes a problem during lowering.

An scf.forall example that uses tensor operands looks like this:

%32:2 = scf.forall (%arg2) in (%21) shared_outs(%arg3 = %arg0, %arg4 = %arg1) -> (tensor<?x4xf64>, tensor<4xf64>) {
  %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<?x4xf64> to tensor<1x4xf64>
  %extracted_slice_46 = tensor.extract_slice %arg4[%arg2] [1] [1] : tensor<4xf64> to tensor<1xf64>
  scf.forall.in_parallel {  /// Terminator of scf.forall
    tensor.parallel_insert_slice %extracted_slice into %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<1x4xf64> into tensor<?x4xf64>
    tensor.parallel_insert_slice %extracted_slice_46 into %arg4[%arg2] [1] [1] : tensor<1xf64> into tensor<4xf64>
  }
}

In the fused case, %arg4 would have to be an f64, but scf.forall only accepts tensors as shared_outs operands, because scf.forall is designed to let iterations write disjoint parts of a tensor in parallel.

error: 'scf.forall' op operand #2 must be variadic of ranked tensor of any type values, but got 'f64'

A possible solution is to not parallelize index h, but instead parallelize k and j; then we can use scf.parallel with a reduce operation for the scalar t.
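
To make the intent concrete, here is a rough OpenMP-style analogue of that schedule, not the actual MLIR lowering: h and i stay serial, the k loop becomes a parallel reduction into the scalar t (the role scf.parallel with an scf.reduce region would play), and the j loop is an ordinary parallel loop. The function name, sizes NI/NK/NH/NJ, and the dense row-major layouts are assumptions for illustration; B is treated as dense here, matching the pseudocode above.

#include <stddef.h>

/* Hypothetical sketch only; shapes and layouts are assumed. */
void fusion_index_tree_scalar_parallel(size_t NI, size_t NK, size_t NH, size_t NJ,
                                       const double *B, const double *C,
                                       const double *D, double *A)
{
    for (size_t h = 0; h < NH; ++h) {
        for (size_t i = 0; i < NI; ++i) {
            double t = 0.0;
            /* Parallel reduction over k into the scalar t
             * (what scf.parallel + scf.reduce would express in MLIR). */
            #pragma omp parallel for reduction(+ : t)
            for (size_t k = 0; k < NK; ++k)
                t += B[i * NK + k] * C[k * NH + h];

            /* Plain parallel loop over j; each iteration writes a distinct A[i][j]. */
            #pragma omp parallel for
            for (size_t j = 0; j < NJ; ++j)
                A[i * NJ + j] += t * D[h * NJ + j];
        }
    }
}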

Performance Issue in Parallel GNN

The other issue is that allocating private tensors might cause high overhead when each thread allocates a new private tensor in every iteration.

/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];

void no_fusion_index_tree_canonical_order()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
 
    for (i = 0 to NI) {
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}

void fusion_index_tree_tensor()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[h] += B[i, k] * C[k, h];
            }
        }
 
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[h] * D[h, j];
            }
        }
        for (h = 0 to NH) {
            T[h] = 0;
        }
    }
}

In the current implementation of fusion_index_tree_tensor(), index i is parallelized, and each iteration's fresh allocation of T adds overhead.
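
Roughly, that schedule behaves like the following OpenMP-style sketch (names and dense layouts are assumptions): every iteration of the parallel i loop allocates and frees its own private copy of T, so the allocator is hit NI times across the threads.

#include <stdlib.h>

/* Hypothetical sketch of the current schedule for fusion_index_tree_tensor():
 * i is the parallel index, so each iteration materializes a private T of size NH. */
void fusion_index_tree_tensor_parallel_i(size_t NI, size_t NK, size_t NH, size_t NJ,
                                         const double *B, const double *C,
                                         const double *D, double *A)
{
    #pragma omp parallel for
    for (size_t i = 0; i < NI; ++i) {
        double *T = calloc(NH, sizeof *T);   /* fresh private buffer every iteration */
        for (size_t k = 0; k < NK; ++k)
            for (size_t h = 0; h < NH; ++h)
                T[h] += B[i * NK + k] * C[k * NH + h];
        for (size_t h = 0; h < NH; ++h)
            for (size_t j = 0; j < NJ; ++j)
                A[i * NJ + j] += T[h] * D[h * NJ + j];
        free(T);                             /* freed again before the next iteration */
    }
}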

This allocation overhead could also happen for other parallel kernels with sparse output (e.g., SpGEMM and Triangle Counting).

In this specific GNN case, one solution could be to parallelize h and j, but not i, as sketched below.
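
A rough OpenMP-style sketch of that alternative, again with assumed names and dense layouts: i stays serial, so T is allocated once and reused; parallelizing h in the first inner nest writes disjoint T[h], and parallelizing j in the second inner nest writes disjoint A[i][j], so neither loop needs a reduction or a private copy of T.

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the suggested schedule: parallel h and parallel j, serial i. */
void fusion_index_tree_tensor_parallel_hj(size_t NI, size_t NK, size_t NH, size_t NJ,
                                          const double *B, const double *C,
                                          const double *D, double *A)
{
    double *T = calloc(NH, sizeof *T);        /* allocated once, reused across all i */
    for (size_t i = 0; i < NI; ++i) {
        for (size_t k = 0; k < NK; ++k) {
            /* Threads write disjoint T[h]. */
            #pragma omp parallel for
            for (size_t h = 0; h < NH; ++h)
                T[h] += B[i * NK + k] * C[k * NH + h];
        }
        for (size_t h = 0; h < NH; ++h) {
            /* Threads write disjoint A[i][j]. */
            #pragma omp parallel for
            for (size_t j = 0; j < NJ; ++j)
                A[i * NJ + j] += T[h] * D[h * NJ + j];
        }
        memset(T, 0, NH * sizeof *T);         /* reset T for the next i */
    }
    free(T);
}

In practice one might also hoist the parallel region out of the k loop to avoid repeated fork/join, but the key point is that no per-thread copy of T is needed.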
