Parallel GNN After Fusion has Problems in Lowering and Performance #91

Open · johnpzh opened this issue Mar 31, 2025 · 0 comments
Labels: bug, enhancement, performance

johnpzh (Collaborator) commented Mar 31, 2025

This issue is related to #84.

scf.forall Lowering Issue with a Scalar

One of the issues is that scf.forall only takes tensors as its shared_outs operands.

In the fused GNN kernel, when we use alphabetical order for the index labels, the loops over h and i are fused across the two statements, and the intermediate tensor T[i, h] collapses to a scalar t.

/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];

void no_fusion_index_tree_alphabet_order()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}

void fusion_index_tree_scalar()
{
    for (h = 0 to NH) {
        for (i = 0 to NI) {
            for (k = 0 to NK) {
                t += B[i, k] * C[k, h];
            }
            for (j = 0 to NJ) {
                A[i, j] += t * D[h, j];
            }
            t = 0;
        }
    }
}

The current implementation lowers both indices h and i to parallel scf.forall loops. The intermediate t is a scalar that is also an output of the parallel loop, which causes a problem during lowering.

An scf.forall example that uses tensor operands looks like this:

%32:2 = scf.forall (%arg2) in (%21) shared_outs(%arg3 = %arg0, %arg4 = %arg1) -> (tensor<?x4xf64>, tensor<4xf64>) {
  %extracted_slice = tensor.extract_slice %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<?x4xf64> to tensor<1x4xf64>
  %extracted_slice_46 = tensor.extract_slice %arg4[%arg2] [1] [1] : tensor<4xf64> to tensor<1xf64>
  scf.forall.in_parallel {  /// Terminator of scf.forall
    tensor.parallel_insert_slice %extracted_slice into %arg3[%arg2, 0] [1, 4] [1, 1] : tensor<1x4xf64> into tensor<?x4xf64>
    tensor.parallel_insert_slice %extracted_slice_46 into %arg4[%arg2] [1] [1] : tensor<1xf64> into tensor<4xf64>
  }
}

In the fused case, %arg4 would have to be an f64, but scf.forall only accepts tensors as shared_outs operands, because scf.forall is designed to let iterations write disjoint parts of a tensor in parallel.

error: 'scf.forall' op operand #2 must be variadic of ranked tensor of any type values, but got 'f64'

A possible solution is to not parallelize index h, but instead parallelize k and j; then we can use scf.parallel with a reduce operation for the scalar t.
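
To make the intent concrete, here is a rough OpenMP-style analogue of that schedule, not the actual MLIR lowering: h and i stay serial, the k loop becomes a parallel reduction into the scalar t (the role scf.parallel with an scf.reduce region would play), and the j loop is an ordinary parallel loop. The function name, sizes NI/NK/NH/NJ, and the dense row-major layouts are assumptions for illustration; B is treated as dense here, matching the pseudocode above.

#include <stddef.h>

/* Hypothetical sketch only; shapes and layouts are assumed. */
void fusion_index_tree_scalar_parallel(size_t NI, size_t NK, size_t NH, size_t NJ,
                                       const double *B, const double *C,
                                       const double *D, double *A)
{
    for (size_t h = 0; h < NH; ++h) {
        for (size_t i = 0; i < NI; ++i) {
            double t = 0.0;
            /* Parallel reduction over k into the scalar t
             * (what scf.parallel + scf.reduce would express in MLIR). */
            #pragma omp parallel for reduction(+ : t)
            for (size_t k = 0; k < NK; ++k)
                t += B[i * NK + k] * C[k * NH + h];

            /* Plain parallel loop over j; each iteration writes a distinct A[i][j]. */
            #pragma omp parallel for
            for (size_t j = 0; j < NJ; ++j)
                A[i * NJ + j] += t * D[h * NJ + j];
        }
    }
}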

Performance Issue in Parallel GNN

The other issue is that allocating private tensors might cause high overhead when each thread allocates a new private tensor in every iteration.

/// GNN kernel A = B * C * D
/// B is sparse
/// T[i, h] = B[i, k] * C[k, h];
/// A[i, j] = T[i, h] * D[h, j];

void no_fusion_index_tree_canonical_order()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[i, h] += B[i, k] * C[k, h];
            }
        }
    }
 
    for (i = 0 to NI) {
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[i, h] * D[h, j];
            }
        }
    }
}

void fusion_index_tree_tensor()
{
    for (i = 0 to NI) {
        for (k = 0 to NK) {
            for (h = 0 to NH) {
                T[h] += B[i, k] * C[k, h];
            }
        }
 
        for (h = 0 to NH) {
            for (j = 0 to NJ) {
                A[i, j] += T[h] * D[h, j];
            }
        }
        for (h = 0 to NH) {
            T[h] = 0;
        }
    }
}

In the current implementation of fusion_index_tree_tensor(), index i is parallelized, and each iteration's fresh allocation of T adds overhead.
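
Roughly, that schedule behaves like the following OpenMP-style sketch (names and dense layouts are assumptions): every iteration of the parallel i loop allocates and frees its own private copy of T, so the allocator is hit NI times across the threads.

#include <stdlib.h>

/* Hypothetical sketch of the current schedule for fusion_index_tree_tensor():
 * i is the parallel index, so each iteration materializes a private T of size NH. */
void fusion_index_tree_tensor_parallel_i(size_t NI, size_t NK, size_t NH, size_t NJ,
                                         const double *B, const double *C,
                                         const double *D, double *A)
{
    #pragma omp parallel for
    for (size_t i = 0; i < NI; ++i) {
        double *T = calloc(NH, sizeof *T);   /* fresh private buffer every iteration */
        for (size_t k = 0; k < NK; ++k)
            for (size_t h = 0; h < NH; ++h)
                T[h] += B[i * NK + k] * C[k * NH + h];
        for (size_t h = 0; h < NH; ++h)
            for (size_t j = 0; j < NJ; ++j)
                A[i * NJ + j] += T[h] * D[h * NJ + j];
        free(T);                             /* freed again before the next iteration */
    }
}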

This allocation overhead could also happen for other parallel kernels with sparse output (e.g., SpGEMM and Triangle Counting).

In this specific GNN case, one solution could be to parallelize h and j, but not i, as sketched below.
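
A rough OpenMP-style sketch of that alternative, again with assumed names and dense layouts: i stays serial, so T is allocated once and reused; parallelizing h in the first inner nest writes disjoint T[h], and parallelizing j in the second inner nest writes disjoint A[i][j], so neither loop needs a reduction or a private copy of T.

#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the suggested schedule: parallel h and parallel j, serial i. */
void fusion_index_tree_tensor_parallel_hj(size_t NI, size_t NK, size_t NH, size_t NJ,
                                          const double *B, const double *C,
                                          const double *D, double *A)
{
    double *T = calloc(NH, sizeof *T);        /* allocated once, reused across all i */
    for (size_t i = 0; i < NI; ++i) {
        for (size_t k = 0; k < NK; ++k) {
            /* Threads write disjoint T[h]. */
            #pragma omp parallel for
            for (size_t h = 0; h < NH; ++h)
                T[h] += B[i * NK + k] * C[k * NH + h];
        }
        for (size_t h = 0; h < NH; ++h) {
            /* Threads write disjoint A[i][j]. */
            #pragma omp parallel for
            for (size_t j = 0; j < NJ; ++j)
                A[i * NJ + j] += T[h] * D[h * NJ + j];
        }
        memset(T, 0, NH * sizeof *T);         /* reset T for the next i */
    }
    free(T);
}

In practice one might also hoist the parallel region out of the k loop to avoid repeated fork/join, but the key point is that no per-thread copy of T is needed.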
