
CUDA '2D' Algorithm Questions #14

Open
asglover opened this issue Jul 11, 2024 · 5 comments

@asglover

Hello team,

My name is Austin Glover, and I'm currently trying to create fast GPU implementations of the sparse contraction operation that is central to the tensor product. I just found your library and discovered that you've had a lot of good ideas about how to accelerate this operation. One of the ideas in the cuda_extension folder is sparse_accumulation_cuda_kernel2D.cu. If I understand it correctly, the idea is to have each thread compute the "outer product" over the "feature dimension" in the innermost loop. So if you had inputs shaped (batch, in1_feature_dim, in1_irrep_dim) and (batch, in2_feature_dim, in2_irrep_dim), each thread would produce one batch element of (in1_feature_dim, in2_feature_dim, output_irrep_dim). This is a clever way to increase the "arithmetic intensity" of the sparse operation. I was curious whether you found this approach successful in providing a speedup?
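For concreteness, here is a dense PyTorch sketch of the per-batch block I have in mind; the shapes, names, and the dense cg tensor below are mine, not taken from this repo:

```python
import torch

# Hypothetical shapes, purely to illustrate the "outer product over the feature dims" idea;
# none of these names come from the repo.
batch, f1, f2 = 32, 8, 8         # feature (channel) dimensions of the two inputs
m1, m2, m_out = 3, 5, 7          # irrep dimensions
x1 = torch.randn(batch, f1, m1)
x2 = torch.randn(batch, f2, m2)
cg = torch.randn(m1, m2, m_out)  # dense stand-in for the sparse CG coefficients

# Each "thread" would own one batch element and produce an (f1, f2, m_out) block:
out = torch.einsum("bui,bvj,ijk->buvk", x1, x2, cg)
print(out.shape)  # torch.Size([32, 8, 8, 7])
```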

This whole project is a really cool and clever approach to the problem!

  • Austin
@Luthaf
Contributor

Luthaf commented Jul 12, 2024

Hi,

The code in this repo has been superseded by https://github.com/lab-cosmo/mops, which contains the same operation as this repo along with a couple of others, and is where all new development is happening.

I was not personally involved in writing the CUDA code in this repo, but I'll try to find someone who understands it to answer your questions!

@nickjbrowning

nickjbrowning commented Jul 12, 2024

Hey Austin,

I wasn't involved in this project but I am pretty experienced with writing these ops, so I might be able to offer some insight. If you take a look at the function definition here:

/* Forward propagation function
* It computes the sparce accumulation assuming the last
* dimension being the active-sparse dimension.
*
* The operation can be summarized in this equation:
*
* output_tensor[isample,ifeatures,idx_output[m]] = X1[[isample,ifeatures,idx_1[m]]] * X2[[isample,ifeatures,idx_1[m]]] * multipliers[m]
*
* input:
* X1[nsample,nfeatures,m1]: First input tensor
* X2[nsample,nfeatures,m2]: Second input Tensor
* idx_output: Tensor with the indeces of the third dimension of the output_tensor
* output_size: third dimension of the output Tensor
* idx_1: Tensor with the indeces of the third dimension of X1
* idx_2: Tensor with the indeces of the third dimension of X2
* multipliers: Tensor containing the multipliers for the sparse accumulations
* output:
* output_tensor[nsample,nfeatures,output_size]: output Tensor
*/

you'll see that it takes as input:

X1[nsample,nfeatures,m1]
X2[nsample,nfeatures,m2]

and outputs:

output_tensor[nsample,nfeatures,output_size]

The code does not additionally do an outer product over the features of X1 and features of X2, which is what you've written in your question.
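In other words, the semantics are roughly the following dense Python sketch (my own reference code, not something from this repo; note that I index X2 with idx_2, per the parameter list above):

```python
import torch

def sparse_accumulation_reference(X1, X2, idx_output, output_size, idx_1, idx_2, multipliers):
    # Dense reference for the docstring above: for every sparse entry m, multiply one
    # component of X1 with one component of X2 and accumulate into one output component.
    nsample, nfeatures, _ = X1.shape
    output = torch.zeros(nsample, nfeatures, output_size, dtype=X1.dtype, device=X1.device)
    for m in range(multipliers.shape[0]):
        output[:, :, idx_output[m]] += X1[:, :, idx_1[m]] * X2[:, :, idx_2[m]] * multipliers[m]
    return output
```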

I would say that the performance of this code mostly comes from this loop:

for (int z = 0; z < nz; ++z) {
    z_output = buffer_idx_output[z];
    if (z_old != z_output) {
        output_final[z_old] = now;
        now = 0;
        z_old = z_output;
    }
    z_X1 = buffer_idx_X1[z];
    z_X2 = buffer_idx_X2[z];
    now += buffer_X1_final[z_X1]
         * buffer_X2_final[z_X2]
         * buffer_multipliers[z];
}
output_final[z_old] = now;

where it just keeps a per-thread running sum over the output index list and only writes out when the next index in the list differs from the current one. This avoids needing shared memory + atomics.
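The same trick in a short Python sketch, for a single (sample, feature) row, assuming (as the kernel does) that idx_output is ordered so that equal output indices are contiguous; this is my own illustration, not the actual kernel:

```python
import torch

def per_thread_accumulate(x1_row, x2_row, idx_1, idx_2, idx_output, multipliers, output_size):
    # Keep a running sum and write to the output only when the output index changes,
    # so no atomics are needed for a single thread's slice of the data.
    out = torch.zeros(output_size)
    now = 0.0
    z_old = int(idx_output[0])
    for z in range(len(idx_output)):
        z_out = int(idx_output[z])
        if z_out != z_old:   # output index changed: flush the running sum
            out[z_old] = now
            now = 0.0
            z_old = z_out
        now += x1_row[idx_1[z]] * x2_row[idx_2[z]] * multipliers[z]
    out[z_old] = now         # flush the final run
    return out
```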

@asglover
Author

asglover commented Jul 12, 2024

> Hey Austin,
>
> I wasn't involved in this project but I am pretty experienced with writing these ops, so I might be able to offer some insight. If you take a look at the function definition here:
>
> /* Forward propagation function
> * It computes the sparce accumulation assuming the last
> * dimension being the active-sparse dimension.
> *
> * The operation can be summarized in this equation:
> *
> * output_tensor[isample,ifeatures,idx_output[m]] = X1[[isample,ifeatures,idx_1[m]]] * X2[[isample,ifeatures,idx_1[m]]] * multipliers[m]
> *
> * input:
> * X1[nsample,nfeatures,m1]: First input tensor
> * X2[nsample,nfeatures,m2]: Second input Tensor
> * idx_output: Tensor with the indeces of the third dimension of the output_tensor
> * output_size: third dimension of the output Tensor
> * idx_1: Tensor with the indeces of the third dimension of X1
> * idx_2: Tensor with the indeces of the third dimension of X2
> * multipliers: Tensor containing the multipliers for the sparse accumulations
> * output:
> * output_tensor[nsample,nfeatures,output_size]: output Tensor
> */
>
> you'll see that it takes as input:
>
> X1[nsample,nfeatures,m1]
> X2[nsample,nfeatures,m2]
>
> and outputs:
>
> output_tensor[nsample,nfeatures,output_size]

Okay, I wasn't sure if all the nfeatures dimensions necessarily had to be the same.
Also, on line 250 I think it should be:
X2[[isample,ifeatures,idx_2[m]]]
instead of
X2[[isample,ifeatures,idx_1[m]]]

> The code does not additionally do an outer product over the features of X1 and features of X2, which is what you've written in your question.

That's interesting. So is the nfeatures dimension any different from the batch dimension (assuming you're applying the same CG contraction to every input)?
Are out_1 and out_2 in the pseudocode below equivalent?

x1 = tensor(batch, nfeatures, m1)
x2 = tensor(batch, nfeatures, m2)

out_1 = sparse_accumulate(x1, x2, ...)

x1 = x1.reshape(-1, 1, m1)
x2 = x2.reshape(-1, 1, m2)

out_2 = sparse_accumulate(x1, x2, ...)

out_2 = out_2.reshape(batch, nfeatures, -1)
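
For concreteness, here is a self-contained version of the check I have in mind, with a dense einsum standing in for the sparse op (my own sketch, not this repo's API):

```python
import torch

batch, nfeatures, m1, m2, m_out = 4, 6, 3, 5, 7
x1 = torch.randn(batch, nfeatures, m1)
x2 = torch.randn(batch, nfeatures, m2)
cg = torch.randn(m1, m2, m_out)  # dense stand-in for (idx_1, idx_2, idx_output, multipliers)

def dense_op(a, b):
    # the same contraction applied independently to every (sample, feature) pair
    return torch.einsum("sfi,sfj,ijk->sfk", a, b, cg)

out_1 = dense_op(x1, x2)
out_2 = dense_op(x1.reshape(-1, 1, m1), x2.reshape(-1, 1, m2)).reshape(batch, nfeatures, m_out)
print(torch.allclose(out_1, out_2))  # True if nfeatures really does behave like extra batch
```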

I'll try to get the repo running and confirm this, but it doesn't look like the "2D" kernel is the one that's hooked up. It looks like the "non-2D" CUDA kernel is the one that is linked, and it currently has a lot commented out. I'll try to get things running and then come back.

> I would say that the performance of this code mostly comes from this loop:
>
> for (int z = 0; z < nz; ++z) {
>     z_output = buffer_idx_output[z];
>     if (z_old != z_output) {
>         output_final[z_old] = now;
>         now = 0;
>         z_old = z_output;
>     }
>     z_X1 = buffer_idx_X1[z];
>     z_X2 = buffer_idx_X2[z];
>     now += buffer_X1_final[z_X1]
>          * buffer_X2_final[z_X2]
>          * buffer_multipliers[z];
> }
> output_final[z_old] = now;
>
> where it just keeps a per-thread running sum over the output index list and only writes out when the next index in the list differs from the current one. This avoids needing shared memory + atomics.

Yeah, I noticed that the Sparse-Accumulation-of-Products kernels in mops use shared-memory atomics every time, and I was wondering whether that was a performance-driven change or just a simpler approach. It sounds like you're saying this method, where you do as much local accumulation in the thread as possible before writing out to shared memory, is faster.

Do you know what the conclusion of this project was?

@asglover
Author

asglover commented Jul 12, 2024

Also, I take it back: I see that the 2D algorithm is referenced in setup.py, while the regular algorithm is referenced in jit.py in the cuda_extension folder. I'm assuming the 2D algorithm is the functional one, since the other file is largely commented out.

Can you recommend a PyTorch / Python version with which this builds successfully? I'm getting import errors when I try to run the tests for CPU and GPU (which may very easily be my fault).

___________________________________________________ ERROR collecting tests/test_cpp_contiguous.py ____________________________________________________
ImportError while importing test module '/global/u1/a/aglover/sparse_accumulation/tests/test_cpp_contiguous.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cpp_contiguous.py:2: in <module>
    from sparse_accumulation.clebsch_gordan import get_real_clebsch_gordan, ClebschGordan
sparse_accumulation/__init__.py:1: in <module>
    from .unified_operation import accumulate
sparse_accumulation/unified_operation.py:2: in <module>
    import sparse_accumulation_active_dim_last_cpp
E   ImportError: /global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/site-packages/sparse_accumulation_active_dim_last_cpp.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
____________________________________________________ ERROR collecting tests/test_cpp_jit_cuda.py _____________________________________________________
ImportError while importing test module '/global/u1/a/aglover/sparse_accumulation/tests/test_cpp_jit_cuda.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_cpp_jit_cuda.py:8: in <module>
    from sparse_accumulation.clebsch_gordan import ClebschGordan, get_real_clebsch_gordan
sparse_accumulation/__init__.py:1: in <module>
    from .unified_operation import accumulate
sparse_accumulation/unified_operation.py:2: in <module>
    import sparse_accumulation_active_dim_last_cpp
E   ImportError: /global/homes/a/aglover/.conda/envs/e3nn/lib/python3.9/site-packages/sparse_accumulation_active_dim_last_cpp.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs

My major software versions are:

python                    3.9.19          h0755675_0_cpython    conda-forge
pytorch                   2.3.1           cuda120_py39h17b67e0_300    conda-forge

@Luthaf
Contributor

Luthaf commented Jul 13, 2024

> Can you recommend a PyTorch / Python version with which this builds successfully?

You should use https://github.com/lab-cosmo/mops for this; the algorithm corresponding to this repo is the sparse accumulation of products. We are no longer working on this repo.
