
Core: cuda backend improvements #145

Closed · GPMueller opened this issue Mar 10, 2017 · 6 comments

GPMueller (Member) commented Mar 10, 2017

The speed on Maxwell and Pascal is roughly equal, which should not be the case. It seems that on Pascal some quite trivial kernels take far longer than they should.
A possible cause is that on Pascal a managed (unified) memory object is not migrated to the device as a whole when it is first accessed, but faulted in page by page. This can cause a significant slowdown for incremental accesses, i.e. in such cases arrays should be pre-fetched by calling cudaMemPrefetchAsync, though cudaMemAdvise hints might already suffice (see the sketch after the list below). Kernels where this seems to happen:

  • cu_project_tangential
  • cu_sum
  • cu_set_c_a2
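
A minimal sketch of what such a prefetch could look like, assuming the arrays are allocated with cudaMallocManaged; the kernel body, its signature and the helper name `set_c_a2_prefetched` are placeholders, not the actual code:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for something like cu_set_c_a2:
// out[i] = c * a[i] for every spin.
__global__ void cu_set_c_a2(double c, const double3 *a, double3 *out, size_t n)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
    if( idx < n )
        out[idx] = { c * a[idx].x, c * a[idx].y, c * a[idx].z };
}

void set_c_a2_prefetched(double c, const double3 *a, double3 *out, size_t n)
{
    int device = 0;
    cudaGetDevice(&device);

    // Tell the driver where the data will mostly live ...
    cudaMemAdvise(a,   n * sizeof(double3), cudaMemAdviseSetPreferredLocation, device);
    cudaMemAdvise(out, n * sizeof(double3), cudaMemAdviseSetPreferredLocation, device);

    // ... and migrate it up front instead of relying on per-page faults.
    cudaMemPrefetchAsync(a,   n * sizeof(double3), device);
    cudaMemPrefetchAsync(out, n * sizeof(double3), device);

    unsigned int blocks = static_cast<unsigned int>((n + 1023) / 1024);
    cu_set_c_a2<<<blocks, 1024>>>(c, a, out, n);
}
```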
GPMueller (Member Author):

It seems that cudaMemPrefetchAsync has a negative impact on performance... Strangely, it seems to significantly decrease the number of iterations after which the IPS (iterations per second) drops.

Another idea is the following: cudaMallocManaged is called 10 times per iteration, which may indicate an unnecessary copy (a missing reference? an `=` where it doesn't belong?); see the sketch below.
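
Purely to illustrate the suspicion (none of these names are taken from the actual code): if the spin fields are std::vectors backed by a cudaMallocManaged allocator, then a single pass-by-value is enough to trigger a fresh managed allocation on every iteration.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

struct Vector3 { double x, y, z; };

// Minimal managed allocator: every copy of a vectorfield goes through here.
template <typename T>
struct managed_allocator
{
    using value_type = T;
    T *allocate(std::size_t n)
    {
        void *ptr = nullptr;
        cudaMallocManaged(&ptr, n * sizeof(T)); // one of these per unintended copy
        return static_cast<T *>(ptr);
    }
    void deallocate(T *p, std::size_t) { cudaFree(p); }
    bool operator==(const managed_allocator &) const { return true; }
    bool operator!=(const managed_allocator &) const { return false; }
};

using vectorfield = std::vector<Vector3, managed_allocator<Vector3>>;

// Pass-by-value: the whole field is copied, i.e. cudaMallocManaged runs again
// on every call.
void gradient_by_value(vectorfield spins) { /* ... */ }

// Pass-by-reference: the existing managed allocation is reused.
void gradient_by_reference(const vectorfield &spins) { /* ... */ }
```

The same thing would happen with an assignment `a = b` between two such fields where a reference was intended.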

GPMueller (Member Author):

It seems that the removal of several cudaMallocManaged calls had a significant impact: 23cd093

GPMueller (Member Author):

The removal of systems[0]->UpdateEnergy(); from Method_LLG::Hook_Post_Iteration() improves performance by another factor of ~1.7, but it is unclear to me why.

GPMueller (Member Author) commented Mar 15, 2017

The answer seems to be that the gradient and energy calculations for Exchange and DMI are all very costly due to the use of atomics.

A new scheme for the Hamiltonian is needed to make it better suited to this kind of parallelisation (see the sketch below).

Related to #101 and #146
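
For illustration only, with assumed names and a much-simplified exchange term: a pair-based "scatter" kernel must use atomicAdd because two threads may update the same gradient entry, whereas a "gather" scheme with one spin per thread needs no atomics at all.

```cuda
#include <cuda_runtime.h>

struct Pair { int i, j; };

// Scatter over pairs: colliding updates must be serialized via atomicAdd.
// (atomicAdd on double also requires compute capability 6.0+, i.e. Pascal.)
__global__ void exchange_gradient_scatter(const Pair *pairs, int n_pairs,
                                          double J, const double3 *spins,
                                          double3 *gradient)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if( p >= n_pairs ) return;
    int i = pairs[p].i, j = pairs[p].j;
    atomicAdd(&gradient[i].x, -J * spins[j].x);
    atomicAdd(&gradient[i].y, -J * spins[j].y);
    atomicAdd(&gradient[i].z, -J * spins[j].z);
    atomicAdd(&gradient[j].x, -J * spins[i].x);
    atomicAdd(&gradient[j].y, -J * spins[i].y);
    atomicAdd(&gradient[j].z, -J * spins[i].z);
}

// Gather per spin: each thread owns gradient[i], so plain writes suffice.
__global__ void exchange_gradient_gather(const int *neighbours, int n_neigh,
                                         int n_spins, double J,
                                         const double3 *spins, double3 *gradient)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i >= n_spins ) return;
    double3 g = { 0.0, 0.0, 0.0 };
    for( int k = 0; k < n_neigh; ++k )
    {
        int j = neighbours[i * n_neigh + k];
        g.x -= J * spins[j].x;
        g.y -= J * spins[j].y;
        g.z -= J * spins[j].z;
    }
    gradient[i] = g;
}
```

The gather version visits each pair twice, but it avoids the serialization of colliding atomics, which is presumably the kind of trade-off a redesigned Hamiltonian scheme can exploit.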

GPMueller (Member Author):

New schemes for the Hamiltonians have been implemented (8569ae5). #222 now tracks the implementation of the CUDA versions of these.

GPMueller (Member Author):

Further performance improvements will, at my level of expertise, have to be algorithmic (see e.g. #311) rather than come from better CUDA code. I am therefore closing this issue.
