The CUDA C programming guide mentions on page 13:

> This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.
However, in the code for agent scan there is this comment:
https://github.com/NVIDIA/cub/blob/5d12837f92ee12016827ad6f1ccbbc963eb428ff/cub/agent/agent_scan.cuh#L408-L412
Does this mean that scan is using undefined behavior here, or have I missed some specification of CUDA?