what's the purpose of CUB_SUBSCRIPTION_FACTOR #1060

zhaolianshuizls · 2023-08-16T08:57:05Z


I'm really confused about this CUB_SUBSCRIPTION_FACTOR (https://github.com/NVIDIA/cub/blob/b2e8bccb8c0cd15279974fe4b9b8d6fcd1842b57/cub/device/dispatch/dispatch_reduce.cuh#L753).

Can you explain to me why max_blocks needs it?


gevtushenko · 2023-08-16T09:40:45Z

Maintainer

The work / CTA partitioning in our reduce is static, meaning it's a function of problem size and architecture (no work stealing). The subscription factor is intended to improve load balancing. Imagine a GPU with 2 SMs, each holding only one CTA at a time. If we launch only 2 CTAs and one of them gets all the simple work, it'll finish early, leaving the GPU underutilized. On the other hand, if we had another CTA available, it'd replace the finished one.
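To make the role of the factor concrete, here is a minimal host-side sketch, not CUB's actual code. It assumes the cap on launched CTAs is the number of CTAs the device can hold at once multiplied by the subscription factor. The helper names `max_blocks` and `grid_size` and the parameter `subscription_factor` are hypothetical, and the factor's value is passed in rather than assumed.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: cap on the number of CTAs for a statically
// partitioned kernel. `subscription_factor` stands in for
// CUB_SUBSCRIPTION_FACTOR; a factor of 1 launches exactly as many CTAs
// as the device can hold at once.
inline int max_blocks(int sm_count, int ctas_per_sm, int subscription_factor) {
  return sm_count * ctas_per_sm * subscription_factor;
}

// Hypothetical helper: number of CTAs actually launched. There is no point
// in launching more CTAs than there are tiles of input.
inline int grid_size(std::int64_t num_items, int tile_size, int block_cap) {
  const std::int64_t num_tiles = (num_items + tile_size - 1) / tile_size;
  return static_cast<int>(std::min<std::int64_t>(num_tiles, block_cap));
}
```

For the 2-SM GPU in the example, `max_blocks(2, 1, 1)` gives 2 CTAs with nothing queued behind them, while a factor of 5 gives 10, so a finished CTA is replaced by a waiting one.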

4 replies

zhaolianshuizls Aug 17, 2023
Author

@senior-zero Thanks for your clarification. In your example, can you help me with a few questions?

> Imagine a GPU with 2 SMs, each holding only one CTA at a time. If we launch only 2 CTAs and one of them gets all the simple work, it'll finish early, leaving the GPU underutilized.

I assume in your example, each SM can only execute one CTA at a time, and launching 2 CTAs is just a single cuda kernel launch with 2 thread blocks. So here are my questions.

1. Since we launch 2 thread blocks and each SM can only have one block run at a time, shouldn't these two thread blocks be distributed to both SMs, one thread block each?
2. Based on my first question, each thread block gets the same amount of work, how come it will finish early?

gevtushenko Aug 17, 2023
Maintainer

> shouldn't these two thread blocks be distributed to both SMs, one thread block each?

Yes, in the example above the thread blocks are distributed to both SMs, one each.

> each thread block gets the same amount of work, how come it will finish early?

Each thread block gets the same number of items to process, but the amount of work depends on the user operator and the item type. For instance, if you use a min operator on a complex type, there's an expensive comparison branch and a cheap one, and the branch is selected based on the item values. If a tile of items falls entirely into the cheap branch, the thread block processing it might finish significantly earlier than the others.
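The effect can be illustrated with a toy scheduling model in plain host C++. This is a sketch under stated assumptions, not a description of the real hardware scheduler: each SM runs one CTA at a time, the next CTA in launch order goes to whichever SM frees up first, and items are split into equal contiguous chunks. The helpers `cta_costs`, `makespan`, and `skewed_input`, and the per-item costs of 1 and 3, are made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Total cost of each CTA when the items are split statically into
// `num_ctas` contiguous, equal-sized chunks (same item count, not same work).
inline std::vector<std::int64_t> cta_costs(const std::vector<int>& item_cost,
                                           int num_ctas) {
  std::vector<std::int64_t> costs(num_ctas, 0);
  const std::size_t chunk = (item_cost.size() + num_ctas - 1) / num_ctas;
  for (std::size_t i = 0; i < item_cost.size(); ++i) {
    costs[i / chunk] += item_cost[i];
  }
  return costs;
}

// Time until the last CTA finishes, assuming one resident CTA per SM and
// that the next CTA in launch order starts on the SM that becomes free first.
inline std::int64_t makespan(const std::vector<std::int64_t>& costs,
                             int num_sms) {
  std::vector<std::int64_t> busy_until(num_sms, 0);
  for (std::int64_t c : costs) {
    auto next_free = std::min_element(busy_until.begin(), busy_until.end());
    *next_free += c;
  }
  return *std::max_element(busy_until.begin(), busy_until.end());
}

// Toy input: 500 cheap items (cost 1) followed by 500 expensive ones (cost 3).
inline std::vector<int> skewed_input() {
  std::vector<int> cost(1000, 1);
  std::fill(cost.begin() + 500, cost.end(), 3);
  return cost;
}
```

In this model, 2 CTAs on 2 SMs take 1500 time units because the SM that got the cheap half sits idle after 500. Splitting the same input into 10 CTAs brings that down to 1100, against a perfectly balanced 1000.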

zhaolianshuizls Aug 17, 2023
Author

@senior-zero The scenario you describe seems uncommon. Usually a kernel operates on the same type of input, right?

gevtushenko Aug 17, 2023
Maintainer

The input type is the same, the input values are not:

template <>
__device__ inline bool less_t::operator()(const complex &lhs, const complex &rhs) {
  double magnitude_0 = cuda::std::abs(lhs);
  double magnitude_1 = cuda::std::abs(rhs);

  // ...

  const complex::value_type difference = cuda::std::abs(magnitude_0 - magnitude_1);
  const complex::value_type threshold = cuda::std::numeric_limits<complex::value_type>::epsilon() * 2;

  if (difference < threshold) { //              <---- expensive branch
    const complex::value_type phase_angle_0 = cuda::std::arg(lhs);
    const complex::value_type phase_angle_1 = cuda::std::arg(rhs);

    return phase_angle_0 < phase_angle_1;
  } else { //                                   <---- cheap branch
    return magnitude_0 < magnitude_1;
  }
}
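For experimenting without a GPU, a host-side stand-in for this comparator might look like the sketch below. It is an approximation, not the code above: it uses `std::complex<double>` in place of the `complex` type, leaves out the elided part entirely, and adds a hypothetical `took_expensive` flag so the branch taken can be observed.

```cpp
#include <cmath>
#include <complex>
#include <limits>

// Hypothetical result type: the comparison outcome plus which branch ran.
struct compare_result {
  bool less;            // outcome of the comparison
  bool took_expensive;  // true if the phase-angle branch was taken
};

// Host-side stand-in for the comparator: compare by magnitude, and fall
// back to the phase angle only when the magnitudes are (nearly) equal.
inline compare_result less_with_trace(const std::complex<double>& lhs,
                                      const std::complex<double>& rhs) {
  const double magnitude_0 = std::abs(lhs);
  const double magnitude_1 = std::abs(rhs);
  const double difference = std::abs(magnitude_0 - magnitude_1);
  const double threshold = std::numeric_limits<double>::epsilon() * 2;

  if (difference < threshold) {  // expensive branch: two std::arg calls
    return {std::arg(lhs) < std::arg(rhs), true};
  }
  return {magnitude_0 < magnitude_1, false};  // cheap branch
}
```

Comparing `1 + 0i` with `0 + 1i` takes the expensive branch because both have magnitude 1, while comparing `1 + 0i` with `2 + 0i` takes the cheap one. A tile full of distinct magnitudes would therefore cost less than a tile of values lying on the same circle.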
