
Not able to reproduce R1 throughput 5x against H200 #245

Closed
ghostplant opened this issue Mar 25, 2025 · 12 comments
@ghostplant

I see that AITER-accelerated DeepSeek R1 has reportedly reached 5x the throughput of H200. What are the proper settings to reproduce this?

@ghostplant ghostplant changed the title How to reproduce 5x R1 throughput against H200? How to reproduce R1 throughput 5x against H200? Mar 25, 2025
@valarLip
Collaborator

valarLip commented Mar 26, 2025

@ghostplant
Author

ghostplant commented Mar 26, 2025

That is exactly what I tried, but I only get 600 TPS with concurrency = 32.
I can reach 1K TPS by raising concurrency to 128 (still 5x slower than reported), and even that comes at the cost of response latency growing to 128 ms, rather than the 50 ms mentioned in the report.

Is there anything wrong with my evaluation?

@ghostplant ghostplant changed the title How to reproduce R1 throughput 5x against H200? Not able to reproduce R1 throughput 5x against H200 Mar 26, 2025
@valarLip
Collaborator

@andyluo7 could you please comment?

@andyluo7

@ghostplant, you should be able to reproduce it by following https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html. If not, please send details of how you ran the benchmark to [email protected].

@ghostplant
Author

What I get is 780 TPS using bsz = 32, which is far from 5000 TPS.

@sunway513
Collaborator

I was able to repro the numbers using a different setup. @ghostplant, what's your MI300X CSP?

@sunway513
Collaborator

What I get is 780 TPS using bsz = 32, which is far from 5000 TPS

Wait, are you talking about output tokens or all tokens? 5K TPS should be the latter.

@ghostplant
Author

@sunway513 Hi, yes, 780 TPS for output tokens (assuming 30 s to complete), and I also see 4000 TPS for input tokens (also assuming 30 s to complete). If counting all tokens, I think the overall TPS should be total tokens / total time (60 s), that is: (4000 * 30 + 780 * 30) / (30 + 30) = 2390 TPS.

So it is still below half of 5K.

@sunway513
Collaborator

Why do you need to calculate it yourself? If you run the demonstrated sglang command, you should find the perf metrics in the output. cc @andyluo7 to correct me if I'm wrong.

@ghostplant
Author

Because the gap with sglang's realistic speed (780 TPS) is huge, I have no idea why it prints a very large value that is not achievable in practice.

@valarLip
Collaborator

@sunway513 Hi, yes 780 for output tokens (assume 30 sec to complete), and I also see 4000 for input tokens (assume 30 sec to complete as well). If calculating all tokens, I think the overall TPS should be total tokens / total times (60 sec), which is: (4000 * 30 + 780 * 30)/(30 + 30) = 2390 TPS.

So it is still below half of 5K.

Looking into the formula sglang uses, we can see it should be (4000 * 30 + 780 * 30) / 30 = 4780 TPS:
input_throughput=total_input / dur_s,
output_throughput=sum(output_lens) / dur_s,
total_throughput=(total_input + sum(output_lens)) / dur_s,
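To make the disagreement concrete, here is a minimal sketch of those three metrics in Python. The function name and the token counts are illustrative (taken from the numbers discussed in this thread), not sglang's actual code; the key point is that `dur_s` is the single wall-clock duration of the run, not the sum of separate prefill and decode windows.

```python
# Sketch of sglang's benchmark throughput metrics (names illustrative).
# dur_s is the elapsed wall-clock time of the whole run: prefill and
# decode happen within the same window, so the duration is NOT summed.
def throughputs(total_input: int, output_lens: list[int], dur_s: float) -> dict:
    return {
        "input_throughput": total_input / dur_s,
        "output_throughput": sum(output_lens) / dur_s,
        "total_throughput": (total_input + sum(output_lens)) / dur_s,
    }

# Thread's numbers: 4000 input TPS and 780 output TPS over the same 30 s run.
m = throughputs(total_input=4000 * 30, output_lens=[780 * 30], dur_s=30)
print(m["total_throughput"])  # 4780.0, not (4000*30 + 780*30) / 60 = 2390
```

Dividing by 60 s (as in the earlier comment) double-counts the duration, which is why the hand calculation came out at half of sglang's reported value.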

@ghostplant
Author

ghostplant commented Mar 30, 2025

@sunway513 Hi, yes 780 for output tokens (assume 30 sec to complete), and I also see 4000 for input tokens (assume 30 sec to complete as well). If calculating all tokens, I think the overall TPS should be total tokens / total times (60 sec), which is: (4000 * 30 + 780 * 30)/(30 + 30) = 2390 TPS.
So it is still below half of 5K.

look into the formula sglang used, we can see it should be (4000 * 30 + 780 * 30)/(30) = 4780 TPS.
input_throughput=total_input / dur_s,
output_throughput=sum(output_lens) / dur_s,
total_throughput=(total_input + sum(output_lens)) / dur_s,

OK, I can use a large prefill size and a small decode size to get above 5K TPS. But a realistic thinking task requires a small prefill size with a large decode size, in which case I only get 0.8K. That is not the case for the benchmark arguments you provided, which focus on maximizing the numbers to report. Thanks!
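The workload-mix sensitivity described above can be illustrated with a toy calculation. All numbers here are hypothetical, chosen only to show how total token throughput shifts with the prefill/decode ratio:

```python
# Illustration (hypothetical numbers): total token throughput counts
# prefill (input) tokens too, so a prefill-heavy benchmark mix can
# report a much higher total TPS than a decode-heavy "thinking" workload.
def total_tps(input_tokens: int, output_tokens: int, dur_s: float) -> float:
    return (input_tokens + output_tokens) / dur_s

# Prefill-heavy mix: many input tokens inflate the total figure.
print(total_tps(input_tokens=120_000, output_tokens=23_400, dur_s=30))  # 4780.0

# Decode-heavy mix: total TPS approaches the (much lower) decode rate.
print(total_tps(input_tokens=3_000, output_tokens=24_000, dur_s=30))    # 900.0
```

Same hardware, same formula; only the ratio of input to output tokens changes, which is the crux of the disagreement in this thread.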
