
Not able to reproduce R1 throughput 5x against H200 #245

Closed
ghostplant opened this issue Mar 25, 2025 · 12 comments
@ghostplant

I see that AITER-accelerated DeepSeek R1 has reportedly reached 5x the throughput of H200. What are the proper settings to reproduce this?

@ghostplant ghostplant changed the title How to reproduce 5x R1 throughput against H200? How to reproduce R1 throughput 5x against H200? Mar 25, 2025
@valarLip
Collaborator

valarLip commented Mar 26, 2025

@ghostplant
Author

ghostplant commented Mar 26, 2025

That is exactly what I tried, but I only get 600 TPS with concurrency = 32.
I can reach 1K TPS by raising concurrency to 128 (still 5x slower than reported), and even that comes at the cost of response latency growing to 128 ms, rather than the 50 ms mentioned in the report.

Is there anything wrong with my evaluation?

@ghostplant ghostplant changed the title How to reproduce R1 throughput 5x against H200? Not able to reproduce R1 throughput 5x against H200 Mar 26, 2025
@valarLip
Collaborator

@andyluo7 could you please comment?

@andyluo7

@ghostplant, you should be able to reproduce it by following https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html. If not, please send details of how you ran the benchmark to [email protected].

@ghostplant
Author

What I get is 780 TPS using bsz = 32, which is far from 5000 TPS.

@sunway513
Collaborator

I was able to repro the numbers using a different setup. @ghostplant, what's your MI300X CSP?

@sunway513
Collaborator

What I get is 780 TPS using bsz = 32, which is far from 5000 TPS

Wait, are you talking about output tokens or all tokens? 5K TPS should be the latter.

@ghostplant
Author

@sunway513 Hi, yes, 780 TPS for output tokens (assuming 30 s to complete), and I also see 4000 TPS for input tokens (also assuming 30 s to complete). If counting all tokens, I think the overall TPS should be total tokens / total time (60 s), that is: (4000 * 30 + 780 * 30) / (30 + 30) = 2390 TPS.

So it is still below half of 5K.

@sunway513
Collaborator

Why do you need to calculate it yourself? If you run the demonstrated sglang command, you should find the perf metrics in the output. cc @andyluo7 to correct me if I'm wrong.

@ghostplant
Author

Because the gap with sglang's realistic speed (780 TPS) is huge, I have no idea why it prints a very large value that is not achievable in practice.

@valarLip
Collaborator

@sunway513 Hi, yes 780 for output tokens (assume 30 sec to complete), and I also see 4000 for input tokens (assume 30 sec to complete as well). If calculating all tokens, I think the overall TPS should be total tokens / total times (60 sec), which is: (4000 * 30 + 780 * 30)/(30 + 30) = 2390 TPS.

So it is still below half of 5K.

Looking into the formula sglang uses, we can see it should be (4000 * 30 + 780 * 30) / 30 = 4780 TPS:
input_throughput=total_input / dur_s,
output_throughput=sum(output_lens) / dur_s,
total_throughput=(total_input + sum(output_lens)) / dur_s,
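To make the disagreement concrete, here is a minimal sketch of those three metrics in Python. The function name and the token counts are illustrative (taken from the numbers discussed in this thread), not sglang's actual code; the key point is that `dur_s` is the single wall-clock duration of the run, not the sum of separate prefill and decode windows.

```python
# Sketch of sglang's benchmark throughput metrics (names illustrative).
# dur_s is the elapsed wall-clock time of the whole run: prefill and
# decode happen within the same window, so the duration is NOT summed.
def throughputs(total_input: int, output_lens: list[int], dur_s: float) -> dict:
    return {
        "input_throughput": total_input / dur_s,
        "output_throughput": sum(output_lens) / dur_s,
        "total_throughput": (total_input + sum(output_lens)) / dur_s,
    }

# Thread's numbers: 4000 input TPS and 780 output TPS over the same 30 s run.
m = throughputs(total_input=4000 * 30, output_lens=[780 * 30], dur_s=30)
print(m["total_throughput"])  # 4780.0, not (4000*30 + 780*30) / 60 = 2390
```

Dividing by 60 s (as in the earlier comment) double-counts the duration, which is why the hand calculation came out at half of sglang's reported value.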

@ghostplant
Author

ghostplant commented Mar 30, 2025

@sunway513 Hi, yes 780 for output tokens (assume 30 sec to complete), and I also see 4000 for input tokens (assume 30 sec to complete as well). If calculating all tokens, I think the overall TPS should be total tokens / total times (60 sec), which is: (4000 * 30 + 780 * 30)/(30 + 30) = 2390 TPS.
So it is still below half of 5K.

look into the formula sglang used, we can see it should be (4000 * 30 + 780 * 30)/(30) = 4780 TPS.
input_throughput=total_input / dur_s,
output_throughput=sum(output_lens) / dur_s,
total_throughput=(total_input + sum(output_lens)) / dur_s,

OK, I can use a large prefill size and a small decode size to get above 5K TPS. But a realistic thinking task requires a small prefill size with a large decode size, in which case I only get 0.8K. That is not the case for the benchmark arguments you provided, which focus on maximizing the numbers to report. Thanks!
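The workload-mix sensitivity described above can be illustrated with a toy calculation. All numbers here are hypothetical, chosen only to show how total token throughput shifts with the prefill/decode ratio:

```python
# Illustration (hypothetical numbers): total token throughput counts
# prefill (input) tokens too, so a prefill-heavy benchmark mix can
# report a much higher total TPS than a decode-heavy "thinking" workload.
def total_tps(input_tokens: int, output_tokens: int, dur_s: float) -> float:
    return (input_tokens + output_tokens) / dur_s

# Prefill-heavy mix: many input tokens inflate the total figure.
print(total_tps(input_tokens=120_000, output_tokens=23_400, dur_s=30))  # 4780.0

# Decode-heavy mix: total TPS approaches the (much lower) decode rate.
print(total_tps(input_tokens=3_000, output_tokens=24_000, dur_s=30))    # 900.0
```

Same hardware, same formula; only the ratio of input to output tokens changes, which is the crux of the disagreement in this thread.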
