Not able to reproduce R1 throughput 5x against H200 #245
Comments
It is exactly what I tried, but I only get 600 TPS using concurrency = 32. Is there anything wrong with my evaluation? |
@andyluo7 could you please give some comments |
@ghostplant, you should be able to reproduce following https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html. If not, please send how you ran the benchmark to [email protected]. |
What I get is 780 TPS using bsz = 32, which is far from 5000 TPS |
I was able to repro the numbers using a different setup.. @ghostplant what's your MI300x CSP? |
wait.. are you talking about output tokens or all tokens? 5k TPS should be the latter. |
@sunway513 Hi, yes, 780 is for output tokens (assuming 30 sec to complete), and I also see 4000 for input tokens (assuming 30 sec to complete as well). Calculating over all tokens, I think the overall TPS should be total tokens / total time (60 sec), which is: (4000 * 30 + 780 * 30)/(30 + 30) = 2390 TPS. So it is still below half of 5K. |
Why do you need to calculate it? If you go with the demonstrated sglang command, you should be able to find the perf metrics in the output. cc @andyluo7 to correct me if I'm wrong. |
Because the gap with SGLang's realistic speed (780 TPS) is huge. So I have no idea why it prints a very large value that is not realistically achievable. |
Looking at the formula sglang uses, we can see it should be (4000 * 30 + 780 * 30)/(30) = 4780 TPS. |
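The two ways of combining the numbers quoted in this thread can be checked with a short sketch. This is only the arithmetic from the comments above, not sglang's actual benchmark code; the function name `combined_tps` and its parameters are hypothetical and exist for illustration only.

```python
def combined_tps(input_tps: float, output_tps: float,
                 window_s: float, duration_s: float) -> float:
    """Total tokens per second over `duration_s`.

    Assumes (as in the thread) that the input and output rates were each
    observed over a window of `window_s` seconds. Hypothetical helper,
    not part of sglang.
    """
    total_tokens = input_tps * window_s + output_tps * window_s
    return total_tokens / duration_s


# Numbers taken from the comments above.
INPUT_TPS = 4000
OUTPUT_TPS = 780
WINDOW_S = 30

# Reading 1: prefill and decode counted as two back-to-back 30 s phases.
sequential = combined_tps(INPUT_TPS, OUTPUT_TPS, WINDOW_S, duration_s=60)

# Reading 2: all tokens divided by a single 30 s duration.
single_window = combined_tps(INPUT_TPS, OUTPUT_TPS, WINDOW_S, duration_s=30)

print(f"sequential:    {sequential:.0f} TPS")     # 2390 TPS
print(f"single window: {single_window:.0f} TPS")  # 4780 TPS
```

The token count is the same in both readings; only the duration in the denominator differs, which accounts for the factor of two between 2390 and 4780 TPS.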
OK, I can use a large prefill size and a small decode size to get above 5K TPS. But a realistic thinking task requires a small prefill size with a large decode size, and in that case I only get 0.8K. I think that is not the case for the benchmark arguments you provided, which focus on maximizing the reported numbers. Thanks! |
I see that Aiter-accelerated DeepSeek R1 is reported to reach 5x the throughput of H200. What is the proper setting to reproduce it?