2,823 tokens/s seems extremely high! #4

Open
romitjain opened this issue Jun 18, 2024 · 1 comment

Comments

@romitjain

Hey,

I think the tokens/s calculation might be incorrect. As far as I can see, you are measuring elapsed time with a CPU clock here: https://github.com/likejazz/llama3.cuda/blob/master/llama3.cu#L789

This can produce a misleading number: the actual work runs on the GPU, and the CPU only launches the kernels. Kernel launches are asynchronous with respect to the host, so unless the host waits for the GPU to finish before reading the clock, I suspect the time you are measuring is mostly the CPU cost of dispatching the kernels.
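To illustrate the concern, here is a minimal sketch (not code from llama3.cu) that times the same kernel with a host-side clock twice: once immediately after the launch and once after `cudaDeviceSynchronize()`. The kernel `busy_kernel`, its launch configuration and iteration counts are hypothetical, and `std::chrono::steady_clock` stands in for whatever host timer the project actually uses.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical busy-work kernel standing in for one forward pass.
__global__ void busy_kernel(float *out, int iters) {
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        x = x * 0.999f + 0.001f;  // dependent chain, so the loop cannot be skipped
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed on the host since t0.
static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const int blocks = 256, threads = 256;
    float *d_out = nullptr;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));  // error checks omitted

    busy_kernel<<<blocks, threads>>>(d_out, 1);  // warm-up launch
    cudaDeviceSynchronize();

    // 1) Host timer, no synchronization: the launch returns before the
    //    kernel finishes, so this mostly captures launch overhead.
    auto t0 = Clock::now();
    busy_kernel<<<blocks, threads>>>(d_out, 1 << 20);
    double launch_only_ms = ms_since(t0);
    cudaDeviceSynchronize();  // drain the GPU before the next measurement

    // 2) Host timer with synchronization: waits until the kernel completes.
    t0 = Clock::now();
    busy_kernel<<<blocks, threads>>>(d_out, 1 << 20);
    cudaDeviceSynchronize();
    double synced_ms = ms_since(t0);

    printf("without sync: %.3f ms, with sync: %.3f ms\n", launch_only_ms, synced_ms);
    cudaFree(d_out);
    return 0;
}
```

The first figure would typically be much smaller than the second. Note that a blocking `cudaMemcpy` back to the host also waits for earlier work on the default stream, so whether a particular timing loop is affected depends on where such synchronization points fall.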

The correct way is to use CUDA events.
Reference: https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
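A minimal sketch of the CUDA-event approach described in the linked NVIDIA post, assuming a generation loop that launches one kernel per token. The `fake_forward` kernel, the token count and the `tokens_per_second` helper are hypothetical and only illustrate the pattern; they are not taken from the project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one token's forward pass.
__global__ void fake_forward(float *out, int iters) {
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) {
        x = x * 0.999f + 0.001f;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Throughput in tokens/s from a token count and elapsed milliseconds.
static double tokens_per_second(int n_tokens, float elapsed_ms) {
    return n_tokens / (elapsed_ms / 1000.0);
}

int main() {
    const int blocks = 128, threads = 256, n_tokens = 64;
    float *d_out = nullptr;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));  // error checks omitted

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record 'start', enqueue every per-token launch, then record 'stop'
    // on the same (default) stream.
    cudaEventRecord(start);
    for (int t = 0; t < n_tokens; ++t) {
        fake_forward<<<blocks, threads>>>(d_out, 1 << 16);
    }
    cudaEventRecord(stop);

    // Block the host until the GPU reaches 'stop', i.e. all launches above finished.
    cudaEventSynchronize(stop);

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, start, stop);
    printf("%d tokens in %.3f ms -> %.1f tokens/s\n",
           n_tokens, elapsed_ms, tokens_per_second(n_tokens, elapsed_ms));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

`cudaEventElapsedTime` reports the time in milliseconds between the two recorded events as observed on the GPU, so the result no longer reflects only the CPU cost of issuing launches.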

@likejazz
Owner

@romitjain Oh, thank you for the clarification. Could you please send me a patch?
