2,823 tokens/s seems extremely high! #4

romitjain · 2024-06-18T09:17:56Z

Hey,

I think the tokens/s calculation might be incorrect. The elapsed time is measured with a CPU clock here: https://github.com/likejazz/llama3.cuda/blob/master/llama3.cu#L789

This can give a misleading number because the actual work runs on the GPU and the CPU only dispatches the kernels. Kernel launches are asynchronous, so unless the host synchronizes with the device before the clock is read, I suspect the time you are getting is mostly the CPU time spent dispatching kernels rather than the GPU time spent executing them.

The correct way would be to use CUDA events.
Reference: https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/
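
A minimal sketch of what event-based timing could look like, not the actual patch for llama3.cu. The kernel `dummy_forward`, the helpers `tokens_per_second` and `benchmark_tokens_per_second`, and the buffer size are all hypothetical stand-ins; only the `cudaEvent*` calls are the real CUDA runtime API. The idea is to record one event before the generation loop and one after it, wait for the second event, and convert the elapsed GPU time to tokens/s.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical stand-in for one decoding step of the model.
__global__ void dummy_forward(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;
}

// Convert a token count and an elapsed time in milliseconds to tokens/s.
float tokens_per_second(int n_tokens, float elapsed_ms) {
    return elapsed_ms > 0.0f ? n_tokens * 1000.0f / elapsed_ms : 0.0f;
}

// Time n_tokens launches of the stand-in kernel with CUDA events.
float benchmark_tokens_per_second(int n_tokens) {
    const int n = 1 << 20;  // assumed buffer size, for illustration only
    float *d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d_x, 0, n * sizeof(float)));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    // Both events are enqueued on the default stream, around the GPU work.
    CUDA_CHECK(cudaEventRecord(start));
    for (int t = 0; t < n_tokens; ++t) {
        dummy_forward<<<(n + 255) / 256, 256>>>(d_x, n);
    }
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stop));

    // Block the host until the GPU has actually reached the stop event.
    CUDA_CHECK(cudaEventSynchronize(stop));

    float elapsed_ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&elapsed_ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaFree(d_x));
    return tokens_per_second(n_tokens, elapsed_ms);
}
```

Calling `benchmark_tokens_per_second(256)` from `main` and printing the result would give a GPU-side figure. A simpler alternative, if the existing CPU timer is kept, would be to call `cudaDeviceSynchronize()` right before the end time is read so the clock covers kernel execution and not just dispatch.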

likejazz · 2024-06-23T10:42:08Z

@romitjain Oh, thank you for the clarification. Could you please send me a patch?

meneraing mentioned this issue Aug 28, 2024

Use CUDA event API for benchmarking #6

Open
