[Bugfix][Kernel] Fix incorrect output tokens when running ChatGLM-9b inference on MI250 #312
[Issue Description]
We encountered random and incorrect output tokens when running ChatGLM-9b inference with the latest rocm/vllm release on MI250.
[Root Cause]
The vectorized rms_norm_kernel stores intermediate results in the same data type as the input tensors (e.g. FP16), which causes precision loss and leads to incorrect output tokens.
[Solution]
Store intermediate results in float, which yields correct output tokens.
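To illustrate the idea, below is a minimal sketch of an RMSNorm kernel that keeps its intermediates in float (hypothetical kernel name and simplified structure, not the actual vLLM/ROCm source): the squared sum and the normalization factor stay in float even when the input type is FP16, and values are converted back to the input type only on the final store.

```cpp
#include <cuda_fp16.h>

// Sketch of an RMSNorm kernel that keeps intermediates in float.
// Launch with one block per token, blockDim.x a power of two, and
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
template <typename scalar_t>
__global__ void rms_norm_kernel_sketch(scalar_t* __restrict__ out,
                                       const scalar_t* __restrict__ input,
                                       const scalar_t* __restrict__ weight,
                                       const float epsilon,
                                       const int hidden_size) {
  extern __shared__ float s_partial[];
  const scalar_t* row = input + static_cast<size_t>(blockIdx.x) * hidden_size;

  // Accumulate the squared sum in float, not in the input dtype.
  float sq_sum = 0.0f;
  for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
    const float x = static_cast<float>(row[i]);
    sq_sum += x * x;
  }
  s_partial[threadIdx.x] = sq_sum;
  __syncthreads();

  // Simple shared-memory tree reduction (a production kernel would likely
  // use warp shuffles or a block-reduce primitive instead).
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      s_partial[threadIdx.x] += s_partial[threadIdx.x + stride];
    }
    __syncthreads();
  }

  __shared__ float s_inv_rms;
  if (threadIdx.x == 0) {
    s_inv_rms = rsqrtf(s_partial[0] / hidden_size + epsilon);
  }
  __syncthreads();

  // Normalize in float; cast back to the input dtype only on the final store.
  for (int i = threadIdx.x; i < hidden_size; i += blockDim.x) {
    const float x = static_cast<float>(row[i]);
    out[static_cast<size_t>(blockIdx.x) * hidden_size + i] =
        static_cast<scalar_t>(x * s_inv_rms * static_cast<float>(weight[i]));
  }
}
```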
We compared the outputs of the different kernels against PyTorch's standard op and verified the correctness of the final output tokens. The test results are below:
After the fix is applied, the number of mismatched elements is reduced significantly without impacting performance, the absolute and relative differences are close to or better than those of the non-vectorized kernel, and the final output tokens are correct.
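For context, the "mismatched elements" and absolute/relative difference figures above correspond to a comparison of roughly the following shape (an illustrative host-side helper, not the actual test script; the tolerance values are placeholders):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct CompareStats {
  long long mismatched = 0;
  double max_abs_diff = 0.0;
  double max_rel_diff = 0.0;
};

// Compare a kernel output against a reference computed by PyTorch's standard
// op (both copied back to host as float): count elements outside atol/rtol
// bounds and track the maximum absolute and relative differences.
CompareStats compare_outputs(const std::vector<float>& got,
                             const std::vector<float>& ref,
                             double atol = 1e-3, double rtol = 1e-2) {
  CompareStats s;
  for (size_t i = 0; i < got.size(); ++i) {
    const double g = got[i], r = ref[i];
    const double abs_diff = std::fabs(g - r);
    const double rel_diff = abs_diff / std::max(std::fabs(r), 1e-12);
    s.max_abs_diff = std::max(s.max_abs_diff, abs_diff);
    s.max_rel_diff = std::max(s.max_rel_diff, rel_diff);
    // Same mismatch rule as torch.allclose: |a - b| <= atol + rtol * |b|.
    if (abs_diff > atol + rtol * std::fabs(r)) ++s.mismatched;
  }
  return s;
}

int main() {
  // Dummy data just to show usage; in practice `got` is the kernel output
  // and `ref` is the PyTorch reference.
  std::vector<float> ref = {1.0f, 2.0f, 3.0f};
  std::vector<float> got = {1.0005f, 2.0f, 3.1f};
  const CompareStats s = compare_outputs(got, ref);
  std::printf("mismatched=%lld max_abs=%.6f max_rel=%.6f\n",
              s.mismatched, s.max_abs_diff, s.max_rel_diff);
  return 0;
}
```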