add xpu support #396

mgrabban · 2024-11-19T17:42:22Z

Summary

Adds XPU support so that all tests, benchmarks, etc. also run on XPUs (Intel GPUs).

Details

The infer_device() function is moved to a separate file. In every file that previously hard-coded "cuda", infer_device is imported and "cuda" is replaced with the return value of a call to infer_device().
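For readers who have not opened the diff, here is a minimal sketch of what such a helper could look like. It is not copied from this PR: the split into pick_device and infer_device, the cuda-then-xpu-then-cpu order, and the "cpu" fallback are all assumptions made for illustration.

```python
def pick_device(cuda_available: bool, xpu_available: bool) -> str:
    """Pure selection logic, kept separate so it can be checked without a GPU.

    Assumed priority (not confirmed by the PR): cuda, then xpu, then cpu.
    """
    if cuda_available:
        return "cuda"
    if xpu_available:
        return "xpu"
    return "cpu"


def infer_device() -> str:
    """Return a device string for the current machine (hypothetical sketch)."""
    import torch  # imported lazily so pick_device stays usable without PyTorch

    cuda_ok = torch.cuda.is_available()
    # torch.xpu only exists in recent PyTorch builds, so guard the lookup.
    xpu_ok = hasattr(torch, "xpu") and torch.xpu.is_available()
    return pick_device(cuda_ok, xpu_ok)
```

A test that used to allocate tensors with device="cuda" would then call device = infer_device() once at module level and pass device=device instead.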

Testing Done

Hardware Type: A100 80GB PCIe, RTX 3060, Intel Data Center GPU Max 1550

- run make test to ensure correctness
- run make checkstyle to ensure code style
- run make test-convergence to ensure convergence

faaany · 2024-11-20T15:32:22Z

I ran the UTs on XPU, but got "Segmentation fault (core dumped)" at one test, under investigation.

mgrabban · 2024-11-20T16:06:30Z

> I ran the UTs on XPU, but got "Segmentation fault (core dumped)" at one test, under investigation.

Which specific Intel GPU did you test on?
Also does the test run if you just change "cuda" to "xpu" manually without using this PR?

Also: I just added xpu support to simpo_loss (which was added later on and still had "cuda" hard-coded).

faaany · 2024-11-21T07:44:42Z

> > I ran the UTs on XPU, but got "Segmentation fault (core dumped)" at one test, under investigation.

> Which specific Intel GPU did you test on? Also does the test run if you just change "cuda" to "xpu" manually without using this PR?

> Also: I just added xpu support to simpo_loss (which was added later on and still had "cuda" hard coded).

I use "Intel Data Center GPU Max 1550".

And I tested your latest code. All tests pass except "pytest -rA test/transformers/test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-2-128-512]", but this one is a known issue and got fixed in the latest pytorch-triton-xpu. Don't you have this issue?

mgrabban · 2024-11-21T15:24:17Z

> And I tested your latest code. All tests pass except "pytest -rA test/transformers/test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-2-128-512]", but this one is a known issue and got fixed in the latest pytorch-triton-xpu. Don't you have this issue?

I don't have this issue. It could be because I use nightly intel-xpu-backend-for-triton.

test_rms_norm.py::test_correctness[True-LlamaRMSNorm-0.0-llama-dtype0-0.0001-1e-06-2-128-512] PASSED                                                                           [  3%]
test_rms_norm.py::test_correctness[True-LlamaRMSNorm-0.0-llama-dtype0-0.0001-1e-06-5-123-123] PASSED                                                                           [  6%]
test_rms_norm.py::test_correctness[True-LlamaRMSNorm-0.0-llama-dtype1-0.2-0.02-2-128-512] PASSED                                                                               [  9%]
test_rms_norm.py::test_correctness[True-LlamaRMSNorm-0.0-llama-dtype1-0.2-0.02-5-123-123] PASSED                                                                               [ 12%]
test_rms_norm.py::test_correctness[True-GemmaRMSNorm-1.0-gemma-dtype0-0.0001-1e-06-2-128-512] PASSED                                                                           [ 15%]
test_rms_norm.py::test_correctness[True-GemmaRMSNorm-1.0-gemma-dtype0-0.0001-1e-06-5-123-123] PASSED                                                                           [ 18%]
test_rms_norm.py::test_correctness[True-GemmaRMSNorm-1.0-gemma-dtype1-0.2-0.02-2-128-512] PASSED                                                                               [ 21%]
test_rms_norm.py::test_correctness[True-GemmaRMSNorm-1.0-gemma-dtype1-0.2-0.02-5-123-123] PASSED                                                                               [ 25%]
test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype0-0.0001-1e-06-2-128-512] PASSED                                                                             [ 28%]
test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype0-0.0001-1e-06-5-123-123] PASSED                                                                             [ 31%]
test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-2-128-512] PASSED                                                                                 [ 34%]
test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-5-123-123] PASSED                                                                                 [ 37%]
test_rms_norm.py::test_correctness[False-LlamaRMSNorm-0.0-llama-dtype0-0.0001-1e-06-2-128-512] PASSED                                                                          [ 40%]
test_rms_norm.py::test_correctness[False-LlamaRMSNorm-0.0-llama-dtype0-0.0001-1e-06-5-123-123] PASSED                                                                          [ 43%]
test_rms_norm.py::test_correctness[False-LlamaRMSNorm-0.0-llama-dtype1-0.2-0.02-2-128-512] PASSED                                                                              [ 46%]
test_rms_norm.py::test_correctness[False-LlamaRMSNorm-0.0-llama-dtype1-0.2-0.02-5-123-123] PASSED                                                                              [ 50%]
test_rms_norm.py::test_correctness[False-GemmaRMSNorm-1.0-gemma-dtype0-0.0001-1e-06-2-128-512] PASSED                                                                          [ 53%]
test_rms_norm.py::test_correctness[False-GemmaRMSNorm-1.0-gemma-dtype0-0.0001-1e-06-5-123-123] PASSED                                                                          [ 56%]
test_rms_norm.py::test_correctness[False-GemmaRMSNorm-1.0-gemma-dtype1-0.2-0.02-2-128-512] PASSED                                                                              [ 59%]
test_rms_norm.py::test_correctness[False-GemmaRMSNorm-1.0-gemma-dtype1-0.2-0.02-5-123-123] PASSED                                                                              [ 62%]
test_rms_norm.py::test_correctness[False-BaseRMSNorm-0.0-none-dtype0-0.0001-1e-06-2-128-512] PASSED                                                                            [ 65%]
test_rms_norm.py::test_correctness[False-BaseRMSNorm-0.0-none-dtype0-0.0001-1e-06-5-123-123] PASSED                                                                            [ 68%]
test_rms_norm.py::test_correctness[False-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-2-128-512] PASSED                                                                                [ 71%]
test_rms_norm.py::test_correctness[False-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-5-123-123] PASSED                                                                                [ 75%]
test_rms_norm.py::test_correctness_functional[LlamaRMSNorm-0.0-llama-dtype0-0.0001-1e-06-2-2-8] PASSED                                                                         [ 78%]
test_rms_norm.py::test_correctness_functional[LlamaRMSNorm-0.0-llama-dtype0-0.0001-1e-06-9-7-41] PASSED                                                                        [ 81%]
test_rms_norm.py::test_correctness_functional[LlamaRMSNorm-0.0-llama-dtype1-0.2-0.02-2-2-8] PASSED                                                                             [ 84%]
test_rms_norm.py::test_correctness_functional[LlamaRMSNorm-0.0-llama-dtype1-0.2-0.02-9-7-41] PASSED                                                                            [ 87%]
test_rms_norm.py::test_correctness_functional[GemmaRMSNorm-1.0-gemma-dtype0-0.0001-1e-06-2-2-8] PASSED                                                                         [ 90%]
test_rms_norm.py::test_correctness_functional[GemmaRMSNorm-1.0-gemma-dtype0-0.0001-1e-06-9-7-41] PASSED                                                                        [ 93%]
test_rms_norm.py::test_correctness_functional[GemmaRMSNorm-1.0-gemma-dtype1-0.2-0.02-2-2-8] PASSED                                                                             [ 96%]
test_rms_norm.py::test_correctness_functional[GemmaRMSNorm-1.0-gemma-dtype1-0.2-0.02-9-7-41] PASSED                                                                            [100%]
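When one machine reproduces a failure and another does not, comparing installed package versions is usually the quickest first check. The sketch below uses only the standard library. The helper name installed_versions is made up for illustration, and the distribution names in the example call are assumptions: a nightly build of intel-xpu-backend-for-triton may register under a different name than the ones listed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Assumed distribution names; adjust them to match your environment.
    packages = ["torch", "triton", "pytorch-triton-xpu"]
    for name, ver in installed_versions(packages).items():
        print(f"{name}: {ver or 'not installed'}")
```

Posting this output from both machines next to the failing test ID makes it easier to see whether the Triton backend version explains the difference.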

faaany · 2024-11-22T00:23:34Z

> > And I tested your latest code. All tests pass except "pytest -rA test/transformers/test_rms_norm.py::test_correctness[True-BaseRMSNorm-0.0-none-dtype1-0.2-0.02-2-128-512]", but this one is a known issue and got fixed in the latest pytorch-triton-xpu. Don't you have this issue?

> I don't have this issue. It could be because I use nightly intel-xpu-backend-for-triton.

Thanks for the update. This PR looks good to me.

ByronHsu · 2024-11-22T17:38:12Z

@faaany @mgrabban can you fix the conflict so we can merge this ASAP?

ByronHsu · 2024-11-22T19:53:30Z

Looks good to me! @mgrabban I just invited you as a collaborator on this repo; can you check your email? After accepting, can you create a new branch in the main repo and open a new PR from that branch? Our CI currently has issues, so PRs from external contributors cannot run CI. Thanks in advance!!

mgrabban · 2024-11-22T22:44:16Z

> Looks good to me! @mgrabban i just invited you as the collab of this repo, can you check the email? After acceptance, can you create a new branch in the main repo, and create a new PR based on that branch? Our CI has issues currently, so any PR from external folks cannot run CI. Thanks in advance!!

This is done now. See #407

## Summary

Replica of #396

Adds xpu support so all tests, benchmarks etc. run on XPUs or Intel GPUs.

## Details

infer_device() function is moved to a separate file and in any file where previously "cuda" was needed, infer_device is imported and "cuda" is replaced with return value of a call to infer_device()

## Testing Done

A100 80GB PCIe, RTX 3060, Intel Data Center GPU Max 1550

- Hardware Type: <BLANK>
- [x] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence

---------

Co-authored-by: Shao Tang <[email protected]>

mgrabban and others added 4 commits November 19, 2024 08:28

add xpu support

4ced042

add xpu support

e84dfe0

Merge branch 'main' into xpu-support

076c816

Merge branch 'main' into xpu-support

df7785d

lancerts requested review from ByronHsu, shimizust and lancerts and removed request for ByronHsu November 19, 2024 22:51

add xpu support to simpo_loss

34d40a3

mgrabban added 2 commits November 22, 2024 11:49

Merge branch 'main' into xpu-support

8059abc

main to xpu-support

90d7113

replace cuda with device for xpu support

1748f9c

mgrabban mentioned this pull request Nov 22, 2024

Xpu support #407

Merged

3 tasks
