Xpu support #407

mgrabban · 2024-11-22T22:43:35Z

Summary

Replica of #396
Adds xpu support so all tests, benchmarks etc. run on XPUs or Intel GPUs.

Details

The infer_device() function is moved to a separate file. In every file that previously hard-coded "cuda", infer_device is now imported and "cuda" is replaced with the return value of infer_device().
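As a rough illustration of the pattern described above, here is a minimal sketch of what such a helper might look like. The body is an assumption, not the code from this PR: it prefers CUDA, then XPU, and falls back to "cpu", and it guards against environments where PyTorch or `torch.xpu` is missing.

```python
import importlib.util


def infer_device() -> str:
    """Return a device string for the first available backend.

    Hypothetical sketch: the selection order (cuda, then xpu, then cpu)
    is an assumption and may differ from the helper added in this PR.
    """
    # Fall back to CPU if PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "cpu"

    import torch

    if torch.cuda.is_available():
        return "cuda"

    # torch.xpu only exists in newer PyTorch builds, so look it up defensively.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    return "cpu"


# Call sites then use the returned string wherever "cuda" was hard-coded,
# e.g. torch.randn(4, device=device).
device = infer_device()
```

Keeping the lookup in one function means tests and benchmarks pick up a new backend without touching each call site.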

Testing Done

Hardware Type: A100 80GB PCIe, RTX 3060, Intel Data Center GPU Max 1550
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

ByronHsu · 2024-11-22T23:02:34Z

can you fix the checkstyle?

mgrabban · 2024-11-23T03:47:26Z

> can you fix the checkstyle?

Done now.

## Summary

Closes #411

1. The convergence tests all passed in the latest commit ([PR#407](#407)). Its CI worked fine: https://github.com/linkedin/Liger-Kernel/actions/runs/11983838113/job/33413899589?pr=407#step:5:984
2. Without any code changes inside Liger, the convergence tests now fail in the QWEN2VL cases; see #411. The root cause is solely that Hugging Face released a new transformers version that modified QWEN2VL. Since this is not a bug in the Liger qwen2vl implementation, it is acceptable to loosen the `rtol`s slightly.

BTW, there is some context that may be related: https://github.com/linkedin/Liger-Kernel/blob/0137757dcf769deac2b14646b7ab61374b8a58f6/test/convergence/test_mini_models.py#L530

## Testing Done

Yes. Full log below:

```
test/convergence/test_mini_models.py::test_mini_model[mini_llama3-32-0.0001-dtype0-1e-08-2e-05-0.0001-1e-05-0.005-1e-05] PASSED [ 5%]
test/convergence/test_mini_models.py::test_mini_model[mini_llama3-32-0.0001-dtype1-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 11%]
test/convergence/test_mini_models.py::test_mini_model[mini_mllama-32-0.0001-dtype2-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED [ 17%]
test/convergence/test_mini_models.py::test_mini_model[mini_mllama-32-0.0001-dtype3-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 23%]
test/convergence/test_mini_models.py::test_mini_model[mini_qwen2-32-0.0001-dtype4-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED [ 29%]
test/convergence/test_mini_models.py::test_mini_model[mini_qwen2-32-0.0001-dtype5-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 35%]
test/convergence/test_mini_models.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype6-8e-06-0.04-0.005-1e-05-0.005-1e-05] PASSED [ 41%]
test/convergence/test_mini_models.py::test_mini_model[mini_qwen2_vl-32-0.0001-dtype7-0.001-0.05-0.1-0.01-0.01-0.01] PASSED [ 47%]
test/convergence/test_mini_models.py::test_mini_model[mini_phi3-32-0.0001-dtype8-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED [ 52%]
test/convergence/test_mini_models.py::test_mini_model[mini_phi3-32-0.0001-dtype9-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 58%]
test/convergence/test_mini_models.py::test_mini_model[mini_mistral-32-0.0001-dtype10-1e-08-1e-05-0.005-1e-05-0.005-1e-05] PASSED [ 64%]
test/convergence/test_mini_models.py::test_mini_model[mini_mistral-32-0.0001-dtype11-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 70%]
test/convergence/test_mini_models.py::test_mini_model[mini_gemma1-32-0.0001-dtype12-1e-08-0.0001-0.005-1e-05-0.005-1e-05] PASSED [ 76%]
test/convergence/test_mini_models.py::test_mini_model[mini_gemma1-32-0.0001-dtype13-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 82%]
test/convergence/test_mini_models.py::test_mini_model[mini_gemma1.1-32-0.0001-dtype14-1e-08-0.0001-0.005-1e-05-0.005-1e-05] PASSED [ 88%]
test/convergence/test_mini_models.py::test_mini_model[mini_gemma1.1-32-0.0001-dtype15-0.001-0.01-0.1-0.01-0.01-0.01] PASSED [ 94%]
test/convergence/test_mini_models.py::test_mini_model[mini_gemma2-32-0.0001-dtype16-1e-08-0.0001-0.005-1e-05-0.005-1e-05] PASSED [100%]
================== 17 passed, 1 warning in 163.58s (0:02:43) ===================
```

- Hardware Type: A10G
- [X] run `make test` to ensure correctness
- [X] run `make checkstyle` to ensure code style
- [X] run `make test-convergence` to ensure convergence

Signed-off-by: Austin Liu <[email protected]>
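To make the effect of loosening an `rtol` concrete, here is a small sketch of the closeness check commonly used for such comparisons, `|actual - expected| <= atol + rtol * |expected|` (the formula documented for `torch.allclose`). The helper name `is_close` and the sample numbers are hypothetical and are not taken from the convergence tests.

```python
def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Scalar version of the check |actual - expected| <= atol + rtol * |expected|.

    Hypothetical helper for illustration; the real tests compare tensors.
    """
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Illustrative values only: two losses that differ by about 3%.
loss_with_patch = 1.00
loss_reference = 1.03

# A tight relative tolerance rejects the 3% gap ...
tight = is_close(loss_with_patch, loss_reference, rtol=1e-5, atol=8e-6)

# ... while a looser one accepts it.
loose = is_close(loss_with_patch, loss_reference, rtol=0.04, atol=8e-6)
```

Because the allowed gap scales with the magnitude of the expected value, raising `rtol` tolerates proportionally larger drift without changing the absolute floor set by `atol`.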

mgrabban and others added 8 commits November 19, 2024 08:28

add xpu support

4ced042

add xpu support

e84dfe0

Merge branch 'main' into xpu-support

076c816

Merge branch 'main' into xpu-support

df7785d

add xpu support to simpo_loss

34d40a3

Merge branch 'main' into xpu-support

8059abc

main to xpu-support

90d7113

replace cuda with device for xpu support

1748f9c

mgrabban mentioned this pull request Nov 22, 2024

add xpu support #396

Closed

3 tasks

mgrabban added 2 commits November 22, 2024 19:43

remove old infer_device

d6a7b25

fix style

4dd47a3

ByronHsu approved these changes Nov 23, 2024

ByronHsu merged commit 7e3683e into main Nov 23, 2024
3 checks passed

ByronHsu deleted the xpu_support branch November 23, 2024 04:03

austin362667 mentioned this pull request Nov 29, 2024

Adjust QWEN2 VL Loss rtol #412

Merged

3 tasks
