Commit 85136ca

[Doc] Update Release note and Known issues (#5394)

1 parent b95ea42 commit 85136ca

2 files changed (+56 −16 lines)

docs/tutorials/known_issues.md

Lines changed: 13 additions & 16 deletions
````diff
@@ -11,17 +11,8 @@ Troubleshooting
 Optimization for Horovod\* at the end of the execution and triggers this error.
 **Solution**: Do `import intel_extension_for_pytorch` before `import horovod.torch as hvd`.
 - **Problem**: Number of dpcpp devices should be greater than zero.
-- **Cause**: If you use Intel® Extension for PyTorch\* in a conda environment, you might encounter this error. Conda also ships the libstdc++.so dynamic library file that may conflict with the one shipped
-in the OS.
+- **Cause**: If you use Intel® Extension for PyTorch\* in a conda environment, you might encounter this error. Conda also ships the libstdc++.so dynamic library file that may conflict with the one shipped in the OS.
 - **Solution**: Export the `libstdc++.so` file path in the OS to an environment variable `LD_PRELOAD`.
-- **Problem**: Symbol undefined caused by `_GLIBCXX_USE_CXX11_ABI`.
-  ```bash
-  ImportError: undefined symbol: _ZNK5torch8autograd4Node4nameB5cxx11Ev
-  ```
-- **Cause**: Intel® Extension for PyTorch\* is compiled with `_GLIBCXX_USE_CXX11_ABI=1`. This symbol undefined issue appears when PyTorch\* is
-compiled with `_GLIBCXX_USE_CXX11_ABI=0`.
-- **Solution**: Pass `export GLIBCXX_USE_CXX11_ABI=1` and compile PyTorch\* with particular compiler which supports `_GLIBCXX_USE_CXX11_ABI=1`. We recommend using prebuilt wheels
-in [download server](https://pytorch-extension.intel.com/release-whl/stable/xpu/us/) to avoid this issue.
 - **Problem**: `-997 runtime error` when running some AI models on Intel® Arc™ Graphics family.
 - **Cause**: Some of the `-997 runtime error` are actually out-of-memory errors. As Intel® Arc™ Graphics GPUs have less device memory than Intel® Data Center GPU Flex Series 170 and Intel® Data Center GPU
 Max Series, running some AI models on them may trigger out-of-memory errors and cause them to report failure such as `-997 runtime error` most likely. This is expected. Memory usage optimization is working in progress to allow Intel® Arc™ Graphics GPUs to support more AI models.
````
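The `LD_PRELOAD` workaround for the conda `libstdc++.so` conflict above can be scripted; a minimal sketch assuming typical Linux library locations (the helper name and candidate paths are illustrative assumptions, not part of the product docs):

```python
# Hedged sketch: locate the OS copy of libstdc++.so.6 and print the
# LD_PRELOAD export line suggested by the known issue. The search paths
# are assumptions for common Linux distributions; adjust for yours.
import glob

def find_system_libstdcxx():
    """Return candidate system libstdc++.so.6 paths (possibly empty)."""
    candidates = []
    for pattern in ("/usr/lib/x86_64-linux-gnu/libstdc++.so.6",
                    "/usr/lib64/libstdc++.so.6"):
        candidates.extend(glob.glob(pattern))
    return candidates

libs = find_system_libstdcxx()
if libs:
    print(f'export LD_PRELOAD="{libs[0]}"')
else:
    print("No system libstdc++.so.6 found in the assumed locations.")
```

Setting `LD_PRELOAD` this way makes the dynamic loader resolve `libstdc++` symbols from the OS copy before conda's copy is consulted.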
````diff
@@ -32,6 +23,9 @@ Troubleshooting
 - **Problem**: Some workloads terminate with an error `CL_DEVICE_NOT_FOUND` after some time on WSL2.
 - **Cause**: This issue is due to the [TDR feature](https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys#tdrdelay) on Windows.
 - **Solution**: Try increasing TDRDelay in your Windows Registry to a large value, such as 20 (it is 2 seconds, by default), and reboot.
+- **Problem**: RuntimeError: Can't add devices across platforms to a single context. -33 (PI_ERROR_INVALID_DEVICE).
+- **Cause**: If you run Intel® Extension for PyTorch\* in a Windows environment where Intel® discrete GPU and integrated GPU co-exist, and the integrated GPU is not supported by Intel® Extension for PyTorch\* but is wrongly identified as the first GPU platform.
+- **Solution**: Disable the integrated GPU in your environment to work around. For long term, Intel® Graphics Driver will always enumerate the discrete GPU as the first device so that Intel® Extension for PyTorch\* could provide the fastest device to end framework users in such co-exist scenario based on that.

 ## Library Dependencies

````
````diff
@@ -118,13 +112,16 @@ Troubleshooting
 ```

 - **Problem**: ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
-torch 2.6.0+xpu requires intel-cmplr-lib-rt==2025.0.2, but you have intel-cmplr-lib-rt 2025.0.4 which is incompatible.
-torch 2.6.0+xpu requires intel-cmplr-lib-ur==2025.0.2, but you have intel-cmplr-lib-ur 2025.0.4 which is incompatible.
-torch 2.6.0+xpu requires intel-cmplr-lic-rt==2025.0.2, but you have intel-cmplr-lic-rt 2025.0.4 which is incompatible.
-torch 2.6.0+xpu requires intel-sycl-rt==2025.0.2, but you have intel-sycl-rt 2025.0.4 which is incompatible.
-- **Cause**: The intel-extension-for-pytorch v2.6.10+xpu uses Intel Compiler 2025.0.4 for a distributed feature fix, while torch v2.6.0+xpu is pinned with 2025.0.2.
-- **Solution**: Ignore the Error since actually torch v2.6.0+xpu is compatible with Intel Compiler 2025.0.4.

+```
+torch 2.6.0+xpu requires intel-cmplr-lib-rt==2025.0.2, but you have intel-cmplr-lib-rt 2025.0.4 which is incompatible.
+torch 2.6.0+xpu requires intel-cmplr-lib-ur==2025.0.2, but you have intel-cmplr-lib-ur 2025.0.4 which is incompatible.
+torch 2.6.0+xpu requires intel-cmplr-lic-rt==2025.0.2, but you have intel-cmplr-lic-rt 2025.0.4 which is incompatible.
+torch 2.6.0+xpu requires intel-sycl-rt==2025.0.2, but you have intel-sycl-rt 2025.0.4 which is incompatible.
+```
+
+- **Cause**: The intel-extension-for-pytorch v2.6.10+xpu uses Intel DPC++ Compiler 2025.0.4 to get a crucial bug fix in unified runtime, while torch v2.6.0+xpu is pinned with 2025.0.2.
+- **Solution**: Ignore the Error since actually torch v2.6.0+xpu is compatible with Intel Compiler 2025.0.4.

 ## Performance Issue

````
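The pip complaint above reduces to a version-pin mismatch; a toy sketch recomputing the conflict list from the versions quoted in the error text (the `conflicts` helper is illustrative, not a pip or torch API):

```python
# Hedged sketch: reproduce pip's dependency complaint as a plain pin check.
# Package names and versions are copied from the error text above.
pinned = {  # what torch 2.6.0+xpu declares
    "intel-cmplr-lib-rt": "2025.0.2",
    "intel-cmplr-lib-ur": "2025.0.2",
    "intel-cmplr-lic-rt": "2025.0.2",
    "intel-sycl-rt": "2025.0.2",
}
installed = {name: "2025.0.4" for name in pinned}  # shipped with IPEX v2.6.10+xpu

def conflicts(pins, have):
    """Return package names whose installed version differs from the pin."""
    return [n for n, v in pins.items() if have.get(n) != v]

print(conflicts(pinned, installed))  # all four packages miss the ==2025.0.2 pin
```

As the Solution says, the mismatch is benign here: torch v2.6.0+xpu works with the 2025.0.4 runtime despite the stricter pin.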
docs/tutorials/releases.md

Lines changed: 43 additions & 0 deletions
````diff
@@ -1,6 +1,49 @@
 Releases
 =============

+## 2.6.10+xpu
+
+Intel® Extension for PyTorch\* v2.6.10+xpu is the new release which supports Intel® GPU platforms (Intel® Data Center GPU Max Series, Intel® Arc™ Graphics family, Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 Mobile Processors and Intel® Data Center GPU Flex Series) based on PyTorch* 2.6.0.
+
+### Highlights
+
+- Intel® oneDNN v3.7 integration
+- Official PyTorch 2.6 prebuilt binaries support
+
+  Starting this release, Intel® Extension for PyTorch\* supports official PyTorch prebuilt binaries, as they are built with `_GLIBCXX_USE_CXX11_ABI=1` since PyTorch\* 2.6 and hence ABI compatible with Intel® Extension for PyTorch\* prebuilt binaries which are always built with `_GLIBCXX_USE_CXX11_ABI=1`.
+
+- Large Language Model (LLM) optimization
+
+  Intel® Extension for PyTorch\* provides support for a variety of custom kernels, which include commonly used kernel fusion techniques, such as `rms_norm` and `rotary_embedding`, as well as attention-related kernels like `paged_attention` and `chunked_prefill`, and `punica` kernel for serving multiple LoRA finetuned LLM. It also provides the MoE (Mixture of Experts) custom kernels including `topk_softmax`, `moe_gemm`, `moe_scatter`, `moe_gather`, etc. These optimizations enhance the functionality and efficiency of the ecosystem on Intel® GPU platform by improving the execution of key operations.
+
+  Besides that, Intel® Extension for PyTorch\* optimizes more LLM models for inference and finetuning, such as Phi3-vision-128k, phi3-small-128k, llama3.2-11B-vision, etc. A full list of optimized models can be found at [LLM Optimizations Overview](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/llm.html).
+
+- Serving framework support
+
+  Intel® Extension for PyTorch\* offers extensive support for various ecosystems, including [vLLM](https://github.com/vllm-project/vllm) and [TGI](https://github.com/huggingface/text-generation-inference), with the goal of enhancing performance and flexibility for LLM workloads on Intel® GPU platforms (intensively verified on Intel® Data Center GPU Max Series and Intel® Arc™ B-Series graphics on Linux). The vLLM/TGI features like chunked prefill, MoE (Mixture of Experts) etc. are supported by the backend kernels provided in Intel® Extension for PyTorch*. The support to low precision such as Weight Only Quantization (WOQ) INT4 is also enhanced in this release:
+  - The performance of INT4 GEMM kernel based on Generalized Post-Training Quantization (GPTQ) algorithm has been improved by approximately 1.3× compared with previous release. During the prefill stage, it achieves similar performance to FP16, while in the decode stage, it outperforms FP16 by approximately 1.5×.
+  - The support of Activation-aware Weight Quantization (AWQ) algorithm is added and the performance is on par with GPTQ without g_idx.
+
+- [Prototype] NF4 QLoRA finetuning using BitsAndBytes
+
+  Intel® Extension for PyTorch\* now supports QLoRA finetuning with BitsAndBytes on Intel® GPU platforms. It enables efficient adaptation of LLMs using NF4 4-bit quantization with LoRA, reducing memory usage while maintaining accuracy.
+
+- [Beta] Intel® Core™ Ultra Series 2 Mobile Processors support on Windows
+
+  Intel® Extension for PyTorch\* provides beta quality support of Intel® Core™ Ultra Series 2 Mobile Processors (codename Arrow Lake-H) on Windows in this release, based on redistributed PyTorch 2.6 prebuilt binaries with additional AOT compilation target for Arrow Lake-H in the [download server](https://pytorch-extension.intel.com/release-whl/stable/xpu/us/).
+
+- Hybrid ATen operator implementation
+
+  Intel® Extension for PyTorch\* uses ATen operators available in [Torch XPU Operators](https://github.com/intel/torch-xpu-ops) as much as possible and overrides very limited operators for better performance and broad data type support.
+
+### Breaking Changes
+
+- Intel® Data Center GPU Flex Series support is being deprecated and will no longer be available starting from the release after v2.6.10+xpu.
+
+### Known Issues
+
+Please refer to [Known Issues webpage](./known_issues.md).
+
 ## 2.5.10+xpu

 Intel® Extension for PyTorch\* v2.5.10+xpu is the new release which supports Intel® GPU platforms (Intel® Data Center GPU Max Series, Intel® Arc™ Graphics family, Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics and Intel® Data Center GPU Flex Series) based on PyTorch* 2.5.1.
````
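The ABI note in the 2.6.10+xpu highlights can be checked locally with PyTorch's public `torch.compiled_with_cxx11_abi()`; a small guarded sketch, since torch may not be installed in every environment (the wrapper function name is illustrative):

```python
# Hedged sketch: report whether the installed torch build uses the CXX11
# ABI. The release notes above state that _GLIBCXX_USE_CXX11_ABI=1 is
# required to match the prebuilt Intel Extension for PyTorch binaries.
def torch_cxx11_abi():
    """True/False for the installed torch build's CXX11 ABI, None if absent."""
    try:
        import torch
    except ImportError:
        return None  # torch not installed; nothing to check
    return bool(torch.compiled_with_cxx11_abi())

print(torch_cxx11_abi())
```

A result of `True` means the build is ABI compatible with the prebuilt extension binaries described in the release notes.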

0 commit comments