
[RFC] Extend bitsandbytes to support Intel hardware platforms #894

@jianan-gu

Description

Motivation

The current bitsandbytes library is bound to CUDA platforms. However, there is rapidly growing demand to run large language models (LLMs) on more platforms, such as Intel® CPU and GPU devices ("xpu" is the device tag for Intel GPUs in PyTorch). Therefore, we aim to extend bitsandbytes with Intel® CPU and GPU ecosystem support and optimizations, offering the same scope of low-precision computation features (8-bit and 4-bit) as CUDA.

Approach

To provide the 8-bit and 4-bit features on Intel platforms, we propose two major changes:

  1. A device abstraction that allows non-CUDA devices to be added to bitsandbytes easily. It consists of a device backend abstraction defining the key kernel interfaces to be implemented by each backend, a backend registration interface for adding new device backends, and a kernel dispatching mechanism (see the sketch after this list).
  2. A lightweight enabling of Intel CPU and GPU support on top of the device abstraction. We plan to leverage the PyTorch 2.x compiler stack and the custom kernels provided by Intel Extension for PyTorch (IPEX) to support Intel CPU and GPU without needing to upstream native backend code. This reduces the complexity of adding new devices to bitsandbytes and also lowers maintenance costs.
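
As a rough illustration of the first point, the abstraction could be a registry of per-device backends with dispatch based on the input tensor's device. This is only a minimal sketch under assumed names (register_backend, get_backend, and _backends are illustrative, not the final API):

import torch

_backends = {}  # device type ("cuda", "cpu", "xpu") -> backend implementation

def register_backend(device_type, backend):
    # Backend registration interface: a new device adds its backend here.
    _backends[device_type] = backend

def get_backend(tensor):
    # Kernel dispatch: pick the backend that matches the input tensor's device.
    if tensor.device.type not in _backends:
        raise RuntimeError(f"No bitsandbytes backend registered for '{tensor.device.type}'")
    return _backends[tensor.device.type]

def igemmlt(A, B, *args, **kwargs):
    # Public functions forward to the registered backend, so the user-facing
    # API stays the same regardless of the device.
    return get_backend(A).igemmlt(A, B, *args, **kwargs)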

Device abstraction

We will extend the CUDA-only device setup and initialization in bitsandbytes to Intel CPU/GPU and provide a common device abstraction for general devices (there will be no changes to CUDA).
Note that there is also no API or usage change for Hugging Face users when using bitsandbytes on different devices:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

Lightweight integration for Intel CPU/GPU

We will do a lightweight, simple integration to enable the low-precision computation features, both 8-bit and 4-bit. We do not plan to add native backend code for Intel CPU and GPU in the first step. Instead, we will employ the PyTorch 2.x compiler stack and Intel® Extension for PyTorch to enable those features.

  • For performance-critical functions, such as GEMM, we will import IPEX as a Python module and use its API for computation. IPEX can provide the best performance for such functions across Intel devices (for example, on 4th-generation Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions instruction set).
    • IPEX, as an optional speedup component, has already been integrated into mainstream Hugging Face tools such as the Trainer class and the accelerate repo to speed up training and inference. Similarly, we will integrate IPEX with bitsandbytes as a Python library dependency.

For example:

import torch
import intel_extension_for_pytorch  # registers the torch.ops.torch_ipex operators

def cpu_xpu_igemmlt(A_i8, B_i8, *args, **kwargs):
    # ... setup before computation ...
    C_i32 = torch.ops.torch_ipex.matmul_i8i8i32(A_i8, B_i8)  # int8 GEMM computation
    # ... other post-processing ...
    return C_i32
  • For other functions, we will adopt the PyTorch 2.x compilation technology: implement them with basic PyTorch operators in Python and optimize them with torch.compile to get good performance. Intel is one of the major contributors to the torch.compile CPU backend in PyTorch and also hosts the torch.compile GPU backend in IPEX. This implementation can also work for other devices that support the PyTorch 2.x compiler stack.

For example:

@torch.compile
def double_quant_cpu_xpu(*args, **kwargs):
    # Implement double_quant for Intel CPU/GPU with plain PyTorch ops;
    # torch.compile generates and compiles the kernel code at runtime.
    ...
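
To make the idea concrete, a simplified version of double_quant can be written entirely with PyTorch operators and compiled. The sketch below only covers symmetric row-wise and column-wise int8 quantization and omits the outlier handling of the real bitsandbytes function; the name double_quant_reference is illustrative:

import torch

@torch.compile
def double_quant_reference(A):
    # Row-wise and column-wise absmax statistics of the fp16/bf16 input.
    row_stats = A.abs().amax(dim=1, keepdim=True).float().clamp_min(1e-8)
    col_stats = A.abs().amax(dim=0, keepdim=True).float().clamp_min(1e-8)
    # Symmetric int8 quantization along rows and along columns.
    CA = torch.round(A * (127.0 / row_stats)).to(torch.int8)
    CAt = torch.round(A * (127.0 / col_stats)).to(torch.int8)
    return CA, CAt, row_stats.squeeze(1), col_stats.squeeze(0)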

Design

(1) Reorganize device_setup to support multiple devices

Intel CPU or GPU

  1. Check is_ipex_available.
  2. Import IPEX ops (and also check Intel GPU device availability).

CUDA

  1. Remains the same: load from lib_cuda.so.
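
A rough sketch of how the reorganized setup could probe for usable devices; is_ipex_available and available_devices are illustrative helper names, not the final API:

import torch

def is_ipex_available():
    # Check whether Intel Extension for PyTorch can be imported.
    try:
        import intel_extension_for_pytorch  # noqa: F401
        return True
    except ImportError:
        return False

def available_devices():
    # Decide which backends can be initialized on this machine;
    # the CUDA path stays as-is and keeps loading its native library.
    devices = []
    if torch.cuda.is_available():
        devices.append("cuda")
    if is_ipex_available():
        devices.append("cpu")   # CPU kernels come from IPEX / torch.compile
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            devices.append("xpu")  # Intel GPU
    return devices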


(2) Device backend abstraction with key kernel interfaces

Key functions used in the mainstream 8-bit and 4-bit paths:

  • Performance-critical:

    F.igemmlt

  • Others:

    F.double_quant, F.mm_dequant, F.transform, F.extract_outliers, F.quantize_4bit, F.dequantize_4bit

To extend support for the above functions to Intel CPU/GPU (CUDA remains the same), we propose the following design:
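
A minimal sketch of what the backend interface could look like, using the function names listed above; the class name and signatures are illustrative, and the existing CUDA kernels as well as the new Intel CPU/GPU implementations would each provide a concrete subclass:

from abc import ABC, abstractmethod

class Backend(ABC):
    """Kernel interfaces that every device backend must implement."""

    @abstractmethod
    def igemmlt(self, A, B, *args, **kwargs): ...  # performance-critical int8 GEMM

    @abstractmethod
    def double_quant(self, A, *args, **kwargs): ...

    @abstractmethod
    def mm_dequant(self, A, *args, **kwargs): ...

    @abstractmethod
    def transform(self, A, *args, **kwargs): ...

    @abstractmethod
    def extract_outliers(self, A, *args, **kwargs): ...

    @abstractmethod
    def quantize_4bit(self, A, *args, **kwargs): ...

    @abstractmethod
    def dequantize_4bit(self, A, *args, **kwargs): ...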

PR plans:

  • Enable device abstraction for Intel CPU/GPU and CUDA
    Add options to initialize the Intel CPU/GPU devices, with no implementations yet; CUDA remains the same.
  • Enable 8-bit functionality for Intel CPU/GPU
    Add implementations of the 8-bit functions for Intel CPU/GPU devices.
  • Enable 4-bit functionality for Intel CPU/GPU
    Add implementations of the 4-bit functions for Intel CPU/GPU devices.

Additional content

In addition, we will propose a PR upstream to Transformers to extend the usage of the bitsandbytes API to multiple devices.

Transformers changes

  • _bitsandbytes_available: the check should not be limited to CUDA device availability
  • Use CUDA, CPU, and Intel GPU devices here
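
For illustration only (this is not the actual Transformers code), the device selection on the Transformers side could look roughly like the following, with _bnb_supported_device as a hypothetical helper:

import importlib.util
import torch

def _bnb_supported_device():  # hypothetical helper name
    # bitsandbytes must be importable at all.
    if importlib.util.find_spec("bitsandbytes") is None:
        raise ImportError("bitsandbytes is not installed")
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"  # Intel GPU
    return "cpu"  # CPU path is always available via IPEX / torch.compile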
