
[RFC] Extend bitsandbytes to support Intel hardware platforms #894

@jianan-gu

Description

Motivation

The current bitsandbytes library is bound to CUDA platforms. However, there is rapidly growing demand to run large language models (LLMs) on more platforms, such as Intel® CPU and GPU devices ("xpu" is the device tag for Intel GPUs in PyTorch). Therefore, we aim to extend bitsandbytes with Intel® CPU and GPU ecosystem support and optimizations, offering the same scope of low-precision computation features (8-bit and 4-bit) as CUDA.

Approach

To provide the 8-bit and 4-bit features on Intel platforms, we propose two major changes:

  1. A device abstraction that allows non-CUDA devices to be added to bitsandbytes easily. It consists of a device backend abstraction defining the key kernel interfaces to be implemented by each backend, a backend registration interface for adding new device backends, and a kernel dispatching mechanism (see the sketch after this list).
  2. A lightweight enabling of Intel CPU and GPU support on top of the device abstraction. We plan to leverage the PyTorch 2.x compiler stack and the custom kernels provided by Intel Extension for PyTorch (IPEX) to support Intel CPU and GPU without needing to upstream native backend code. This reduces the complexity of adding new devices to bitsandbytes and also lowers maintenance costs.
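
As a rough illustration of the first point, the abstraction could be a registry of per-device backends with dispatch based on the input tensor's device. This is only a minimal sketch under assumed names (register_backend, get_backend, and _backends are illustrative, not the final API):

import torch

_backends = {}  # device type ("cuda", "cpu", "xpu") -> backend implementation

def register_backend(device_type, backend):
    # Backend registration interface: a new device adds its backend here.
    _backends[device_type] = backend

def get_backend(tensor):
    # Kernel dispatch: pick the backend that matches the input tensor's device.
    if tensor.device.type not in _backends:
        raise RuntimeError(f"No bitsandbytes backend registered for '{tensor.device.type}'")
    return _backends[tensor.device.type]

def igemmlt(A, B, *args, **kwargs):
    # Public functions forward to the registered backend, so the user-facing
    # API stays the same regardless of the device.
    return get_backend(A).igemmlt(A, B, *args, **kwargs)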

Device abstraction

We will extend the CUDA-only device setup and initialization in bitsandbytes to Intel CPU/GPU and provide a common device abstraction for general devices (there will be no changes to CUDA).
Note that there is also no API or usage change for Hugging Face users when using bitsandbytes on different devices:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)

Lightweight integration for Intel CPU/GPU

We will do a lightweight, simple integration to enable the low-precision computation features, both 8-bit and 4-bit. We do not plan to add native backend code for Intel CPU and GPU in the first step. Instead, we will employ the PyTorch 2.x compiler stack and Intel® Extension for PyTorch to enable those features.

  • For performance-critical functions, such as GEMM, we will import IPEX as a Python module and use its API for computation. IPEX can provide the best performance for such functions across Intel devices (for example, on 4th-generation Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions instruction set).
    • IPEX, as an optional speedup component, has already been integrated into mainstream Hugging Face tools such as the Trainer class and the accelerate repo to speed up training and inference. Similarly, we will integrate IPEX with bitsandbytes as a Python library dependency.

For example:

import torch
import intel_extension_for_pytorch  # registers the torch.ops.torch_ipex operators

def cpu_xpu_igemmlt(A_i8, B_i8, *args, **kwargs):
    # ... setup before computation ...
    C_i32 = torch.ops.torch_ipex.matmul_i8i8i32(A_i8, B_i8)  # int8 GEMM computation
    # ... other post-processing ...
    return C_i32
  • For other functions, we will adopt the PyTorch 2.x compilation technology: implement them with basic PyTorch operators in Python and optimize them with torch.compile to get good performance. Intel is one of the major contributors to the torch.compile CPU backend in PyTorch and also hosts the torch.compile GPU backend in IPEX. This implementation can also work for other devices that support the PyTorch 2.x compiler stack.

For example:

@torch.compile
def double_quant_cpu_xpu(*args, **kwargs):
    # Implement double_quant for Intel CPU/GPU with plain PyTorch ops;
    # torch.compile generates and compiles the kernel code at runtime.
    ...
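
To make the idea concrete, a simplified version of double_quant can be written entirely with PyTorch operators and compiled. The sketch below only covers symmetric row-wise and column-wise int8 quantization and omits the outlier handling of the real bitsandbytes function; the name double_quant_reference is illustrative:

import torch

@torch.compile
def double_quant_reference(A):
    # Row-wise and column-wise absmax statistics of the fp16/bf16 input.
    row_stats = A.abs().amax(dim=1, keepdim=True).float().clamp_min(1e-8)
    col_stats = A.abs().amax(dim=0, keepdim=True).float().clamp_min(1e-8)
    # Symmetric int8 quantization along rows and along columns.
    CA = torch.round(A * (127.0 / row_stats)).to(torch.int8)
    CAt = torch.round(A * (127.0 / col_stats)).to(torch.int8)
    return CA, CAt, row_stats.squeeze(1), col_stats.squeeze(0)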

Design

(1) Reorganize device_setup to support multiple devices

Intel CPU or GPU

  1. Check is_ipex_available.
  2. Import IPEX ops (and also check Intel GPU device availability).

CUDA

  1. Remains the same: load from lib_cuda.so.
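
A rough sketch of how the reorganized setup could probe for usable devices; is_ipex_available and available_devices are illustrative helper names, not the final API:

import torch

def is_ipex_available():
    # Check whether Intel Extension for PyTorch can be imported.
    try:
        import intel_extension_for_pytorch  # noqa: F401
        return True
    except ImportError:
        return False

def available_devices():
    # Decide which backends can be initialized on this machine;
    # the CUDA path stays as-is and keeps loading its native library.
    devices = []
    if torch.cuda.is_available():
        devices.append("cuda")
    if is_ipex_available():
        devices.append("cpu")   # CPU kernels come from IPEX / torch.compile
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            devices.append("xpu")  # Intel GPU
    return devices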


(2) Device backend abstraction with key kernel interfaces

Key functions used in the mainstream 8-bit and 4-bit paths:

  • Performance-critical:

    F.igemmlt

  • Others:

    F.double_quant, F.mm_dequant, F.transform, F.extract_outliers, F.quantize_4bit, F.dequantize_4bit

To extend support for the above functions to Intel CPU/GPU (CUDA remains the same), we propose the following design:
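
A minimal sketch of what the backend interface could look like, using the function names listed above; the class name and signatures are illustrative, and the existing CUDA kernels as well as the new Intel CPU/GPU implementations would each provide a concrete subclass:

from abc import ABC, abstractmethod

class Backend(ABC):
    """Kernel interfaces that every device backend must implement."""

    @abstractmethod
    def igemmlt(self, A, B, *args, **kwargs): ...  # performance-critical int8 GEMM

    @abstractmethod
    def double_quant(self, A, *args, **kwargs): ...

    @abstractmethod
    def mm_dequant(self, A, *args, **kwargs): ...

    @abstractmethod
    def transform(self, A, *args, **kwargs): ...

    @abstractmethod
    def extract_outliers(self, A, *args, **kwargs): ...

    @abstractmethod
    def quantize_4bit(self, A, *args, **kwargs): ...

    @abstractmethod
    def dequantize_4bit(self, A, *args, **kwargs): ...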

PR plans:

  • Enable device abstraction for Intel CPU/GPU and CUDA
    Add options to initialize the Intel CPU/GPU devices, with no implementations yet; CUDA remains the same.
  • Enable 8-bit functionality for Intel CPU/GPU
    Add implementations of the 8-bit functions for Intel CPU/GPU devices.
  • Enable 4-bit functionality for Intel CPU/GPU
    Add implementations of the 4-bit functions for Intel CPU/GPU devices.

Additional content

In addition, we will propose a PR upstream to Transformers to extend the usage of the bitsandbytes API to multiple devices.

Transformers changes

  • _bitsandbytes_available: the check should not be limited to CUDA device availability
  • Use CUDA, CPU, and Intel GPU devices here
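
For illustration only (this is not the actual Transformers code), the device selection on the Transformers side could look roughly like the following, with _bnb_supported_device as a hypothetical helper:

import importlib.util
import torch

def _bnb_supported_device():  # hypothetical helper name
    # bitsandbytes must be importable at all.
    if importlib.util.find_spec("bitsandbytes") is None:
        raise ImportError("bitsandbytes is not installed")
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"  # Intel GPU
    return "cpu"  # CPU path is always available via IPEX / torch.compile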
