From a10bad56371757bb60001d25b5fcf828e01c7e39 Mon Sep 17 00:00:00 2001
From: Nir David
Date: Mon, 16 Dec 2024 16:55:17 +0200
Subject: [PATCH] fix CR comments

---
 docs/source/quantization/inc.rst | 46 +++++++++++++++-----------------
 1 file changed, 22 insertions(+), 24 deletions(-)

diff --git a/docs/source/quantization/inc.rst b/docs/source/quantization/inc.rst
index ff8cbb48a6b6d..4996fe0b6ad47 100644
--- a/docs/source/quantization/inc.rst
+++ b/docs/source/quantization/inc.rst
@@ -1,21 +1,22 @@
 .. _INC:
 
 FP8 INC
-==================
+=======
 
 vLLM supports FP8 (8-bit floating point) weight and activation quantization using INC (Intel Neural Compressor) on hardware acceleration of Intel Gaudi (HPU).
-Currently, only Llama models quntization are supported.
+Currently, quantization is supported only for Llama models.
 
 Please visit the Intel Gaudi documentation of `Run Inference Using FP8 `_.
 
-In order to run Inference it is required to have Measurements/Scales files:
+In order to run inference, it is required to have measurement/scale files:
 
-Retrieve Measurements
----------------------
+Obtain Measurements
+-------------------
 
 To obtain measurement files:
-* Use the "inc" quantization method (as parameter to the LLM object).
-* Call shutdown_inc and shutdown methods of the model_executor in the end of the run.
+* Set the "QUANT_CONFIG" environment variable to point to the `JSON config file `_ with MEASURE mode.
+* Pass ``quantization=inc`` as a parameter to the ``LLM`` object.
+* Call the ``shutdown_inc`` and ``shutdown`` methods of the ``model_executor`` at the end of the run.
 
 .. code-block:: python
 
@@ -27,24 +28,23 @@ To obtain measurement files:
     llm.llm_engine.model_executor.shutdown_inc()
     llm.llm_engine.model_executor.shutdown()
 
-.. note::
-
-    Make sure to supply the "QUANT_CONFIG" environment variable which points to the `Json config file `_ with MEASURE mode.
-
 Run Inference Using FP8
 -----------------------
 
-Intel Gaudi supports quantization of Linear Layers, KV-Cache and functions like Matmul and Softamx as shown in:
+Intel Gaudi supports quantization of various modules and functions, including, but not limited to, ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:
 `Supported Modules `_.
 `Supported Functions `_.
 
-In order to run Inference it requires to have Scales which located in scale files according to the `Json config file `_ dump_stats_path.
-If none exist they can be generated during inference run using the measurement files (should be located in the same folder).
+In order to run inference, scales are required; they are located in scale files according to the `JSON config file `_ ``dump_stats_path``.
+If none exist, they can be generated during the inference run using the measurement files (which should be located in the same folder).
 
 To run inference (and obtain scale files):
-* Use the "inc" quantization method (as parameter to the LLM object).
-* Use the "fp8_inc" kv cache dtype (as parameter to the LLM object).
-* Call shutdown method of the model_executor in the end of the run.
+* Set the "QUANT_CONFIG" environment variable to point to the `JSON config file `_ with QUANTIZE mode.
+* Pass ``quantization=inc`` as a parameter to the ``LLM`` object.
+* Pass ``fp8_inc`` as the KV cache data type:
+  * Offline inference: pass ``kv_cache_dtype=fp8_inc`` as a parameter to the ``LLM`` object.
+  * Online inference: pass ``--kv-cache-dtype=fp8_inc`` as a command line parameter.
+* Call the ``shutdown`` method of the ``model_executor`` at the end of the run.
 
 .. code-block:: python
 
@@ -55,17 +55,15 @@ To run inference (and obtain scale files):
     ...
     llm.llm_engine.model_executor.shutdown()
 
-.. note::
-
-    Make sure to supply the "QUANT_CONFIG" environment variable which points to the `Json config file `_ with QUANTIZE mode.
-
 Specifying Device for the Model's Weights Uploading
 ---------------------------------------------------
 
-It is possible to upload the (unquantized) weights on a different device before qunantizing them
-and moving to the device on which the model will run.
-Use the weights_load_device parameter for the LLM object to specify this device.
+It is possible to load the unquantized weights on a different device before quantizing them,
+and then move them to the device on which the model will run. This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
+To set the load device, use the ``weights_load_device`` parameter for the ``LLM`` object, or the ``--weights-load-device`` command line parameter in online mode.
+
 .. code-block:: python
+
     from vllm import LLM
     llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc", weights_load_device="cpu")
 
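For reference, the measurement flow described in the patched section can be sketched end to end as follows. This is a minimal sketch: the config file name and the calibration prompts are illustrative, and ``QUANT_CONFIG`` would normally be exported in the shell before launching the run rather than set from Python.

.. code-block:: python

    import os

    from vllm import LLM, SamplingParams

    # QUANT_CONFIG must point to a MEASURE-mode INC JSON config file
    # (illustrative file name; see the Intel Gaudi documentation).
    os.environ["QUANT_CONFIG"] = "./inc_measure_config.json"

    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc")

    # Run representative prompts so INC can collect measurement statistics.
    prompts = ["Hello, my name is", "The capital of France is"]
    llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=32))

    # Flush the measurement files and release resources, as described above.
    llm.llm_engine.model_executor.shutdown_inc()
    llm.llm_engine.model_executor.shutdown()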
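A quantized inference run, again as a sketch under the same assumptions, differs only in the config mode and the extra ``LLM`` parameters covered above; the QUANTIZE-mode config file name and the prompt are illustrative, and the run is assumed to be a separate process from the measurement run. For online serving, the ``--kv-cache-dtype=fp8_inc`` and ``--weights-load-device`` flags mentioned above play the same role on the command line.

.. code-block:: python

    import os

    from vllm import LLM, SamplingParams

    # QUANT_CONFIG now points to a QUANTIZE-mode INC JSON config whose
    # dump_stats_path matches the folder holding the measurement files
    # (illustrative file name).
    os.environ["QUANT_CONFIG"] = "./inc_quant_config.json"

    llm = LLM(
        "llama3.1/Meta-Llama-3.1-8B-Instruct",
        quantization="inc",         # FP8 quantization via INC
        kv_cache_dtype="fp8_inc",   # FP8 KV cache
        weights_load_device="cpu",  # load unquantized weights on CPU first
    )

    outputs = llm.generate(["What is FP8?"], SamplingParams(temperature=0.0, max_tokens=64))
    print(outputs[0].outputs[0].text)

    # Scale files are generated next to the measurement files if missing.
    llm.llm_engine.model_executor.shutdown()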