fix CR comments
nirda7 committed Dec 18, 2024
1 parent 0f6909e commit 9f21a57
Showing 1 changed file with 22 additions and 24 deletions.
46 changes: 22 additions & 24 deletions docs/source/quantization/inc.rst
.. _INC:

FP8 INC
=======

vLLM supports FP8 (8-bit floating point) weight and activation quantization using INC (Intel Neural Compressor) on hardware acceleration of Intel Gaudi (HPU).
Currently, quantization is supported only for Llama models.

Please refer to the Intel Gaudi documentation page `Run Inference Using FP8 <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html>`_.

In order to run inference, measurement and scale files are required:

Obtain Measurements
-------------------

To obtain measurement files:

* Pass ``quantization=inc`` as a parameter to the ``LLM`` object.
* Call the ``shutdown_inc`` and ``shutdown`` methods of the ``model_executor`` at the end of the run.

.. code-block:: python

    ...
    llm.llm_engine.model_executor.shutdown_inc()
    llm.llm_engine.model_executor.shutdown()

.. note::

    Make sure to set the ``QUANT_CONFIG`` environment variable to point to the `JSON config file <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-json-config-file-options>`_ with MEASURE mode.
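
A fuller measurement-run sketch is shown below. It assumes the standard vLLM offline API (``SamplingParams`` and ``LLM.generate``); the config path and calibration prompts are placeholders, and ``QUANT_CONFIG`` can equally be exported in the shell before launching the script.

.. code-block:: python

    # Sketch of a measurement run (MEASURE mode).
    # The config path below is a placeholder; point QUANT_CONFIG at your own
    # JSON config file before the model is created.
    import os
    os.environ["QUANT_CONFIG"] = "./inc_measure_config.json"

    from vllm import LLM, SamplingParams

    # Placeholder calibration prompts; use data representative of your workload.
    calibration_prompts = [
        "The capital of France is",
        "Explain FP8 quantization in one sentence.",
    ]
    sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc")
    llm.generate(calibration_prompts, sampling_params)

    # Flush the collected measurements and shut down cleanly.
    llm.llm_engine.model_executor.shutdown_inc()
    llm.llm_engine.model_executor.shutdown()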

Run Inference Using FP8
-----------------------

Intel Gaudi supports quantization of various modules and functions, including, but not limited to, ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:

* `Supported Modules <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-modules>`_
* `Supported Functions <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-functions>`_

Running inference requires scale files, whose location is given by the ``dump_stats_path`` attribute of the `JSON config file <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-json-config-file-options>`_.
If none exist, they can be generated during the inference run from the measurement files (which should be located in the same folder).

To run inference (and obtain scale files):

* Pass ``quantization=inc`` as a parameter to the ``LLM`` object.
* Pass ``fp8_inc`` as the KV cache data type:

  * Offline inference: pass ``kv_cache_dtype=fp8_inc`` as a parameter to the ``LLM`` object.
  * Online inference: pass ``--kv-cache-dtype=fp8_inc`` as a command-line parameter.

* Call the ``shutdown`` method of the ``model_executor`` at the end of the run.

.. code-block:: python

    ...
    llm.llm_engine.model_executor.shutdown()

.. note::

    Make sure to set the ``QUANT_CONFIG`` environment variable to point to the `JSON config file <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-json-config-file-options>`_ with QUANTIZE mode.
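
A fuller FP8 inference sketch under the same assumptions is shown below; the config path is again a placeholder for a QUANTIZE-mode JSON config, and the scale files are generated from the measurements on the first run if they do not already exist.

.. code-block:: python

    # Sketch of an FP8 inference run (QUANTIZE mode).
    # The config path is a placeholder for your own JSON config file.
    import os
    os.environ["QUANT_CONFIG"] = "./inc_quant_config.json"

    from vllm import LLM, SamplingParams

    prompts = ["The future of AI accelerators is"]
    sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

    llm = LLM(
        "llama3.1/Meta-Llama-3.1-8B-Instruct",
        quantization="inc",
        kv_cache_dtype="fp8_inc",
    )
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.outputs[0].text)

    # Shut down the executor at the end of the run.
    llm.llm_engine.model_executor.shutdown()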

Specifying the Device for Loading the Model's Weights
------------------------------------------------------

It is possible to load the unquantized weights on a different device before quantizing them and moving them to the device on which the model will run.
This reduces the device memory footprint of the model weights, since only the quantized weights are stored in device memory.
To set the load device, use the ``weights_load_device`` parameter of the ``LLM`` object, or the ``--weights-load-device`` command-line parameter in online mode.

.. code-block:: python

    from vllm import LLM
    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc", weights_load_device="cpu")
