fix CR comments
nirda7 committed Dec 18, 2024
1 parent 0f6909e commit 9f21a57
Showing 1 changed file with 22 additions and 24 deletions.
46 changes: 22 additions & 24 deletions docs/source/quantization/inc.rst
.. _INC:

FP8 INC
=======

vLLM supports FP8 (8-bit floating point) weight and activation quantization using INC (Intel Neural Compressor) on hardware acceleration of Intel Gaudi (HPU).
Currently, quantization is supported only for Llama models.

Please refer to the Intel Gaudi documentation page `Run Inference Using FP8 <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html>`_.

In order to run inference, measurement and scale files are required:

Obtain Measurements
-------------------

To obtain measurement files:

* Pass ``quantization=inc`` as a parameter to the ``LLM`` object.
* Call the ``shutdown_inc`` and ``shutdown`` methods of the ``model_executor`` at the end of the run.

.. code-block:: python

    ...
    llm.llm_engine.model_executor.shutdown_inc()
    llm.llm_engine.model_executor.shutdown()

.. note::

    Make sure to set the ``QUANT_CONFIG`` environment variable to point to the `JSON config file <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-json-config-file-options>`_ with MEASURE mode.
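
A fuller measurement-run sketch is shown below. It assumes the standard vLLM offline API (``SamplingParams`` and ``LLM.generate``); the config path and calibration prompts are placeholders, and ``QUANT_CONFIG`` can equally be exported in the shell before launching the script.

.. code-block:: python

    # Sketch of a measurement run (MEASURE mode).
    # The config path below is a placeholder; point QUANT_CONFIG at your own
    # JSON config file before the model is created.
    import os
    os.environ["QUANT_CONFIG"] = "./inc_measure_config.json"

    from vllm import LLM, SamplingParams

    # Placeholder calibration prompts; use data representative of your workload.
    calibration_prompts = [
        "The capital of France is",
        "Explain FP8 quantization in one sentence.",
    ]
    sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc")
    llm.generate(calibration_prompts, sampling_params)

    # Flush the collected measurements and shut down cleanly.
    llm.llm_engine.model_executor.shutdown_inc()
    llm.llm_engine.model_executor.shutdown()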

Run Inference Using FP8
-----------------------

Intel Gaudi supports quantization of various modules and functions, including, but not limited to, ``Linear``, ``KVCache``, ``Matmul`` and ``Softmax``. For more information, please refer to:

* `Supported Modules <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-modules>`_
* `Supported Functions <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-functions>`_

Running inference requires scale files, whose location is given by the ``dump_stats_path`` attribute of the `JSON config file <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-json-config-file-options>`_.
If none exist, they can be generated during the inference run from the measurement files (which should be located in the same folder).

To run inference (and obtain scale files):

* Pass ``quantization=inc`` as a parameter to the ``LLM`` object.
* Pass ``fp8_inc`` as the KV cache data type:

  * Offline inference: pass ``kv_cache_dtype=fp8_inc`` as a parameter to the ``LLM`` object.
  * Online inference: pass ``--kv-cache-dtype=fp8_inc`` as a command-line parameter.

* Call the ``shutdown`` method of the ``model_executor`` at the end of the run.

.. code-block:: python

    ...
    llm.llm_engine.model_executor.shutdown()

.. note::

    Make sure to set the ``QUANT_CONFIG`` environment variable to point to the `JSON config file <https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html#supported-json-config-file-options>`_ with QUANTIZE mode.
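
A fuller FP8 inference sketch under the same assumptions is shown below; the config path is again a placeholder for a QUANTIZE-mode JSON config, and the scale files are generated from the measurements on the first run if they do not already exist.

.. code-block:: python

    # Sketch of an FP8 inference run (QUANTIZE mode).
    # The config path is a placeholder for your own JSON config file.
    import os
    os.environ["QUANT_CONFIG"] = "./inc_quant_config.json"

    from vllm import LLM, SamplingParams

    prompts = ["The future of AI accelerators is"]
    sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

    llm = LLM(
        "llama3.1/Meta-Llama-3.1-8B-Instruct",
        quantization="inc",
        kv_cache_dtype="fp8_inc",
    )
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.outputs[0].text)

    # Shut down the executor at the end of the run.
    llm.llm_engine.model_executor.shutdown()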

Specifying the Device for Loading the Model's Weights
------------------------------------------------------

It is possible to load the unquantized weights on a different device before quantizing them and moving them to the device on which the model will run.
This reduces the device memory footprint of the model weights, since only the quantized weights are stored in device memory.
To set the load device, use the ``weights_load_device`` parameter of the ``LLM`` object, or the ``--weights-load-device`` command-line parameter in online mode.

.. code-block:: python

    from vllm import LLM
    llm = LLM("llama3.1/Meta-Llama-3.1-8B-Instruct", quantization="inc", kv_cache_dtype="fp8_inc", weights_load_device="cpu")
