some more CR fixes
nirda7 committed Jan 2, 2025
1 parent 3763c65 commit 717131b
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions docs/source/quantization/inc.rst
@@ -27,10 +27,10 @@ Once you've completed the model calibration process and collected the measuremen
vllm serve meta-llama/Llama-3.1-405B-Instruct --quantization inc --kv-cache-dtype fp8_inc --weights-load-device cpu --tensor_parallel_size 8
.. tip::
- If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments, as it causes a dramatic performance drop.
+ If you are just prototyping or testing your model with FP8, you can use the ``VLLM_SKIP_WARMUP=true`` environment variable to disable the warmup stage, which can take a long time. However, we do not recommend disabling this feature in production environments as it causes a significant performance drop.

.. tip::
- When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use these two environment variables:
+ When using FP8 models, you may experience timeouts caused by the long compilation time of FP8 operations. To mitigate this problem, you can use the below environment variables:
``VLLM_ENGINE_ITERATION_TIMEOUT_S`` - to adjust the vLLM server timeout. You can set the value in seconds, e.g., 600 equals 10 minutes.
``VLLM_RPC_TIMEOUT`` - to adjust the RPC protocol timeout used by the OpenAI-compatible API. This value is in milliseconds, e.g., 600000 equals 10 minutes.
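
As a rough illustration of the unit difference between the two variables, the sketch below derives both values from a single duration in minutes. The helper name ``fp8_timeout_env`` is hypothetical and not part of vLLM, and the code assumes ``VLLM_RPC_TIMEOUT`` is expressed in milliseconds, which is the only unit consistent with the "600000 equals 10 minutes" example.

```python
import os


def fp8_timeout_env(minutes: float) -> dict:
    """Build values for the two timeout variables from one duration.

    Hypothetical helper for illustration only (not a vLLM API).
    Assumption: VLLM_RPC_TIMEOUT is in milliseconds.
    """
    seconds = int(minutes * 60)
    return {
        "VLLM_ENGINE_ITERATION_TIMEOUT_S": str(seconds),  # seconds
        "VLLM_RPC_TIMEOUT": str(seconds * 1000),  # milliseconds
    }


timeouts = fp8_timeout_env(10)
# Export the values into the environment that will launch vLLM.
os.environ.update(timeouts)
print(timeouts)
```

The same values can equally be exported in the shell before running ``vllm serve``; the point is only that one variable takes seconds and the other a value 1000 times larger.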

@@ -56,7 +56,7 @@ Specifying Device for the Model's Weights Uploading

It is possible to load the unquantized weights on a different device before quantizing them, and then move them to the device on which the model will run.
This reduces the device memory footprint of model weights, as only quantized weights are stored in device memory.
- To set the load device, use the ``weights_load_device`` parameter for the ``LLM`` object, or ``--weights-load-device`` command line parameter in online mode.
+ To set the device to upload weights, use the ``weights_load_device`` parameter for the ``LLM`` object, or ``--weights-load-device`` command line parameter when running online inference:

.. code-block:: python
