diff --git a/ROCm_performance.md b/ROCm_performance.md
index bea77d1a27fc4..1c47a818ec852 100644
--- a/ROCm_performance.md
+++ b/ROCm_performance.md
@@ -21,9 +21,39 @@ The custom PagedAttention kernel is enabled for dtype: fp16, block-size=16, head
 
 ## Fp8 Quantization
 
-To use fp8 quantization, first step is to quantize your model to fp8 format. Please follow this [instruction](https://github.com/ROCm/vllm/tree/main/examples/fp8/quantizer) to generating a safetensor file that contains the quantized weights and the corresponding scaling factors of your model. The safetensor file should be placed under your model folder.
+To use fp8 quantization, the first step is to quantize your model to the fp8 format.
 
-Then we can run a model with fp8 quantization using vllm. When creating `vllm.LLM` object, two additional parameters should be added: `quantization="fp8"` and `quantization_param_path={relative path of the safetensors with your model path}`.
+By default, rocm-vllm accepts the quantized weights generated by the Quark quantizer. To use it, install Quark and run the following command:
+
+```
+python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
+                          --output_dir output_dir \
+                          --quant_scheme w_fp8_a_fp8_o_fp8 \
+                          --num_calib_data 128 \
+                          --export_safetensors \
+                          --no_weight_matrix_merge
+```
+For more details, please refer to Quark's documentation.
+
+To use AMMO instead, please follow these [instructions](https://github.com/ROCm/vllm/tree/main/examples/fp8/quantizer) and set `VLLM_FP8_USE_AMMO=1`.
+
+Both quantizers generate a safetensors file that contains the quantized weights and the corresponding scaling factors of your model; place this file under your model folder. Then we can run the model with fp8 quantization using vllm: when creating the `vllm.LLM` object, pass two additional parameters, `quantization="fp8"` and `quantization_param_path={path of the safetensors file, relative to your model folder}`.
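+
+For reference, here is a minimal sketch of this second step (the checkpoint folder and safetensors filename below are placeholders; substitute your own):
+
+```python
+from vllm import LLM
+
+# Placeholder paths: point `model` at the quantized checkpoint folder and
+# `quantization_param_path` at the safetensors file inside it.
+llm = LLM(
+    model="/models/llama-2-7b",
+    quantization="fp8",
+    quantization_param_path="llama-2-7b-fp8.safetensors",
+)
+outputs = llm.generate("The capital of France is")
+print(outputs[0].outputs[0].text)
+```
 
 ## Gemm Tuning for Fp8
 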