InternLM · lvhan028 · Oct 13, 2023 · Sep 27, 2023 · Sep 27, 2023 · Sep 28, 2023
diff --git a/docs/en/w4a16.md b/docs/en/w4a16.md
@@ -62,10 +62,14 @@ Memory (GB) comparison results between 4-bit and 16-bit model with context size
 | Llama-2-7B-chat  | 15.1        | 6.3        | 16.2        | 7.5        |
 | Llama-2-13B-chat | OOM         | 10.3       | OOM         | 12.0       |
 
+```shell
+pip install nvidia-ml-py
+```
+
 ```shell
 python benchmark/profile_generation.py \
-  ./workspace \
-  --concurrency 1 --input_seqlen 1 --output_seqlen 512
+  --model-path ./workspace \
+  --concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
 ```
 
 ## 4-bit Weight Quantization
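
Note on the added `pip install nvidia-ml-py` step: the package provides the `pynvml` module, which is presumably what the benchmark uses to read GPU memory for the table above (an assumption; the diff itself does not say so). A minimal sketch of such a reading follows. The helper names `bytes_to_gb` and `gpu_memory_used_gb` are hypothetical and not part of lmdeploy, and treating 1 GB as 1024**3 bytes is also an assumption.

```python
def bytes_to_gb(n_bytes: int) -> float:
    """Convert a byte count to GB (assumed here to mean 1024**3 bytes), rounded to 0.1."""
    return round(n_bytes / 1024**3, 1)


def gpu_memory_used_gb(device_index: int = 0) -> float:
    """Return the memory currently in use on one GPU, in GB.

    Requires an NVIDIA driver and the nvidia-ml-py package.
    """
    import pynvml  # imported lazily so bytes_to_gb works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # fields are in bytes
        return bytes_to_gb(info.used)
    finally:
        pynvml.nvmlShutdown()
```

Calling `gpu_memory_used_gb()` before and after loading a model gives a rough view of its footprint; the figure covers everything running on the device, not only the benchmark process.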

diff --git a/docs/zh_cn/w4a16.md b/docs/zh_cn/w4a16.md
@@ -60,10 +60,14 @@ python3 -m lmdeploy.serve.turbomind ./workspace --server_name {ip_addr} ----serv
 | Llama-2-7B-chat  | 15.1        | 6.3        | 16.2        | 7.5        |
 | Llama-2-13B-chat | OOM         | 10.3       | OOM         | 12.0       |
 
+```shell
+pip install nvidia-ml-py
+```
+
 ```shell
 python benchmark/profile_generation.py \
-  ./workspace \
-  --concurrency 1 --input_seqlen 1 --output_seqlen 512
+  --model-path ./workspace \
+  --concurrency 1 8 --prompt-tokens 1 512 --completion-tokens 2048 512
 ```
 
 ## 4bit 权重量化