
Commit

Merge pull request #6 in wm_ai/autosmoothquant from model_support2 to master

fix baichuan 7B
zhangpeng156 committed Feb 1, 2024
2 parents 735f96c + 23f57a8 commit 046359a
Showing 9 changed files with 275 additions and 229 deletions.
README.md: 42 changes (24 additions & 18 deletions)
@@ -18,27 +18,19 @@ pip install -e .
## Usage
### quantize model
First add a config file named "quant_config.json" to model path.
-For Baichuan or Llama model, config should be like:
+For currently supported models, config should be like:

```json
{
"qkv_proj": "per-tensor",
"o_proj": "per-tensor",
"gate_up_proj": "per-tensor",
"down_proj": "per-tensor"
}
```

As for Opt model, config should be like:

```json
{
"qkv_proj": "per-tensor",
"o_proj": "per-tensor",
"qkv": "per-tensor",
"out": "per-tensor",
"fc1": "per-tensor",
"fc2": "per-tensor"
}
```

"qkv" stands for QKV matmul of attention, "out" stands for out matmul of attention.
"fc1" and "fc2" are the layers of the FFNs, which might be referred to as "gate_up" and "down" in Llama-like models.
You can set each value to "per-tensor" or "per-token" to select the quantization granularity you want.
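
To make the two granularities concrete, here is a minimal PyTorch sketch (illustration only, not AutoSmoothQuant's actual kernels): "per-tensor" shares a single int8 scale across the whole activation tensor, while "per-token" computes one scale per token row.

```python
# Illustrative sketch of the two quantization granularities; not the repo's code.
import torch

def quantize_per_tensor(x: torch.Tensor):
    # One scale shared by the entire tensor.
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def quantize_per_token(x: torch.Tensor):
    # One scale per token (row), shape [num_tokens, 1].
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 8)                  # [tokens, hidden]
_, s_tensor = quantize_per_tensor(x)
_, s_token = quantize_per_token(x)
print(s_tensor.shape, s_token.shape)   # torch.Size([]) vs. torch.Size([4, 1])
```

Per-token scales generally track activation outliers more tightly at the cost of storing one scale per row; per-tensor is simpler and cheaper at inference time.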

Once the config is set, generate scales and quantize the model with the following command:
@@ -72,10 +64,24 @@ Model support list:
| ---------| ----------------------------|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
-| Mistral | Soon |
-| OPT | 6.7B/13B/30B |
-| Baichuan-2 | 13B (7B Soon) |
-| Baichuan | 13B (7B Soon) |
+| Mixtral | 8*7B |
+| OPT | 6.7B/13B/30B |
+| Baichuan-2 | 7B/13B |
+| Baichuan | 7B/13B |

## Performance and inference efficiency
Detailed data coming soon.

Cases:

[codellama-13b with A40](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1824133140). Tested with vLLM

[llama-13b with A100](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1853826414). Tested with vLLM






## Reference
If you find SmoothQuant useful or relevant to your research, please cite their paper:
autosmoothquant/examples/smoothquant_model.py: 4 changes (2 additions & 2 deletions)
@@ -28,7 +28,7 @@ def parse_args():
help='where to save the act scales; used when generating scales')
parser.add_argument("--scale-input", type=str, default='scales/llama-13b',
help='where to load the act scales from; used when quantizing models')
-parser.add_argument('--num-samples', type=int, default=4)
+parser.add_argument('--num-samples', type=int, default=512)
parser.add_argument('--seq-len', type=int, default=512)
parser.add_argument("--model-output", type=str, default='quantized_model/llama-13b',
help='where to save the quantized models; used when quantizing models')
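
For context, --num-samples and --seq-len size the calibration set used when generating activation scales. As a rough, hypothetical sketch of such a calibration pass (the script's actual implementation may differ), one can hook every nn.Linear and record the per-channel absolute maximum of its inputs over the calibration batches:

```python
# Hypothetical calibration sketch; names and details are illustrative only.
import torch
from torch import nn

@torch.no_grad()
def collect_act_scales(model: nn.Module, batches, num_samples: int = 512):
    """Record the running per-channel abs-max of every nn.Linear input."""
    scales, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().float().abs()
            x = x.reshape(-1, x.shape[-1])   # flatten to [tokens, in_features]
            cur = x.amax(dim=0)              # per-input-channel max
            scales[name] = torch.maximum(scales[name], cur) if name in scales else cur
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    for i, batch in enumerate(batches):      # batches: model-ready input_ids tensors
        if i >= num_samples:
            break
        model(batch)

    for h in hooks:
        h.remove()
    return scales
```

In SmoothQuant-style pipelines, these per-channel maxima are what the smoothing factors and int8 scales are later derived from.
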
@@ -114,4 +114,4 @@ def main():
int8_model.save_pretrained(output_path)

if __name__ == '__main__':
main()
