
Commit

Merge pull request #6 in wm_ai/autosmoothquant from model_support2 to master

fix baichuan 7B
zhangpeng156 committed Feb 1, 2024
2 parents 735f96c + 23f57a8 commit 046359a
Showing 9 changed files with 275 additions and 229 deletions.
README.md: 42 changes (24 additions & 18 deletions)
@@ -18,27 +18,19 @@ pip install -e .
## Usage
### quantize model
First add a config file named "quant_config.json" to model path.
-For Baichuan or Llama model, config should be like:
+For currently supported models, config should be like:

```json
{
"qkv_proj": "per-tensor",
"o_proj": "per-tensor",
"gate_up_proj": "per-tensor",
"down_proj": "per-tensor"
}
```

As for Opt model, config should be like:

```json
{
"qkv_proj": "per-tensor",
"o_proj": "per-tensor",
"qkv": "per-tensor",
"out": "per-tensor",
"fc1": "per-tensor",
"fc2": "per-tensor"
}
```

"qkv" stands for QKV matmul of attention, "out" stands for out matmul of attention.
"fc1" and "fc2" are the layers of the FFNs, which might be referred to as "gate_up" and "down" in Llama-like models.
You can set each value to "per-tensor" or "per-token" to select the quantization granularity you want.
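
To make the two granularities concrete, here is a minimal PyTorch sketch (illustration only, not AutoSmoothQuant's actual kernels): "per-tensor" shares a single int8 scale across the whole activation tensor, while "per-token" computes one scale per token row.

```python
# Illustrative sketch of the two quantization granularities; not the repo's code.
import torch

def quantize_per_tensor(x: torch.Tensor):
    # One scale shared by the entire tensor.
    scale = x.abs().max() / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def quantize_per_token(x: torch.Tensor):
    # One scale per token (row), shape [num_tokens, 1].
    scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 8)                  # [tokens, hidden]
_, s_tensor = quantize_per_tensor(x)
_, s_token = quantize_per_token(x)
print(s_tensor.shape, s_token.shape)   # torch.Size([]) vs. torch.Size([4, 1])
```

Per-token scales generally track activation outliers more tightly at the cost of storing one scale per row; per-tensor is simpler and cheaper at inference time.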

Once the config is set, generate scales and quantize the model with the following command:
@@ -72,10 +64,24 @@ Model support list:
| ---------| ----------------------------|
| LLaMA-2 | 7B/13B/70B |
| LLaMA | 7B/13B/30B/65B |
-| Mistral | Soon |
-| OPT | 6.7B/13B/30B |
-| Baichuan-2 | 13B (7B Soon) |
-| Baichuan | 13B (7B Soon) |
+| Mixtral | 8*7B |
+| OPT | 6.7B/13B/30B |
+| Baichuan-2 | 7B/13B |
+| Baichuan | 7B/13B |

## Performance and inference efficiency
Detailed data coming soon.

Cases:

[codellama-13b with A40](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1824133140). Tested with vLLM

[llama-13b with A100](https://github.com/vllm-project/vllm/pull/1508#issuecomment-1853826414). Tested with vLLM






## Reference
If you find SmoothQuant useful or relevant to your research, please cite their paper:
autosmoothquant/examples/smoothquant_model.py: 4 changes (2 additions & 2 deletions)
@@ -28,7 +28,7 @@ def parse_args():
help='where to save the act scales; used when generating scales')
parser.add_argument("--scale-input", type=str, default='scales/llama-13b',
help='where to load the act scales from; used when quantizing models')
-parser.add_argument('--num-samples', type=int, default=4)
+parser.add_argument('--num-samples', type=int, default=512)
parser.add_argument('--seq-len', type=int, default=512)
parser.add_argument("--model-output", type=str, default='quantized_model/llama-13b',
help='where to save the quantized models; used when quantizing models')
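
For context, --num-samples and --seq-len size the calibration set used when generating activation scales. As a rough, hypothetical sketch of such a calibration pass (the script's actual implementation may differ), one can hook every nn.Linear and record the per-channel absolute maximum of its inputs over the calibration batches:

```python
# Hypothetical calibration sketch; names and details are illustrative only.
import torch
from torch import nn

@torch.no_grad()
def collect_act_scales(model: nn.Module, batches, num_samples: int = 512):
    """Record the running per-channel abs-max of every nn.Linear input."""
    scales, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().float().abs()
            x = x.reshape(-1, x.shape[-1])   # flatten to [tokens, in_features]
            cur = x.amax(dim=0)              # per-input-channel max
            scales[name] = torch.maximum(scales[name], cur) if name in scales else cur
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    for i, batch in enumerate(batches):      # batches: model-ready input_ids tensors
        if i >= num_samples:
            break
        model(batch)

    for h in hooks:
        h.remove()
    return scales
```

In SmoothQuant-style pipelines, these per-channel maxima are what the smoothing factors and int8 scales are later derived from.
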
@@ -114,4 +114,4 @@ def main():
int8_model.save_pretrained(output_path)

if __name__ == '__main__':
main()
