This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Commit: update README (#165)
intellinjun authored Mar 29, 2024
1 parent b7babc7 commit 79c3537
Showing 4 changed files with 10 additions and 13 deletions.
13 changes: 5 additions & 8 deletions README.md
@@ -79,21 +79,18 @@ streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file = model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```

### PyTorch Model from Modelscope
```python
-import sys
-from modelscope import AutoTokenizer
-from transformers import TextStreamer
-from neural_speed import Model
-
-model_name = "qwen/Qwen1.5-7B-Chat" # modelscope model_id or local model
+from modelscope import AutoTokenizer
+from intel_extension_for_transformers.transformers import AutoModelForCausalLM
+model_name = "qwen/Qwen-7B" # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

+model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
-model = Model()
-model.init(model_name, weight_dtype="int4", compute_dtype="int8", model_hub="modelscope")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```
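Read together, the added (+) lines above describe an updated Modelscope example roughly like the sketch below; the `TextStreamer` import is an assumption here, since the visible hunk does not show where that import sits in the updated file.

```python
from modelscope import AutoTokenizer
from transformers import TextStreamer  # assumed import; not visible in the hunk above
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "qwen/Qwen-7B"  # Modelscope model_id or local model
prompt = "Once upon a time, there existed a little girl,"

# load_in_4bit requests 4-bit weight-only quantization; model_hub="modelscope" pulls the model from Modelscope
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, model_hub="modelscope")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
```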

4 changes: 2 additions & 2 deletions docs/advanced_usage.md
@@ -22,7 +22,7 @@ Argument description of run.py ([supported MatMul combinations](#supported-matri
| --repeat_penalty | Penalize repeat sequence of tokens: Float (default: 1.1, 1.0 = disabled) |
| --color | Colorise output to distinguish prompt and user input from generations |
| --keep | Number of tokens to keep from the initial prompt: Int (default: 0, -1 = all) |
-| --shift-roped-k | Use a [ring-buffer](./docs/infinite_inference.md#shift-rope-k-and-ring-buffer) and thus avoid re-computation after reaching ctx_size (default: False) |
+| --shift-roped-k | Use a [ring-buffer](./infinite_inference.md#shift-rope-k-and-ring-buffer) and thus avoid re-computation after reaching ctx_size (default: False) |
| --token | Access token ID for models that require it (e.g. LLaMa2) |


@@ -119,6 +119,6 @@ Argument description of inference.py:
| --repeat_penalty | Penalize repeat sequence of tokens: Float (default: 1.1, 1.0 = disabled) |
| --color | Colorise output to distinguish prompt and user input from generations |
| --keep | Number of tokens to keep from the initial prompt: Int (default: 0, -1 = all) |
-| --shift-roped-k | Use a [ring-buffer](./docs/infinite_inference.md#shift-rope-k-and-ring-buffer) and thus avoid re-computation after reaching ctx_size (default: False) |
+| --shift-roped-k | Use a [ring-buffer](./infinite_inference.md#shift-rope-k-and-ring-buffer) and thus avoid re-computation after reaching ctx_size (default: False) |
| --glm_tokenizer | The path of the chatglm tokenizer: String (default: THUDM/chatglm-6b) |
| --memory-f32 <br> --memory-f16 <br> --memory-auto | Data type of kv memory (defaults to auto);<br>if set to auto, the runtime will try the bestla flash-attn managed format (currently requires GCC11+ & AMX) and fall back to fp16 if that fails |
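For intuition about the `--shift-roped-k` option documented above, here is a toy Python sketch of the ring-buffer idea: once `ctx_size` positions are filled, new K/V entries overwrite the oldest slots rather than forcing a re-computation. This is an illustration only, not neural_speed's C++ implementation; as the flag name and the linked doc suggest, the real mechanism also involves handling the RoPE applied to cached keys.

```python
import numpy as np

class ToyKVRingBuffer:
    """Toy ring-buffer KV cache; an illustration of the idea only, not neural_speed's implementation."""

    def __init__(self, ctx_size: int, head_dim: int) -> None:
        self.ctx_size = ctx_size
        self.k = np.zeros((ctx_size, head_dim), dtype=np.float32)
        self.v = np.zeros((ctx_size, head_dim), dtype=np.float32)
        self.n_past = 0  # total tokens seen so far; may exceed ctx_size

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        # Each new token's K/V goes into slot (position mod ctx_size); once the buffer
        # is full, the oldest entries are overwritten instead of re-evaluating the prompt.
        for i in range(k_new.shape[0]):
            slot = (self.n_past + i) % self.ctx_size
            self.k[slot] = k_new[i]
            self.v[slot] = v_new[i]
        self.n_past += k_new.shape[0]


cache = ToyKVRingBuffer(ctx_size=8, head_dim=4)
keys = np.random.rand(10, 4).astype(np.float32)
values = np.random.rand(10, 4).astype(np.float32)
cache.append(keys, values)  # 10 tokens wrap around the 8-slot buffer; n_past == 10
```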
4 changes: 2 additions & 2 deletions docs/fused_attention.md
@@ -3,10 +3,10 @@ Fused Attention

Attention (including MHA, GQA, and MQA) is one of the key parts of transformers and is also performance-critical in many scenarios. To implement various optimizations, a fused attention layer and corresponding utilities for the customized KV-cache it uses are introduced. As an example, fused attention can reduce the cost of MHA from 8521.276 ms (17.044 ms) to 248.586 ms (7.944 ms) for a 1975-token llama-7b first-token inference[[1]](#1).

-Note that this doc assumes basic knowledge of this cpp graph implementation, including `ne_tensor::ne`, `ne_tensor::nb`, etc. The model builder can enable fused attention with the `ne_flash_attn*` operators defined in [`core/ne_layers.h`](core/ne_layers.h), while the implementation of fused attention is mostly located in [`core/layers/mha_dense.cpp`](core/layers/mha_dense.cpp).
+Note that this doc assumes basic knowledge of this cpp graph implementation, including `ne_tensor::ne`, `ne_tensor::nb`, etc. The model builder can enable fused attention with the `ne_flash_attn*` operators defined in [`core/ne_layers.h`](../neural_speed/core/ne_layers.h), while the implementation of fused attention is mostly located in [`core/layers/mha_dense.cpp`](../neural_speed/core/layers/mha_dense.cpp).

## KV-cache initialization
-The memory for the KV-cache is allocated in `/models/model_utils/model_utils.cpp`. As the fused attention implementation requires certain instruction extensions and potentially has some other limitations, fused attention is enabled only if `bestla_reordered_attn_fp32_support()` agrees, denoted by `memory_type = NE_TYPE_BTLA`. Next, `get_batch_kv_elements_from_gpt_params()` gives the sizes (in bytes if fused attention is enabled, or in elements if it is disabled) of the k-cache and v-cache respectively, for each batch and each layer. The KV-cache is finally prepared with these two sizes by creating an `ne_new_tensor` inside `model.kv_self.k/.v` or `model.layer[il].k_cache/.v_cache`.
+The memory for the KV-cache is allocated in `neural_speed/models/model_utils/model_utils.cpp`. As the fused attention implementation requires certain instruction extensions and potentially has some other limitations, fused attention is enabled only if `bestla_reordered_attn_fp32_support()` agrees, denoted by `memory_type = NE_TYPE_BTLA`. Next, `get_batch_kv_elements_from_gpt_params()` gives the sizes (in bytes if fused attention is enabled, or in elements if it is disabled) of the k-cache and v-cache respectively, for each batch and each layer. The KV-cache is finally prepared with these two sizes by creating an `ne_new_tensor` inside `model.kv_self.k/.v` or `model.layer[il].k_cache/.v_cache`.

## KV-cache Append
The KV-cache is appended every time a new pair of K and V tensors is generated by evaluating the inner product for QKV (`ne_mul_qkv`). This operation appends an additional K/V-tensor along the sequence-length dimension (i.e., resulting in `n_past = n_past + N`, where `n_past` is the number of previously cached tokens and `N` is the length of the current tokens).
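As a minimal illustration of the bookkeeping described above (shapes only; the real cache is laid out per batch, per layer, and per head, and may use the bestla-managed format), appending along the sequence dimension grows `n_past` by `N`:

```python
import numpy as np

head_dim = 4
k_cache = np.zeros((0, head_dim), dtype=np.float32)  # empty cache: n_past == 0

def append_kv(cache: np.ndarray, k_new: np.ndarray) -> np.ndarray:
    # Appending N freshly computed K rows grows the sequence-length dimension,
    # so n_past becomes n_past + N.
    return np.concatenate([cache, k_new], axis=0)

k_cache = append_kv(k_cache, np.random.rand(7, head_dim).astype(np.float32))  # 7 prompt tokens
k_cache = append_kv(k_cache, np.random.rand(1, head_dim).astype(np.float32))  # 1 generated token
n_past = k_cache.shape[0]  # == 8
```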
2 changes: 1 addition & 1 deletion docs/gguf.md
@@ -5,7 +5,7 @@ Neural Speed also supports GGUF models generated by [llama.cpp](https://github.c

Validated models: [llama2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), [falcon-7b](https://huggingface.co/tiiuae/falcon-7b), [falcon-40b](https://huggingface.co/tiiuae/falcon-40b), [mpt-7b](https://huggingface.co/mosaicml/mpt-7b), [mpt-40b](https://huggingface.co/mosaicml/mpt-40b) and [bloom-7b1](https://huggingface.co/bigscience/bloomz-7b1).

-Please check more validated GGUF models from HuggingFace in the [list](./docs/supported_models.md).
+Please check more validated GGUF models from HuggingFace in the [list](./supported_models.md).

## Examples
