
Commit

add performance tuning docs
hodlen committed Jan 11, 2024
1 parent 72e8511 commit 4b2a4f4
Showing 2 changed files with 67 additions and 22 deletions.
3 changes: 3 additions & 0 deletions README.md
@@ -194,6 +194,9 @@ We also evaluated PowerInfer on a single RTX 2080Ti(11G) with INT4 ReLU models u

Please refer to our [paper](https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf) for more evaluation details.

## Docs
- [Performance troubleshooting](./docs/token_generation_performance_tips.md)

## FAQs
1. What if I encounter `CUDA_ERROR_OUT_OF_MEMORY`?
   - You can try to run with the `--reset-gpu-index` argument to rebuild the GPU index for this model and avoid any stale cache.
86 changes: 64 additions & 22 deletions docs/token_generation_performance_tips.md
@@ -1,40 +1,82 @@
# Token generation performance troubleshooting

## Verifying that the model is running on the Nvidia GPU with cuBLAS

Make sure to set `-DLLAMA_CUBLAS=ON` when configuring CMake, as described in the [README](../README.md#build), and purge any previous `build` directory before reconfiguring and recompiling.
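
For reference, a clean reconfigure and rebuild typically looks like the following sketch (paths and flags are illustrative; treat the [README](../README.md#build) as the authoritative build guide):

```shell
# Drop the stale build directory so the old configuration is not reused
rm -rf build
# Reconfigure with cuBLAS enabled, then rebuild the release binaries
cmake -S . -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
```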

When PowerInfer utilizes the GPU, it outputs diagnostic information showing whether cuBLAS is offloading work to the GPU. Look for these lines:

```shell
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: mem required = 16825.94 MB
llm_load_sparse_model_tensors: VRAM used: 10183.80 MB
```

If you see these lines, then the GPU is being used and model tensors are being loaded into VRAM.
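
You can also cross-check from a second terminal with `nvidia-smi` (assuming the NVIDIA driver utilities are installed); the `main` process should hold roughly the amount of VRAM reported above:

```shell
# Refresh GPU memory usage once per second while PowerInfer is running
watch -n 1 nvidia-smi
```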

## Verifying that FFN split is working

Ideally, PowerInfer should be able to utilize the full GPU memory, or the VRAM budget you set. It first tries to offload dense layers (attention, predictor, etc.) to VRAM, then splits hot neurons of the FFN into VRAM if there is still space left.
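
The VRAM budget is set on the command line. A minimal sketch, assuming a `--vram-budget` flag with its value in GiB and a placeholder model path; check `./build/bin/main --help` for the exact option name on your build:

```shell
# Cap VRAM usage at roughly 8 GiB and let PowerInfer place dense layers and hot neurons within that budget
./build/bin/main -m path/to/llama-13b-relu.powerinfer.gguf --vram-budget 8 -t 8 -p "Once upon a time"
```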

You can look for this line to see how much of the FFN has been split and offloaded:

```shell
llm_load_gpu_split: offloaded 12577.50 MiB of FFN weights to GPU
```

If you find that the VRAM usage is much lower than expected, then the FFN split is likely not working. Splitting the FFN requires solving the neuron placement with the `powerinfer` Python module and then loading the generated GPU index file, as shown in the following logs.

Solving (the result is cached, so this only happens once):
```shell
invoking powerinfer Python module to generate gpu split for 12738.39 MiB of VRAM
solver args: Namespace(activation='/nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/activation', neuron=13824, capacity=429432, layer=40, vram_capacity=13357166592, batch=256, threshold=0, output='/nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx')
...
exported GPU index to /nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx
```

Loading generated or cached GPU index:
```shell
llama_model_loader: loaded meta data with 3 key-value pairs and 80 tensors from /nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor 0: blk.0.gpu_idx i32 [ 13824, 1, 1, 1 ]
...
apply_tensors_to_base_model: applying gpu_idx adapter from '/nvme2/huggingface/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF/llama-13b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
```

If you don't see any of these lines, then the FFN split is not working. This can be caused by:

- The `powerinfer` Python module is not installed, or the virtual environment it lives in is not activated
- The model directory does not contain an `activation` directory with the activation files needed for solving the FFN split

Please refer to [Setup and Installation](../README.md#setup-and-installation) for more information on runtime dependencies and [Model Weights](../README.md#model-weights) for more information on model weights.
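
A quick way to rule out both causes is sketched below (placeholder paths; run it in the same Python environment you use for PowerInfer's helper scripts):

```shell
# Confirm the powerinfer module is importable from the current environment
python -c "import powerinfer; print('powerinfer module found')"
# Confirm the model directory ships the activation files needed by the solver
ls path/to/ReluLLaMA-13B-PowerInfer-GGUF/activation
```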

# Example of the effect of runtime flags on inference speed

Please refer to [Evaluation](../README.md#evaluation) for more information on the token generation benchmark on Linux.
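
For a quick local measurement on Linux, a minimal run looks like the following sketch (placeholder model path; `-n` sets the number of generated tokens and `-t` the thread count, and a timing summary is printed at the end of the run):

```shell
# Generate 128 tokens with 8 threads and read the timing summary printed after generation
./build/bin/main -m path/to/llama-13b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
```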

For Windows, we have tested [PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF](https://huggingface.co/PowerInfer/ReluLLaMA-13B-PowerInfer-GGUF) on a machine with the following specs:

GPU: Nvidia RTX 2080Ti (11GB VRAM)
CPU: Intel i7-13700
RAM: 64GB DDR4 3200MHz

Run command: `.\build\bin\Release\main.exe -m path\to\model -n 64 -p "Once upon a time" [additional benchmark flags]`

Result:

| command | tokens/second (higher is better) |
| - | - |
| [no additional flags] | 4.05 |
| -t 8 | 4.27 |

# CPU affinity and hybrid architecture (big.LITTLE)

PowerInfer achieves the best performance when it runs on all CPU performance cores (P cores). On a hybrid architecture such as Intel 12th/13th Gen (Alder Lake), we recommend setting `-t` (`--threads`) to the number of available P cores.

Windows sometimes fails to schedule threads on P cores. If you find that token generation speed is unstable, or P-core utilization is low, you can try setting the CPU affinity manually with `Start-Process` in PowerShell, as in this example for a 12th Gen Core i7 (8 P cores):

```ps
Start-Process -FilePath path\to\main.exe -ArgumentList "-m", "path\to\model", "-t", "8", "-n", "128", "-p", "`"Once upon a time`"" -NoNewWindow -PassThru -Wait | ForEach-Object { $_.ProcessorAffinity = 0x5555 }
```

This works like `taskset` on Linux and restricts CPU affinity to P cores only (0x5555 is a bitmask for CPUs 0, 2, 4, 6, 8, 10, 12, and 14). Please refer to the docs of [Start-Process](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.management/start-process?view=powershell-7.4) for more details.
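
On Linux, a comparable pin with `taskset` might look like the sketch below (logical CPU numbering is machine-specific, so check `lscpu --extended` to see which logical CPUs belong to P cores):

```shell
# Pin to one logical CPU per P core (mirroring the 0x5555 mask above) and run with 8 threads
taskset -c 0,2,4,6,8,10,12,14 ./build/bin/main -m path/to/model -t 8 -n 128 -p "Once upon a time"
```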
