Update the llama.cpp integration
rlouf committed Apr 10, 2024
1 parent 121a25c commit aacc633
Showing 17 changed files with 740 additions and 379 deletions.
3 changes: 3 additions & 0 deletions docs/reference/models/exllamav2.md
# ExllamaV2

*Coming soon*
212 changes: 197 additions & 15 deletions docs/reference/models/llamacpp.md
# Llama.cpp

Outlines provides an integration with [Llama.cpp](https://github.com/ggerganov/llama.cpp) using the [llama-cpp-python library][llamacpp]. Llamacpp allows you to run quantized models on machines with limited compute.

!!! Note "Installation"

    You need to install the `llama-cpp-python` library to use the llama.cpp integration. See the [installation section](#installation) for instructions to install `llama-cpp-python` with CUDA, Metal, ROCm and other backends.

## Load the model

You can initialize the model by passing the name of the repository on the HuggingFace Hub, and the filenames (or glob pattern):

```python
from outlines import models

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
```

This will download the model files to the hub cache folder and load the weights in memory.

You can also initialize the model by passing the path to the weights on your machine. Assuming [Phi2's weights](https://huggingface.co/TheBloke/phi-2-GGUF) are in the current directory:

```python
from outlines import models
from llama_cpp import Llama

model = models.llamacpp("./phi-2.Q4_K_M.gguf", device="cuda")
llm = Llama("./phi-2.Q4_K_M.gguf")
model = models.llamacpp(llm)
```

If you need more control, you can pass the same keyword arguments to the model as you would pass in the [llama-cpp-python library][llamacpp]:
from outlines import models

model = models.llamacpp(
"./phi-2.Q4_K_M.gguf",
n_gpu_layers=-1, # to use GPU acceleration
seed=1337, # to set a specific seed
"TheBloke/phi-2-GGUF",
"phi-2.Q4_K_M.gguf"
n_ctx=512, # to set the context length value
)
```

**Main parameters:**

| Parameters | Type | Description | Default |
|------------|------|-------------|---------|
| `n_gpu_layers`| `int` | Number of layers to offload to GPU. If -1, all layers are offloaded | `0` |
| `split_mode` | `int` | How to split the model across GPUs. `1` for layer-wise split, `2` for row-wise split | `1` |
| `main_gpu` | `int` | Main GPU | `0` |
| `tensor_split` | `Optional[List[float]]` | How split tensors should be distributed across GPUs. If `None` the model is not split. | `None` |
| `n_ctx` | `int` | Text context length. Inferred from the model if set to `0`. | `0` |
| `n_threads` | `Optional[int]` | Number of threads to use for generation. All available threads if set to `None`.| `None` |
| `verbose` | `bool` | Print verbose outputs to `stderr` | `False` |

See the [llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for the full list of parameters.

### Load the model on GPU

!!! Note

    [Make sure](#cuda) that you installed `llama-cpp-python` with GPU support.

To load the model on GPU, pass `n_gpu_layers=-1`:

```python
from outlines import models

model = models.llamacpp(
    "TheBloke/phi-2-GGUF",
    "phi-2.Q4_K_M.gguf",
    n_gpu_layers=-1,  # to use GPU acceleration
)
```

This also works with generators built with `generate.regex`, `generate.json`, `generate.cfg`, `generate.format` and `generate.choice`.
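
As an illustration, here is a minimal sketch of structured generation with `generate.choice`; the prompt and candidate labels are made up for this example:

```python
from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")

# Constrain the model to answer with one of the two labels.
generator = generate.choice(model, ["Positive", "Negative"])
answer = generator("Sentiment of 'This model runs fast even on my laptop':")
```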

### Load LoRA adapters

You can load LoRA adapters dynamically:

```python
from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)
answer_1 = generator("prompt")

model.load_lora("./path/to/adapter.gguf")
answer_2 = generator("prompt")
```

To load another adapter you need to re-initialize the model. Otherwise the adapter will be added on top of the previous one:

```python
from outlines import models

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter1.gguf") # Load first adapter

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter2.gguf") # Load second adapter
```

## Generate text

In addition to the parameters described in the [text generation section](../text.md) you can pass extra keyword arguments, for instance to set sampling parameters not exposed in Outlines' public API:

```python
from outlines import models, generate


model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)

answer = generator("A prompt", presence_penalty=0.8)
```

**Extra keyword arguments:**

The values of the keyword arguments you pass to the generator supersede the values set when initializing the sampler or generator. All extra sampling methods and repetition penalties are disabled by default.

| Parameters | Type | Description | Default |
|------------|------|-------------|---------|
| `suffix` | `Optional[str]` | A suffix to append to the generated text. If `None` no suffix is added. | `None` |
| `echo` | `bool` | Whether to prepend the prompt to the completion. | `False` |
| `seed` | `Optional[int]` | The random seed to use for sampling. | `None` |
| `max_tokens` | `Optional[int]` | The maximum number of tokens to generate. If `None` the maximum number of tokens depends on `n_ctx`. | `16` |
| `frequency_penalty` | `float` | The penalty to apply to tokens based on their frequency in the past 64 tokens. | `0.0` |
| `presence_penalty` | `float` | The penalty to apply to tokens based on their presence in the past 64 tokens. | `0.0` |
| `repeat_penalty` | `float` | The penalty to apply to repeated tokens in the past 64 tokens. | `1.` |
| `stopping_criteria` | `Optional[StoppingCriteriaList]` | A list of stopping criteria to use. | `None` |
| `logits_processor` | `Optional[LogitsProcessorList]` | A list of logits processors to use. The logits processor used for structured generation will be added to this list. | `None` |
| `temperature` | `float` | The temperature to use for sampling | `1.0` |
| `top_p` | `float` | The top-p value to use for [nucleus sampling][degeneration]. | `1.` |
| `min_p` | `float` | The min-p value to use for [minimum-p sampling][minimum-p]. | `0.` |
| `typical_p` | `float` | The p value to use for [locally typical sampling][locally-typical]. | `1.0` |
| `stop` | `Optional[Union[str, List[str]]]` | A list of strings that stop generation when encountered. | `[]` |
| `top_k` | `int` | The top-k value used for [top-k sampling][top-k]. Negative value to consider all logit values. | `-1` |
| `tfs_z` | `float` | The [tail-free sampling][tail-free] parameter. | `1.0` |
| `mirostat_mode` | `int` | The [mirostat sampling][mirostat] mode. | `0` |
| `mirostat_tau` | `float` | The target cross-entropy for [mirostat sampling][mirostat].| `5.0` |
| `mirostat_eta` | `float` | The learning rate used to update `mu` in [mirostat sampling][mirostat]. | `0.1` |

See the [llama-cpp-python documentation][llama-cpp-python-call] for the full and up-to-date list of parameters and the [llama.cpp code][llama-cpp-sampling-params] for the default values of other
sampling parameters.
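
For example, several of these parameters can be combined in a single call; the values below are purely illustrative:

```python
from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)

# Any parameter from the table above can be passed this way.
answer = generator(
    "A prompt",
    max_tokens=128,
    temperature=0.8,
    top_p=0.95,
    seed=42,
    stop=["\n"],
)
```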

### Streaming
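
Generators support streaming through their `.stream()` method, which returns a token iterator. A minimal sketch, assuming the llama.cpp integration exposes the same streaming interface as the other Outlines backends:

```python
from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)

# Tokens are printed as soon as they are generated.
for token in generator.stream("A prompt"):
    print(token, end="")
```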


## Installation

You need to install the `llama-cpp-python` library to use the llama.cpp integration.

### CPU

For a *CPU-only* installation run:

```bash
pip install llama-cpp-python
```

!!! Warning

    Do not run this command if you want support for BLAS, Metal or CUDA. Follow the instructions below instead.

### CUDA

```bash
CMAKE_ARGS="-DLLAMA_CUDA=on" pip install llama-cpp-python
```

It is also possible to install pre-built wheels with CUDA support (Python 3.10 and above):

```bash
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/<cuda-version>
```

Where `<cuda-version>` is one of the following, depending on the version of CUDA installed on your system:

- `cu121` for CUDA 12.1
- `cu122` for CUDA 12.2
- `cu123` for CUDA 12.3
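
For example, on a machine with CUDA 12.1 installed:

```bash
pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```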

### Metal

```bash
CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python
```

It is also possible to install pre-built wheels with Metal support (Python 3.10 or above, macOS 11.0 and above):

```bash
pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/metal
```

### OpenBLAS

```bash
CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-python
```

### Other backends

`llama.cpp` supports many other backends. Refer to the [llama-cpp-python documentation][llama-cpp-python-install] to use the following backends:

- CLBlast (OpenCL)
- hipBLAS (ROCm)
- Vulkan
- Kompute
- SYCL
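
As an illustration, a ROCm (hipBLAS) build is typically requested through a CMake flag passed at install time; check the llama-cpp-python README for the exact flag supported by your version:

```bash
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```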




[llamacpp]: https://github.com/abetlen/llama-cpp-python
[llama-cpp-python-call]: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__
[llama-cpp-python-install]: https://github.com/abetlen/llama-cpp-python/tree/08b16afe11e7b42adec2fed0a781123383476045?tab=readme-ov-file#supported-backends
[llama-cpp-sampling-params]: https://github.com/ggerganov/llama.cpp/blob/e11a8999b5690f810c2c99c14347f0834e68c524/common/sampling.h#L22
[mirostat]: https://arxiv.org/abs/2007.14966
[degeneration]: https://arxiv.org/abs/1904.09751
[top-k]: https://arxiv.org/abs/1805.04833
[minimum-p]: https://github.com/ggerganov/llama.cpp/pull/3841
[locally-typical]: https://arxiv.org/abs/2202.00666
[tail-free]: https://www.trentonbricken.com/Tail-Free-Sampling
3 changes: 3 additions & 0 deletions docs/reference/models/mamba.md
# Mamba

*Coming soon*
81 changes: 76 additions & 5 deletions docs/reference/models/vllm.md

!!! Note "Installation"

    You need to install the `vllm` library to use the vLLM integration. See the [installation section](#installation) for instructions to install vLLM for CPU or ROCm.

## Load the model

Models are loaded from the [HuggingFace hub](https://huggingface.co/).

!!! Warning "Device"

    The default installation of vLLM only allows models to be loaded on GPU. See the [installation instructions](#installation) to run models on CPU.


You can pass any parameter that you would normally pass to `vllm.LLM`, as keyword arguments:
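
For instance, the values below are illustrative; any argument accepted by `vllm.LLM` can be forwarded this way:

```python
from outlines import models

model = models.vllm(
    "mistralai/Mistral-7b-v0.1",
    trust_remote_code=True,       # allow custom modeling code from the hub
    gpu_memory_utilization=0.7,   # fraction of GPU memory to reserve
)
```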

## Generate text

In addition to the parameters described in the [text generation section](../text.md) you can pass an instance of `SamplingParams` directly to any generator via the `sampling_params` keyword argument:

```python
from vllm.sampling_params import SamplingParams
from outlines import models, generate
model = models.vllm("mistralai/Mistral-7b-v0.1")
generator = generate.text(model)

params = SamplingParams(n=2, frequency_penalty=1., min_tokens=2)
answer = generator("A prompt", sampling_params=params)
```

This also works with generators built with `generate.regex`, `generate.json`, `generate.cfg`, `generate.format` and `generate.choice`.

!!! Note

| `min_tokens` | `int` | Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated | `0` |
| `skip_special_tokens` | `bool` | Whether to skip special tokens in the output. | `True` |
| `spaces_between_special_tokens` | `bool` | Whether to add spaces between special tokens in the output. | `True` |

### Streaming

!!! Warning

    Streaming is not available for the offline vLLM integration.


## Installation

By default the vLLM library is installed with pre-compiled C++ and CUDA binaries and will only run on GPU:

```bash
pip install vllm
```

### CPU

You need to have the `gcc` compiler installed on your system. Then you will need to install vLLM from source. First clone the repository:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
```

Install the Python packages needed for the installation:

```bash
pip install --upgrade pip
pip install wheel packaging ninja "setuptools>=49.4.0" numpy
pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
```

and finally run:

```bash
VLLM_TARGET_DEVICE=cpu python setup.py install
```

See the [vLLM documentation][vllm-install-cpu] for more details, alternative installation methods (Docker) and performance tips.

### ROCm


You will need to install vLLM from source. First install PyTorch on ROCm:

```bash
pip install torch==2.2.0.dev20231206+rocm5.7 --index-url https://download.pytorch.org/whl/nightly/rocm5.7 # tested version
```

You will then need to install flash attention for ROCm following [these instructions][rocm-flash-attention]. You can then install `xformers==0.0.23` and apply the patches needed to adapt Flash Attention for ROCm:

```bash
pip install xformers==0.0.23 --no-deps
bash patch_xformers.rocm.sh
```

And finally build vLLM:

```bash
cd vllm
pip install -U -r requirements-rocm.txt
python setup.py install # This may take 5-10 minutes.
```

See the [vLLM documentation][vllm-install-rocm] for alternative installation methods (Docker).


[vllm-install-cpu]: https://docs.vllm.ai/en/latest/getting_started/cpu-installation.html
[vllm-install-rocm]: https://docs.vllm.ai/en/latest/getting_started/amd-installation.html
[rocm-flash-attention]: https://github.com/ROCm/flash-attention/tree/flash_attention_for_rocm#amd-gpurocm-support