
LlamaCpp doesn't work with generate.fsm for custom FSMs #965

Open
Radu1999 opened this issue Jun 13, 2024 · 1 comment · May be fixed by #997
Labels
bug · llama.cpp (Related to the `llama.cpp` integration) · structured generation (Linked to structured generation)

Comments

@Radu1999

Describe the issue as clearly as possible:

The custom FSM example from the documentation doesn't work with LlamaCpp; it fails with

logits, kv_cache = model(token_ids, attention_masks, kv_cache)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'LlamaCpp' object is not callable

Steps/code to reproduce the bug:

from transformers import AutoTokenizer
from outlines import models, generate
from outlines.models.transformers import TransformerTokenizer
from llama_cpp import Llama
import interegular
import torch

if __name__ == "__main__":
    # Create model
    llm = Llama("./models/Mistral-7B-Instruct-v0.2/mistral-7b-instruct-v0.2.Q5_K_M.gguf")
    model = models.LlamaCpp(llm)
    model.tokenizer = TransformerTokenizer(AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2', use_fast=True))
    model.device = 'cpu'

    # Create fsm
    list_of_strings_pattern = """\["[^"\s]*"(?:,"[^"\s]*")*\]"""
    pink_elephant_pattern = """.*(pink|elephant).*"""

    list_of_strings_fsm = interegular.parse_pattern(list_of_strings_pattern).to_fsm()
    pink_elephant_fsm = interegular.parse_pattern(pink_elephant_pattern).to_fsm()

    difference_fsm = list_of_strings_fsm - pink_elephant_fsm

    generator = generate.fsm(model, difference_fsm)
    rng = torch.Generator(device="cpu")
    rng.manual_seed(789005)

    response = generator("[INST] Don't talk about pink elephants [/INST]")
    print(response)

Expected result:

I'd expect it to work :)

Error message:

No response

Outlines/Python version information:

latest

Context for the issue:

No response

@Radu1999 added the bug label Jun 13, 2024
@brandonwillard added the enhancement, bug, structured generation, and llama.cpp labels and removed the bug and enhancement labels Jun 13, 2024
@lapp0 linked a pull request Jun 21, 2024 that will close this issue
@lapp0 (Collaborator) commented Jun 21, 2024

It will be a bit before this is merged into main, but you can try it early with

pip install --upgrade git+https://github.com/lapp0/outlines@fix-llamacpp-fsm

Works on my end, please let me know if you run into any issues!

rlouf pushed a commit that referenced this issue Jun 30, 2024
….py (#998)

A lot of these fixes were intended for #966; however, that's blocked until there's a new `transformers` release.

These improvements are general to all models and will enable PRs resolving #806 and #965.

# Structure of `OutlinesLogitsProcessor`

The goal is to create a base class which allows a logits processor to be implemented once and used with any `outlines.models` inference library.

To accomplish this, we must normalize the input array. It must have a
consistent type (`torch.Tensor`) and consistent dimensionality (2). We
can normalize both of these simply, and without any copy operations.

`mlx.core.array`, `numpy.array`, and `torch.Tensor` all support [Python's array standard `__dlpack__`](https://data-apis.org/array-api/latest/API_specification/generated/array_api.array.__dlpack__.html). This standard allows for casting between array types without copying.
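
For illustration only (not code from this PR), a zero-copy round trip between NumPy and torch via `__dlpack__` could look like the sketch below; the shapes and variable names are made up:

import numpy as np
import torch

# Hypothetical logits produced by a NumPy-based inference library (batch of 1, vocab of 32000).
np_logits = np.random.rand(1, 32000).astype(np.float32)

# Zero-copy view of the NumPy array as a torch.Tensor via the __dlpack__ protocol.
torch_logits = torch.from_dlpack(np_logits)

# ... apply torch-based logits processing here, e.g. mask out token id 0 ...
torch_logits[:, 0] = float("-inf")

# Zero-copy view back as a NumPy array for the calling library.
np_out = np.from_dlpack(torch_logits)
assert np_out.shape == (1, 32000)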

`torch.Tensor` is the only input type which cannot always be cast to any
other type because torch tensors may live in GPU memory. Therefore, we
cast all arrays to `torch.Tensor`, implement logits processors using
torch methods, and convert back to the original array type in
`OutlinesLogitsProcessor`. See the docstring of
`OutlinesLogitsProcessor.__call__()` for more details.
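
As a rough, hypothetical sketch of that pattern (an assumption-level illustration, not the actual `OutlinesLogitsProcessor` code), the wrapper's `__call__` might normalize its inputs like this:

import torch

class ExampleLogitsProcessor:
    """Sketch only: class and method names here are illustrative assumptions."""

    def process_logits(self, input_ids, logits: torch.Tensor) -> torch.Tensor:
        # Subclasses would implement their biasing logic with torch ops on a 2D tensor.
        return logits

    def __call__(self, input_ids, logits):
        # Normalize to a torch.Tensor; torch.as_tensor avoids a copy for torch/NumPy inputs.
        torch_logits = torch.as_tensor(logits)

        # Normalize to two dimensions: (batch, vocab_size).
        was_1d = torch_logits.ndim == 1
        if was_1d:
            torch_logits = torch_logits.unsqueeze(0)
            input_ids = [list(input_ids)]

        processed = self.process_logits(input_ids, torch_logits)

        # Undo the added batch dimension; a full implementation would also
        # convert back to the caller's original array type here.
        return processed.squeeze(0) if was_1d else processed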

# Detailed Changes
- Rename `BaseLogitsProcessor` to `OutlinesLogitsProcessor`
- Ensure `OutlinesLogitsProcessor.process_logits()` is always passed a
2D batch request with `torch.Tensor` logits and `List` input_ids. Also
clean up code to be more readable in `OutlinesLogitsProcessor.__call__()`
- Ensure `FSMLogitsProcessor` allows unstable sequence ordering (beam search in transformers and vLLM changes the order of sequences); see the sketch after this list
- Update `tests/generate/test_generate.py` to cover more permutations of
  - regex / text 
  - batch / single
  - greedy / multinomial / beam search
  - `stream()` / `generate()`
- Ensure performance stability with different array libraries through
`benchmark_processors.py`
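
For the unstable-ordering bullet above, one plausible approach (a sketch under assumptions, not necessarily what this commit implements) is to key FSM state by each sequence's generated-token prefix rather than by its batch position, so reordered beams still find their own state:

class PrefixKeyedStates:
    """Hypothetical helper: track FSM state per generated-token prefix, not per batch index."""

    def __init__(self, initial_state: int = 0):
        # The empty prefix maps to the FSM's initial state.
        self._states: dict[tuple, int] = {(): initial_state}

    def get(self, prefix: tuple) -> int:
        # Beams may be reordered between steps; looking up by prefix rather than
        # by batch position still returns the right state for each sequence.
        return self._states[prefix]

    def advance(self, prefix: tuple, token_id: int, next_state: int) -> None:
        # Record the state reached after appending token_id to this prefix.
        self._states[prefix + (token_id,)] = next_state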
Projects
Status: Todo

3 participants