generate.regex() fails to generate regex-constrained text with madlad400-3b-mt T5 model #1133

Open
oflakne26 opened this issue Sep 6, 2024 · 1 comment

oflakne26 commented Sep 6, 2024

Describe the issue as clearly as possible:

I love the Outlines project, especially because it allows for various model configurations while maintaining constraint.

However, I have been unable to use Outlines with my desired model: https://huggingface.co/jbochi/madlad400-3b-mt

Unconstrained text generation works fine with this T5 Seq2Seq model, but regex-constrained generation fails.

Steps/code to reproduce the bug:

from outlines import models, generate
from transformers import AutoModelForSeq2SeqLM
 
model = models.transformers("jbochi/madlad400-3b-mt", model_class=AutoModelForSeq2SeqLM, device="cpu")
 
print("Model loaded!")
 
# Initialize the generator
generator = generate.regex(model, regex_str=r"hello|goodbye")
 
# Generate a response
answer = generator("<2en> hola")
 
# Print the response
print(answer)

Expected result:

hello

Error message:

RuntimeError: Cannot convert token `X` (XXXXX) to bytes: X

Outlines/Python version information:

I am on the latest version of Outlines, installed from the repository's GitHub URL.

I also tested this code with several other versions of Outlines, and none of them worked.

I am on MacOS.

Context for the issue:

The main difference between Outlines and other popular libraries such as Microsoft's Guidance is its versatility, like its support for T5 models. However, this model is currently unusable for regex-constrained generation.

@oflakne26 oflakne26 added the bug label Sep 6, 2024
@oflakne26 oflakne26 changed the title generate.regex() fails to load madlad400-3b-mt T5 model generate.regex() fails to generate with madlad400-3b-mt T5 model Sep 6, 2024
@oflakne26 oflakne26 changed the title generate.regex() fails to generate with madlad400-3b-mt T5 model generate.regex() fails to generate regex-constrained text with madlad400-3b-mt T5 model Sep 6, 2024
lapp0 (Contributor) commented Sep 15, 2024

I ran your reproduction script. Thanks for informing us about this issue.

Here are some samples of tokens (in byte format) that cause the error with your model:

  • \xef\xbf\xbd\xe2\x80\x9e
  • \xef\xbf\xbd\xc2\xb4
  • \xe2\x80\x94\xef\xbf\xbd
  • \xef\xbf\xbd\xc2\xbc

(Interestingly, each of these contains the replacement character, \xef\xbf\xbd (U+FFFD), perhaps indicating whether the other bytes are a prefix or a suffix of a complete character.)

This seems to be a fourth method by which a tokenizer encodes "incomplete unicode" characters:
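To make the observation about the sample bytes concrete, here is a minimal sketch in plain Python (no Outlines or tokenizer calls). It decodes the four byte strings quoted above and reports where U+FFFD sits in each one. The helper name `replacement_position` is hypothetical and exists only for this illustration.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# The four byte strings quoted in the list above.
samples = [
    b"\xef\xbf\xbd\xe2\x80\x9e",
    b"\xef\xbf\xbd\xc2\xb4",
    b"\xe2\x80\x94\xef\xbf\xbd",
    b"\xef\xbf\xbd\xc2\xbc",
]


def replacement_position(token_bytes: bytes) -> str:
    """Report where U+FFFD sits in a token's decoded text (hypothetical helper)."""
    text = token_bytes.decode("utf-8")
    if text.startswith(REPLACEMENT):
        return "start"
    if text.endswith(REPLACEMENT):
        return "end"
    return "middle" if REPLACEMENT in text else "absent"


positions = [replacement_position(s) for s in samples]

# A lone lead byte of a multi-byte character is not valid UTF-8, so a lossy
# decode substitutes U+FFFD and the original byte can no longer be recovered.
lossy = b"\xe2".decode("utf-8", errors="replace")

for sample, position in zip(samples, positions):
    print(sample, "-> U+FFFD at", position)
```

All four samples are themselves valid UTF-8, which suggests the original partial bytes were already replaced by U+FFFD before the token strings reached the vocabulary. That reading is an inference from these samples, not something confirmed for this tokenizer.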

https://github.com/dottxt-ai/outlines/blob/62b7601/outlines/fsm/regex.py#L910-L915

This issue should be resolved by researching an effective method for standardizing tokenizers' representations of incomplete unicode. If there isn't one, then we can update re_llama_byte_token in regex.py.
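As a starting point for that research, a diagnostic along the following lines could list which tokens in a vocabulary fall outside the already-handled cases. This is only a sketch under stated assumptions: `find_unhandled_tokens` and the toy vocabulary are hypothetical, and the `<0xXX>` pattern is an assumption about what a llama-style byte token looks like, not a copy of `re_llama_byte_token`. It finds the problematic tokens but does not fix them.

```python
import re
from typing import Dict, List, Tuple

REPLACEMENT = "\ufffd"

# Assumed shape of a llama-style byte token such as "<0xE2>" (illustrative only).
BYTE_TOKEN = re.compile(r"^<0x[0-9A-F]{2}>$")


def find_unhandled_tokens(vocab: Dict[int, Tuple[str, str]]) -> List[int]:
    """Return ids of tokens whose decoded text contains U+FFFD and whose raw
    form is not a <0xXX> byte token.

    `vocab` maps token id -> (raw token, decoded string). Hypothetical helper.
    """
    unhandled = []
    for token_id, (raw_token, decoded) in sorted(vocab.items()):
        if REPLACEMENT in decoded and not BYTE_TOKEN.match(raw_token):
            unhandled.append(token_id)
    return unhandled


# Toy vocabulary, made up for this example; ids 2 and 3 mirror two of the
# byte samples reported above.
toy_vocab = {
    0: ("hello", "hello"),
    1: ("<0xE2>", "\ufffd"),
    2: ("\ufffd\u201e", "\ufffd\u201e"),
    3: ("\u2014\ufffd", "\u2014\ufffd"),
}

unhandled_ids = find_unhandled_tokens(toy_vocab)
print(unhandled_ids)
```

Running a check like this against the real madlad400-3b-mt vocabulary would show how many tokens are affected, and whether U+FFFD always appears at a token boundary.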
