support for Unicode regular expressions #937

Open
skunkwerk opened this issue Jun 1, 2024 · 2 comments

Comments

@skunkwerk

What behavior of the library made you think about the improvement?

I'm trying to restrict the output of a multi-lingual LLM to a single language (Korean), as it was trained in multiple languages and sometimes mixes them in the output.

With this regular expression:

([0-9\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uD7B0-\uD7FF\n,\s\.\']*)

I get the error:

Interegular: regex module unicode properties are not supported

How would you like it to behave?

There should be a way to restrict output to a specific language's character set.
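
For readers unfamiliar with the escapes in the expression above, here is a minimal sketch that rebuilds the same pattern from named Unicode blocks and checks it with Python's standard `re` module. The names `HANGUL_BLOCKS`, `build_korean_pattern`, and `is_korean_only` are hypothetical helpers introduced for illustration, not part of Outlines or Interegular. It only shows that the pattern is accepted by `re`; it says nothing about how Interegular parses it.

```python
import re

# Unicode blocks behind the \uXXXX ranges in the reported pattern,
# listed in the same order as they appear there.
HANGUL_BLOCKS = {
    "Hangul Syllables": (0xAC00, 0xD7AF),
    "Hangul Jamo": (0x1100, 0x11FF),
    "Hangul Compatibility Jamo": (0x3130, 0x318F),
    "Hangul Jamo Extended-A": (0xA960, 0xA97F),
    "Hangul Jamo Extended-B": (0xD7B0, 0xD7FF),
}


def build_korean_pattern(blocks=HANGUL_BLOCKS):
    """Return the pattern text, with \\uXXXX escapes kept as escapes."""
    ranges = "".join(f"\\u{lo:04X}-\\u{hi:04X}" for lo, hi in blocks.values())
    # Digits, Hangul ranges, newline, comma, whitespace, period, apostrophe.
    return rf"([0-9{ranges}\n,\s\.\']*)"


def is_korean_only(text):
    """Check a finished string against the pattern using the stdlib engine."""
    return re.fullmatch(build_korean_pattern(), text) is not None


if __name__ == "__main__":
    print(build_korean_pattern())
    print(is_korean_only("한국어 123"))
    print(is_korean_only("hello"))
```

Because the pattern ends in `*`, the empty string also matches. A check like `is_korean_only` can only validate output after generation; it does not constrain decoding the way the requested feature would.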

@plaunezkiy

The grammar is parsed via Lark. Have a look at their docs, import the Unicode functionality, and try again:
https://lark-parser.readthedocs.io/en/stable/grammar.html#import

@lapp0
Collaborator

lapp0 commented Jul 2, 2024

It's not obvious to me why your expression fails, but generator = generate.regex(model, r'[😨]+') works. Maybe we need to update Outlines so it allows escaped unicode along with literal unicode?

Could you leave the issue open so we can address this at some point, but for now, try the literals instead? E.g. instead of \uAC00, use the literal character it encodes.
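
The suggested workaround, swapping `\uXXXX` escapes for literal characters, can be sketched as a small preprocessing step. `expand_unicode_escapes` is a hypothetical helper written for this illustration, not an Outlines or Interegular function, and whether the expanded pattern gets past the Interegular error has not been verified here.

```python
import re

# Match any backslash escape; the \uXXXX form is tried first so that an
# escaped backslash followed by "uAC00" is never mistaken for \uAC00.
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|.)", re.DOTALL)


def expand_unicode_escapes(pattern: str) -> str:
    """Replace \\uXXXX escapes in a regex string with literal characters."""

    def replace(match):
        body = match.group(1)
        if len(body) == 5 and body[0] == "u":
            ch = chr(int(body[1:], 16))
            # ASCII results may be regex metacharacters, so keep them escaped.
            return re.escape(ch) if ord(ch) < 128 else ch
        # Any other escape (\n, \s, \., \\ ...) is left untouched.
        return match.group(0)

    return _ESCAPE.sub(replace, pattern)


if __name__ == "__main__":
    original = r"([0-9\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uD7B0-\uD7FF\n,\s\.\']*)"
    literal_pattern = expand_unicode_escapes(original)
    print(literal_pattern)
    # The call shape below is taken from this thread and is not run here:
    # generator = generate.regex(model, literal_pattern)
```

The helper assumes four-digit `\u` escapes only. Escapes in the surrogate range (`\uD800` to `\uDFFF`) would expand to lone surrogates and are best avoided.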
