support for Unicode regular expressions #937

skunkwerk · 2024-06-01T00:53:04Z

What behavior of the library made you think about the improvement?

I'm trying to restrict the output of a multilingual LLM to a single language (Korean), as it was trained on multiple languages and sometimes mixes them in the output.

With this regular expression:

([0-9\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uD7B0-\uD7FF\n,\s\.\']*)

I get the error:

Interegular: regex module unicode properties are not supported

How would you like it to behave?

There should be a way to restrict output to a specific language's character set.

plaunezkiy · 2024-07-02T20:57:29Z

The grammar is parsed via Lark. Have a look at their docs, import the Unicode functionality, and try again:
https://lark-parser.readthedocs.io/en/stable/grammar.html#import

lapp0 · 2024-07-02T23:41:49Z

It's not obvious to me why your expression fails, but generator = generate.regex(model, r'[😨]+') works. Maybe we need to update Outlines so it allows escaped unicode along with literal unicode?

Could you leave the issue open so we can address this at some point, but for now, try the literals instead? e.g. instead of \uAC00 use 가
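
To make the "use literals instead" workaround concrete, here is a minimal sketch that rewrites every \uXXXX escape in the original pattern into the literal character it names. The helper name literalize_unicode_escapes is hypothetical (not part of Outlines), and the sketch assumes the pattern contains no escaped backslash immediately before a "u". It only checks the converted pattern against Python's built-in re module; whether Outlines accepts the result has not been verified here.

```python
import re

# Pattern from the issue, kept as a raw string so the \uXXXX escapes
# stay as backslash sequences instead of being decoded by Python.
ESCAPED = r"([0-9\uAC00-\uD7AF\u1100-\u11FF\u3130-\u318F\uA960-\uA97F\uD7B0-\uD7FF\n,\s\.\']*)"

# Matches a backslash, "u", and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def literalize_unicode_escapes(pattern: str) -> str:
    """Replace each \\uXXXX escape with the literal character it names.

    Hypothetical helper for illustration. Assumption: the pattern has no
    escaped backslash directly followed by "u" and four hex digits.
    """
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), pattern)


LITERAL = literalize_unicode_escapes(ESCAPED)

# Sanity check with the standard library regex engine only.
print(re.fullmatch(LITERAL, "안녕하세요, 123") is not None)
print(re.fullmatch(LITERAL, "hello") is None)
```

The resulting LITERAL string could then be passed in place of the escaped pattern, e.g. generate.regex(model, LITERAL), following the suggestion above.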

skunkwerk added the enhancement label Jun 1, 2024
