Whitespace in Regex leads to different behaviours #631
Replies: 6 comments
-
This behavior occurs very often, and it is interesting that it doesn't occur when we allow at most one whitespace. We need to get to the bottom of this by looking at the logits distribution (and the generated regex). It might be a problem with the way the FSM handles tokens that contain whitespace, but we can't know before we have explored this empirically.
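One quick way to get a feel for the token-level side of this is to scan a tokenizer's vocabulary for entries that decode to pure whitespace, since those are exactly the tokens the FSM has to deal with at every optional-whitespace position. A minimal sketch, using GPT-2's openly available tokenizer as a stand-in for the Llama tokenizer from the original example:

```python
from transformers import AutoTokenizer

# GPT-2 is only a stand-in here; the original example uses a Llama tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

whitespace_only = []
for token, token_id in tok.get_vocab().items():
    decoded = tok.decode([token_id])
    # Keep tokens that decode to a non-empty, whitespace-only string.
    if decoded and decoded.strip() == "":
        whitespace_only.append((token_id, decoded))

print(f"{len(whitespace_only)} tokens decode to pure whitespace")
print(sorted(whitespace_only, key=lambda x: len(x[1]), reverse=True)[:10])
```

Every one of those tokens is a legal continuation wherever the regex allows an unbounded whitespace run, which is why a greedy whitespace class leaves the model so many whitespace continuations to choose from.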
-
Update (from Discord): using
-
I think we should have some datasets to check for this type of variation and for other future changes.
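As a rough sketch of what such a check could look like (the `generate_json` helper below is a hypothetical stand-in for whatever constrained-generation entry point the tests would call, and here just returns a canned string so the example runs):

```python
import json
import pytest

def generate_json(prompt: str) -> str:
    # Hypothetical stand-in for the real schema-constrained generation call.
    return '{"age": 30}'

PROMPTS = [
    "Give me a character description",
    "Describe a fantasy character as JSON",
]

@pytest.mark.parametrize("prompt", PROMPTS)
def test_output_is_valid_json(prompt):
    output = generate_json(prompt)
    parsed = json.loads(output)        # output must be syntactically valid JSON
    assert isinstance(parsed, dict)
    assert "\n\n" not in output        # guard against runaway whitespace loops
```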
-
I think the gorillas dataset is a good starting point.
-
After reproducing this issue locally, I tried the alternative whitespace pattern, and it fails in exactly the same way as the current whitespace pattern at around 43 draws from the same RNG. It is therefore very likely that those "better" results are just a sampling artifact due to something like a change in the order of the sampled elements.

I added some simple debug output via a debug multinomial sampling function as follows:

```python
import torch

def alt_multinomial(logits, samples: int, rng):
    # `model` is the Outlines model object from the surrounding scope.
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # Restrict attention to the tokens the FSM left unmasked (non-zero probability).
    nonzero_ids = torch.nonzero(probs.squeeze()).squeeze()
    nonzero_probs = probs.squeeze()[nonzero_ids].squeeze()
    sorted_nonzero_probs, sort_idx = torch.sort(nonzero_probs, descending=True)
    sorted_nonzero_ids = nonzero_ids[sort_idx]
    top_n_sorted_ids = sorted_nonzero_ids[:30]
    top_n_sorted_probs = sorted_nonzero_probs[:30]
    top_n_sorted_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(top_n_sorted_ids)
    print(f"{top_n_sorted_ids=}")
    print(f"{top_n_sorted_probs=}")
    print(f"{top_n_sorted_tokens=}")
    next_token_ids = torch.multinomial(probs, num_samples=samples, generator=rng)
    print(f"{next_token_ids=}")
    return next_token_ids
```

Here is a truncated sample of what the original example in this issue produced:

```
top_n_sorted_ids=tensor([29912, 6377, 126], device='cuda:0')
top_n_sorted_probs=tensor([7.6933e-01, 2.3066e-01, 1.1232e-05], device='cuda:0')
top_n_sorted_tokens=['{', '{"', '<0x7B>']
next_token_ids=tensor([[6377]], device='cuda:0')
top_n_sorted_ids=tensor([29874, 482, 351, 100], device='cuda:0')
top_n_sorted_probs=tensor([5.1630e-01, 3.9738e-01, 8.6315e-02, 4.5183e-06], device='cuda:0')
top_n_sorted_tokens=['a', 'age', 'ag', '<0x61>']
next_token_ids=tensor([[29874]], device='cuda:0')
top_n_sorted_ids=tensor([29887, 479, 106], device='cuda:0')
top_n_sorted_probs=tensor([7.1215e-01, 2.8767e-01, 1.7904e-04], device='cuda:0')
top_n_sorted_tokens=['g', 'ge', '<0x67>']
next_token_ids=tensor([[29887]], device='cuda:0')
top_n_sorted_ids=tensor([29872, 104], device='cuda:0')
top_n_sorted_probs=tensor([0.9938, 0.0062], device='cuda:0')
top_n_sorted_tokens=['e', '<0x65>']
next_token_ids=tensor([[29872]], device='cuda:0')
top_n_sorted_ids=tensor([29908, 1115, 37], device='cuda:0')
top_n_sorted_probs=tensor([8.8360e-01, 1.1640e-01, 3.3596e-08], device='cuda:0')
top_n_sorted_tokens=['"', '":', '<0x22>']
next_token_ids=tensor([[29908]], device='cuda:0')
top_n_sorted_ids=tensor([29871, 13, 584, 259, 1678, 418, 268, 462, 4706, 539,
3986, 18884, 308, 965, 9651, 795, 632, 1669, 29901, 61,
35], device='cuda:0')
top_n_sorted_probs=tensor([6.0663e-01, 2.9392e-01, 9.2920e-02, 2.4406e-03, 1.3480e-03, 8.0887e-04,
3.4874e-04, 3.0195e-04, 2.9884e-04, 2.1358e-04, 1.4869e-04, 1.2988e-04,
1.0754e-04, 9.9484e-05, 9.4625e-05, 6.0540e-05, 5.8317e-05, 5.6355e-05,
1.7529e-05, 3.7057e-08, 3.7056e-08], device='cuda:0')
top_n_sorted_tokens=['▁', '<0x0A>', '▁:', '▁▁', '▁▁▁', '▁▁▁▁▁', '▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', ':', '<0x3A>', '<0x20>']
next_token_ids=tensor([[29871]], device='cuda:0')
top_n_sorted_ids=tensor([ 13, 584, 29901, 29871, 462, 4706, 9651, 18884, 418, 308,
3986, 795, 259, 1669, 1678, 268, 965, 632, 539, 61,
35], device='cuda:0')
top_n_sorted_probs=tensor([9.9445e-01, 5.4149e-03, 1.0617e-04, 6.0207e-06, 2.8670e-06, 2.1342e-06,
2.1228e-06, 1.9882e-06, 1.6228e-06, 1.3470e-06, 1.3062e-06, 1.2652e-06,
1.1164e-06, 1.1083e-06, 9.3403e-07, 6.5418e-07, 5.7422e-07, 4.9309e-07,
4.8139e-07, 2.2805e-07, 2.2804e-07], device='cuda:0')
top_n_sorted_tokens=['<0x0A>', '▁:', ':', '▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁', '▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '<0x3A>', '<0x20>']
next_token_ids=tensor([[13]], device='cuda:0')
top_n_sorted_ids=tensor([ 13, 29871, 1678, 4706, 9651, 259, 462, 418, 18884, 308,
268, 3986, 29901, 795, 632, 539, 965, 1669, 584, 61,
35], device='cuda:0')
top_n_sorted_probs=tensor([5.9909e-01, 1.0364e-01, 7.8102e-02, 4.9883e-02, 2.9212e-02, 2.6430e-02,
2.3243e-02, 2.2374e-02, 1.4912e-02, 1.1109e-02, 1.0445e-02, 9.9495e-03,
5.7586e-03, 5.7388e-03, 4.1190e-03, 2.7488e-03, 1.9237e-03, 1.2609e-03,
5.5879e-05, 1.1995e-08, 1.1995e-08], device='cuda:0')
top_n_sorted_tokens=['<0x0A>', '▁', '▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁', '▁▁▁▁▁▁▁▁▁', ':', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁:', '<0x3A>', '<0x20>']
next_token_ids=tensor([[29871]], device='cuda:0')
```

After sampling the final double quote, here are the decoded results for the two highest-probability tokens:

```python
model.tokenizer.tokenizer.convert_ids_to_tokens([13])
# ['<0x0A>']
model.tokenizer.decode([13])
# ['\n']
# A prefix token apparently
model.tokenizer.tokenizer.convert_ids_to_tokens([29871])
# ['▁']
model.tokenizer.decode([29871])
# ['']
```

Apparently we're treating the prefix token by itself as an empty string. Let's take a look at the encoding and decoding situation for the prefix and sampled token results just before the observed loop:

```python
model.tokenizer.tokenizer.encode('Give me a character description{"age"\n')
# [1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 13]
model.tokenizer.tokenizer.encode('Give me a character description{"age" ')
# [1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871]
model.tokenizer.tokenizer.encode('Give me a character description{"age" \n')
# [1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871, 13]
model.tokenizer.tokenizer.decode([1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 13])
# '<s> Give me a character description{"age"\n'
model.tokenizer.tokenizer.decode([1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871])
# '<s> Give me a character description{"age" '
model.tokenizer.decode([1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871])
# ['', 'Give', 'me', 'a', 'character', 'description', '{"', 'age', '"', '']
```

Again, the prefix token is decoded to an empty string by the `model.tokenizer.decode` wrapper.

Also, here's the unconstrained result that would be produced starting from the token sequence just before the loop observed during constrained generation:

```python
from outlines.generate.api import text
from outlines.generate.samplers import greedy

def test_sampler(*args):
    res = greedy(*args)
    print(res)
    return res

text_gen = text(model, max_tokens=1, sampler=test_sampler)
text_gen('Give me a character description{"age"')
# tensor([[29871]], device='cuda:0')
# ''
```

These results show that, at the same point in the sequence, we're getting exactly the results we would otherwise get without the constraints, so that's a good sign that we're at least not using the wrong support.
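Separately, one way to narrow down whether the empty-string decode of the bare prefix token comes from the underlying Hugging Face tokenizer or from the Outlines tokenizer wrapper would be to compare the two decode paths directly; this is a hypothetical follow-up using the same `model` object as above, not something that was run here:

```python
# `model.tokenizer` is the Outlines tokenizer wrapper used above;
# `model.tokenizer.tokenizer` is the underlying Hugging Face tokenizer.
raw_decode = model.tokenizer.tokenizer.decode([29871])  # HF/SentencePiece view of the prefix token
wrapped_decode = model.tokenizer.decode([29871])        # Outlines wrapper view (shown above as [''])
print(f"{raw_decode!r} vs {wrapped_decode!r}")
```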
-
If I understand Outlines correctly, is it expected that this won't happen when at most one whitespace is allowed? After the first generated
I'm curious: what is the point of allowing whitespace in generated JSON at all? Wouldn't always producing minified JSON be more straightforward and more efficient, while also fixing this issue?
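For reference, "minified" here would mean emitting JSON with no whitespace between structural tokens at all, which is easy to illustrate with the standard library (this only shows the target output format, not how Outlines builds its regex):

```python
import json

data = {"age": 30, "name": "Aura"}

# Pretty-printed form: a schema regex must allow whitespace between tokens.
print(json.dumps(data, indent=2))

# Minified form: no inter-token whitespace, so no optional-whitespace class
# would be needed in the schema regex at all.
print(json.dumps(data, separators=(",", ":")))
```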
-
What behavior of the library made you think about the improvement?

Running the above script results in an exception because the LLM generates

```
"age" \n \n \n \n .......
```

Now, when we update the regex definition of whitespace to

```python
whitespace = r"[\n ]?"
```

it leads to better output generation.

How would you like it to behave?

No response
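To make the reported difference concrete, here is a standalone illustration using only Python's re module; the fragment pattern below is a simplified stand-in for the schema regex Outlines actually generates:

```python
import re

# A repeated whitespace class: unbounded runs of '\n' and ' ' stay legal.
repeated = r'\{"age"[\n ]*:[\n ]*[0-9]+\}'
# The proposed whitespace = r"[\n ]?": at most one whitespace character is legal.
at_most_one = r'\{"age"[\n ]?:[\n ]?[0-9]+\}'

looping = '{"age"' + "\n" * 50 + ': 30}'   # shaped like the runaway output reported above
normal = '{"age": 30}'

print(bool(re.fullmatch(repeated, looping)))      # True  -> the newline run never violates the pattern
print(bool(re.fullmatch(at_most_one, looping)))   # False -> the run is cut off after one character
print(bool(re.fullmatch(repeated, normal)))       # True
print(bool(re.fullmatch(at_most_one, normal)))    # True
```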