Whitespace in Regex leads to different behaviours #631
Replies: 6 comments
-
This behavior occurs very often, and it is interesting that it doesn't occur when we allow at most one whitespace. We need to get to the bottom of this by looking at the logits distribution (and the generated regex). It might be a problem with the way the FSM handles tokens that contain whitespace, but we can't know before we have explored this empirically.
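One quick way to get a feel for the token-level side of this is to scan a tokenizer's vocabulary for entries that decode to pure whitespace, since those are exactly the tokens the FSM has to deal with at every optional-whitespace position. A minimal sketch, using GPT-2's openly available tokenizer as a stand-in for the Llama tokenizer from the original example:

```python
from transformers import AutoTokenizer

# GPT-2 is only a stand-in here; the original example uses a Llama tokenizer.
tok = AutoTokenizer.from_pretrained("gpt2")

whitespace_only = []
for token, token_id in tok.get_vocab().items():
    decoded = tok.decode([token_id])
    # Keep tokens that decode to a non-empty, whitespace-only string.
    if decoded and decoded.strip() == "":
        whitespace_only.append((token_id, decoded))

print(f"{len(whitespace_only)} tokens decode to pure whitespace")
print(sorted(whitespace_only, key=lambda x: len(x[1]), reverse=True)[:10])
```

Every one of those tokens is a legal continuation wherever the regex allows an unbounded whitespace run, which is why a greedy whitespace class leaves the model so many whitespace continuations to choose from.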
-
Update (from Discord): using
-
I think we should have some datasets to check for this type of variation and for other future changes.
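As a rough sketch of what such a check could look like (the `generate_json` helper below is a hypothetical stand-in for whatever constrained-generation entry point the tests would call, and here just returns a canned string so the example runs):

```python
import json
import pytest

def generate_json(prompt: str) -> str:
    # Hypothetical stand-in for the real schema-constrained generation call.
    return '{"age": 30}'

PROMPTS = [
    "Give me a character description",
    "Describe a fantasy character as JSON",
]

@pytest.mark.parametrize("prompt", PROMPTS)
def test_output_is_valid_json(prompt):
    output = generate_json(prompt)
    parsed = json.loads(output)        # output must be syntactically valid JSON
    assert isinstance(parsed, dict)
    assert "\n\n" not in output        # guard against runaway whitespace loops
```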
-
I think the gorillas dataset is a good starting point.
-
After reproducing this issue locally, I tried the alternative whitespace pattern, and it fails in exactly the same way as the current whitespace pattern at around 43 draws from the same RNG. It is therefore very likely that those "better" results are just a sampling artifact due to something like a change in the order of the sampled elements.

I added some simple debug output via a debug multinomial sampling function as follows:

```python
import torch

def alt_multinomial(logits, samples: int, rng):
    # `model` is the Outlines model object from the surrounding scope.
    probs = torch.nn.functional.softmax(logits, dim=-1)
    # Restrict attention to the tokens the FSM left unmasked (non-zero probability).
    nonzero_ids = torch.nonzero(probs.squeeze()).squeeze()
    nonzero_probs = probs.squeeze()[nonzero_ids].squeeze()
    sorted_nonzero_probs, sort_idx = torch.sort(nonzero_probs, descending=True)
    sorted_nonzero_ids = nonzero_ids[sort_idx]
    top_n_sorted_ids = sorted_nonzero_ids[:30]
    top_n_sorted_probs = sorted_nonzero_probs[:30]
    top_n_sorted_tokens = model.tokenizer.tokenizer.convert_ids_to_tokens(top_n_sorted_ids)
    print(f"{top_n_sorted_ids=}")
    print(f"{top_n_sorted_probs=}")
    print(f"{top_n_sorted_tokens=}")
    next_token_ids = torch.multinomial(probs, num_samples=samples, generator=rng)
    print(f"{next_token_ids=}")
    return next_token_ids
```

Here is a truncated sample of what the original example in this issue produced:

```
top_n_sorted_ids=tensor([29912, 6377, 126], device='cuda:0')
top_n_sorted_probs=tensor([7.6933e-01, 2.3066e-01, 1.1232e-05], device='cuda:0')
top_n_sorted_tokens=['{', '{"', '<0x7B>']
next_token_ids=tensor([[6377]], device='cuda:0')
top_n_sorted_ids=tensor([29874, 482, 351, 100], device='cuda:0')
top_n_sorted_probs=tensor([5.1630e-01, 3.9738e-01, 8.6315e-02, 4.5183e-06], device='cuda:0')
top_n_sorted_tokens=['a', 'age', 'ag', '<0x61>']
next_token_ids=tensor([[29874]], device='cuda:0')
top_n_sorted_ids=tensor([29887, 479, 106], device='cuda:0')
top_n_sorted_probs=tensor([7.1215e-01, 2.8767e-01, 1.7904e-04], device='cuda:0')
top_n_sorted_tokens=['g', 'ge', '<0x67>']
next_token_ids=tensor([[29887]], device='cuda:0')
top_n_sorted_ids=tensor([29872, 104], device='cuda:0')
top_n_sorted_probs=tensor([0.9938, 0.0062], device='cuda:0')
top_n_sorted_tokens=['e', '<0x65>']
next_token_ids=tensor([[29872]], device='cuda:0')
top_n_sorted_ids=tensor([29908, 1115, 37], device='cuda:0')
top_n_sorted_probs=tensor([8.8360e-01, 1.1640e-01, 3.3596e-08], device='cuda:0')
top_n_sorted_tokens=['"', '":', '<0x22>']
next_token_ids=tensor([[29908]], device='cuda:0')
top_n_sorted_ids=tensor([29871, 13, 584, 259, 1678, 418, 268, 462, 4706, 539,
3986, 18884, 308, 965, 9651, 795, 632, 1669, 29901, 61,
35], device='cuda:0')
top_n_sorted_probs=tensor([6.0663e-01, 2.9392e-01, 9.2920e-02, 2.4406e-03, 1.3480e-03, 8.0887e-04,
3.4874e-04, 3.0195e-04, 2.9884e-04, 2.1358e-04, 1.4869e-04, 1.2988e-04,
1.0754e-04, 9.9484e-05, 9.4625e-05, 6.0540e-05, 5.8317e-05, 5.6355e-05,
1.7529e-05, 3.7057e-08, 3.7056e-08], device='cuda:0')
top_n_sorted_tokens=['▁', '<0x0A>', '▁:', '▁▁', '▁▁▁', '▁▁▁▁▁', '▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', ':', '<0x3A>', '<0x20>']
next_token_ids=tensor([[29871]], device='cuda:0')
top_n_sorted_ids=tensor([ 13, 584, 29901, 29871, 462, 4706, 9651, 18884, 418, 308,
3986, 795, 259, 1669, 1678, 268, 965, 632, 539, 61,
35], device='cuda:0')
top_n_sorted_probs=tensor([9.9445e-01, 5.4149e-03, 1.0617e-04, 6.0207e-06, 2.8670e-06, 2.1342e-06,
2.1228e-06, 1.9882e-06, 1.6228e-06, 1.3470e-06, 1.3062e-06, 1.2652e-06,
1.1164e-06, 1.1083e-06, 9.3403e-07, 6.5418e-07, 5.7422e-07, 4.9309e-07,
4.8139e-07, 2.2805e-07, 2.2804e-07], device='cuda:0')
top_n_sorted_tokens=['<0x0A>', '▁:', ':', '▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁', '▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '<0x3A>', '<0x20>']
next_token_ids=tensor([[13]], device='cuda:0')
top_n_sorted_ids=tensor([ 13, 29871, 1678, 4706, 9651, 259, 462, 418, 18884, 308,
268, 3986, 29901, 795, 632, 539, 965, 1669, 584, 61,
35], device='cuda:0')
top_n_sorted_probs=tensor([5.9909e-01, 1.0364e-01, 7.8102e-02, 4.9883e-02, 2.9212e-02, 2.6430e-02,
2.3243e-02, 2.2374e-02, 1.4912e-02, 1.1109e-02, 1.0445e-02, 9.9495e-03,
5.7586e-03, 5.7388e-03, 4.1190e-03, 2.7488e-03, 1.9237e-03, 1.2609e-03,
5.5879e-05, 1.1995e-08, 1.1995e-08], device='cuda:0')
top_n_sorted_tokens=['<0x0A>', '▁', '▁▁▁', '▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁', '▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁', '▁▁▁▁', '▁▁▁▁▁▁▁▁▁', ':', '▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁', '▁▁▁▁▁▁▁▁▁▁▁▁▁▁', '▁:', '<0x3A>', '<0x20>']
next_token_ids=tensor([[29871]], device='cuda:0')
```

After sampling the final double quote, here are the decoded results for the two highest-probability tokens:

```python
model.tokenizer.tokenizer.convert_ids_to_tokens([13])
# ['<0x0A>']
model.tokenizer.decode([13])
# ['\n']
# A prefix token apparently
model.tokenizer.tokenizer.convert_ids_to_tokens([29871])
# ['▁']
model.tokenizer.decode([29871])
# ['']
```

Apparently we're treating the prefix token by itself as an empty string. Let's take a look at the encoding and decoding situation for the prefix and sampled token results just before the observed loop:

```python
model.tokenizer.tokenizer.encode('Give me a character description{"age"\n')
# [1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 13]
model.tokenizer.tokenizer.encode('Give me a character description{"age" ')
# [1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871]
model.tokenizer.tokenizer.encode('Give me a character description{"age" \n')
# [1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871, 13]
model.tokenizer.tokenizer.decode([1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 13])
# '<s> Give me a character description{"age"\n'
model.tokenizer.tokenizer.decode([1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871])
# '<s> Give me a character description{"age" '
model.tokenizer.decode([1, 25538, 592, 263, 2931, 6139, 6377, 482, 29908, 29871])
# ['', 'Give', 'me', 'a', 'character', 'description', '{"', 'age', '"', '']
```

Again, the prefix token is decoded to an empty string by the `model.tokenizer.decode` wrapper.

Also, here's the unconstrained result that would be produced starting from the token sequence just before the loop observed during constrained generation:

```python
from outlines.generate.api import text
from outlines.generate.samplers import greedy

def test_sampler(*args):
    res = greedy(*args)
    print(res)
    return res

text_gen = text(model, max_tokens=1, sampler=test_sampler)
text_gen('Give me a character description{"age"')
# tensor([[29871]], device='cuda:0')
# ''
```

These results show that, at the same point in the sequence, we're getting exactly the results we would otherwise get without the constraints, so that's a good sign that we're at least not using the wrong support.
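Separately, one way to narrow down whether the empty-string decode of the bare prefix token comes from the underlying Hugging Face tokenizer or from the Outlines tokenizer wrapper would be to compare the two decode paths directly; this is a hypothetical follow-up using the same `model` object as above, not something that was run here:

```python
# `model.tokenizer` is the Outlines tokenizer wrapper used above;
# `model.tokenizer.tokenizer` is the underlying Hugging Face tokenizer.
raw_decode = model.tokenizer.tokenizer.decode([29871])  # HF/SentencePiece view of the prefix token
wrapped_decode = model.tokenizer.decode([29871])        # Outlines wrapper view (shown above as [''])
print(f"{raw_decode!r} vs {wrapped_decode!r}")
```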
-
If I understand Outlines correctly, is it expected that this won't happen when at most one whitespace is allowed? After the first generated
I'm curious: what is the point of allowing whitespace in generated JSON at all? Wouldn't always producing minified JSON be more straightforward and more efficient, while also fixing this issue?
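For reference, "minified" here would mean emitting JSON with no whitespace between structural tokens at all, which is easy to illustrate with the standard library (this only shows the target output format, not how Outlines builds its regex):

```python
import json

data = {"age": 30, "name": "Aura"}

# Pretty-printed form: a schema regex must allow whitespace between tokens.
print(json.dumps(data, indent=2))

# Minified form: no inter-token whitespace, so no optional-whitespace class
# would be needed in the schema regex at all.
print(json.dumps(data, separators=(",", ":")))
```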
-
What behavior of the library made you think about the improvement?

Running the above script results in an exception because the LLM generates

```
"age" \n \n \n \n .......
```

Now, when we update the regex definition of whitespace to

```python
whitespace = r"[\n ]?"
```

it leads to better output generation.

How would you like it to behave?

No response
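To make the reported difference concrete, here is a standalone illustration using only Python's re module; the fragment pattern below is a simplified stand-in for the schema regex Outlines actually generates:

```python
import re

# A repeated whitespace class: unbounded runs of '\n' and ' ' stay legal.
repeated = r'\{"age"[\n ]*:[\n ]*[0-9]+\}'
# The proposed whitespace = r"[\n ]?": at most one whitespace character is legal.
at_most_one = r'\{"age"[\n ]?:[\n ]?[0-9]+\}'

looping = '{"age"' + "\n" * 50 + ': 30}'   # shaped like the runaway output reported above
normal = '{"age": 30}'

print(bool(re.fullmatch(repeated, looping)))      # True  -> the newline run never violates the pattern
print(bool(re.fullmatch(at_most_one, looping)))   # False -> the run is cut off after one character
print(bool(re.fullmatch(repeated, normal)))       # True
print(bool(re.fullmatch(at_most_one, normal)))    # True
```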