Add a non_strict mode to deal with JSONDecodeError errors #985

Open
davanstrien opened this issue Jun 19, 2024 · 2 comments · May be fixed by #1159
Labels
bug · correctness (Everything related to the generation correctness) · enhancement · JSON

Comments

@davanstrien
Contributor

What behavior of the library made you think about the improvement?

It might be nice to have the option of a non_strict mode when generating over large batches of data. This could be particularly helpful for creating synthetic data, where you usually don't care too much about skipping some prompts but do care about a single failure bringing down a computationally expensive pipeline.

Currently, when using a generator constructed from a JSON Schema/Pydantic class, i.e.

generator = generate.json(model, AbstractDescriptions)

and calling the generator

results = generator(prompts, sampling_params=params)

I am running into JSONDecodeError errors, with the underlying ValidationError reporting an unterminated string:

ValidationError: 1 validation error for AbstractDescriptions
__root__
  Unterminated string starting at: line 1 column 631 (char 630) [type=value_error.jsondecode, input_value='{ "good":[ "]bad1_1_1_1_...1_1_1_1_1_1_1_1_1_1_1_1', input_type=str]

It is currently possible to try to work around this by using a different whitespace pattern, increasing the best_of value, or switching to a different LLM.
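For example, a sketch of the whitespace mitigation, assuming generate.json forwards a whitespace_pattern argument to the regex builder (the quick fix quoted later in this thread suggests it does):

# constrain inter-token whitespace to at most a single space,
# which leaves less room for runaway generations
generator = generate.json(model, AbstractDescriptions, whitespace_pattern=r"[ ]?")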

These workarounds are fine if the error shows up early and often, but it's a bit painful if you process a large number of prompts and only hit the error late in the run. One approach you can take today is to run the generation in smaller batches inside a try/except block and either reattempt a failed batch or skip it, as sketched below.
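A minimal sketch of that batching workaround; the batch size and the decision to skip rather than retry are illustrative choices, not anything Outlines prescribes:

from json import JSONDecodeError
from pydantic import ValidationError

batch_size = 32  # illustrative
results = []
for start in range(0, len(prompts), batch_size):
    batch = prompts[start : start + batch_size]
    try:
        results.extend(generator(batch, sampling_params=params))
    except (JSONDecodeError, ValidationError):
        # one bad generation poisons the whole batch, so record
        # placeholders and move on (or re-queue the batch to retry)
        results.extend([None] * len(batch))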

How would you like it to behave?

It might be nice instead to have an option flag that catches these exceptions when calling the generator and returns None for the failed generations, i.e.

results = generator(prompts, sampling_params=params, non_strict=True)

Which would return something like:

[ JSON, JSON, None]
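If this existed, a caller could pair results back to prompts and drop the failures, e.g. (hypothetical, since non_strict is only proposed here):

kept = [(p, r) for p, r in zip(prompts, results) if r is not None]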

I am not very familiar with Outlines' internals, so I don't know how feasible it would be to add this across all the LLM engines currently supported. The option should obviously be off by default, but IMO it could be useful for workloads where you care about the generations being correct but don't mind too much if one or two prompts come back as None.

I couldn't find anything proposing this before (I might have missed an issue), but some other related issues:
#759
#612

@chris-aeviator commented Jul 16, 2024

IMO this is even a critical bug rather than an enhancement.

Outlines only surfaces this once a complete batch has finished, and then all of the batch's data is lost. I'm running Outlines at ~20,000 prompts/hr.

What helped me reduce the issue is constraining the maximum field length with Field(..., max_length=); a quick fix seems to be:

/generate/json.py

def safe_parse(schema_object, x):
    # swallow any parse/validation failure and return None
    # instead of raising
    try:
        return schema_object.parse_raw(x)
    except Exception:
        return None

# [...]

    if isinstance(schema_object, type(BaseModel)):
        schema = pyjson.dumps(schema_object.model_json_schema())
        regex_str = build_regex_from_schema(schema, whitespace_pattern)
        generator = regex(model, regex_str, sampler)
        # return None for unparseable output rather than raising
        generator.format_sequence = lambda x: safe_parse(schema_object, x)
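For anyone who doesn't want to patch the library itself, a sketch of the same idea applied from user code, assuming generator.format_sequence is the parsing hook shown in the snippet above:

generator = generate.json(model, AbstractDescriptions)
strict_format = generator.format_sequence

def lenient_format(x):
    # fall back to None when the raw text fails JSON/Pydantic parsing
    try:
        return strict_format(x)
    except Exception:
        return None

generator.format_sequence = lenient_format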

@rlouf rlouf added the bug label Jul 16, 2024
@rlouf rlouf moved this to Todo in Improve Outlines Jul 19, 2024
@hugolytics

I can second the need for this feature! Maybe it could be fixed in the backend by setting up a Pydantic TypeAdapter over Union[YourModel, None], so that if the output doesn't validate it returns None (or str, or whatever).
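A minimal sketch of that idea with Pydantic v2's TypeAdapter; note that Optional[...] only admits JSON null, so malformed or truncated JSON still raises and needs an explicit try/except on top (YourModel is illustrative):

from typing import Optional
from pydantic import BaseModel, TypeAdapter, ValidationError

class YourModel(BaseModel):
    good: list[str]

adapter = TypeAdapter(Optional[YourModel])

def validate_or_none(raw: str):
    # invalid JSON raises ValidationError even with the Optional
    # union, so the None fallback has to be handled explicitly
    try:
        return adapter.validate_json(raw)
    except ValidationError:
        return None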
