JSON generation fails for Llama-2-7b-chat-hf #692
Replies: 5 comments 1 reply
-
It's a known problem with small models. By default, Outlines lets the model choose the number of line breaks and whitespace characters (following the JSON standard), and small models can get lost generating that whitespace. You can restrict it by passing a `whitespace_pattern` to the logits processor.
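For reference, a minimal sketch of constraining whitespace through outlines' high-level API; this assumes `outlines.generate.json` accepts the same `whitespace_pattern` keyword as the vLLM logits processor discussed below, and the model name and schema are only illustrative:

```python
# Minimal sketch, assuming outlines' transformers integration and that
# generate.json accepts a whitespace_pattern keyword; schema and prompt
# are illustrative, not from the original thread.
import outlines

model = outlines.models.transformers("meta-llama/Llama-2-7b-chat-hf")

schema = """{
    "type": "object",
    "properties": {
        "color": {"type": "string"},
        "maker": {"type": "string"}
    },
    "required": ["color", "maker"]
}"""

# whitespace_pattern="" forbids any padding between JSON tokens, so a
# small model cannot wander off into endless line breaks and spaces.
generator = outlines.generate.json(model, schema, whitespace_pattern="")
result = generator("Describe a car.")
print(result)  # e.g. {"color": "black", "maker": "Toyota"}
```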
-
Interesting @rlouf. I updated the above to `logits_processors = [JSONLogitsProcessor(json_schema, engine.engine, whitespace_pattern="")]`, which produces `{"text":"{\"color\":\">Black•Rust\",\"maker\":\">Toyota\"}"}`. Do you know if it is generally safe to define this whitespace pattern for all models, or just Llama ones?
-
@vegaluisjose this should be fine for any model that is capable of generating JSON without whitespace separation, which I believe is almost all of them. IMHO @rlouf, we should consider making this the default.
-
My honest answer is that I don't know; for now it's all empirical. When you guide generation, you force the sequences to be sampled from a subspace of all the sequences that could possibly be sampled given the prompt. When you set a whitespace pattern, you further reduce the size of that subspace. It might be that this gives better results overall because it's the "right" subset, but it's hard to tell without evaluation. Note that all of this is also conditional on the type of sampling algorithm you use (greedy, multinomial, beam search, etc.), so it might be that what works with one algorithm doesn't with another. Structured generation is more like a new line of research than a simple feature :)
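To make the sampler dependence concrete, here is a sketch that runs the same constrained generation under greedy and multinomial sampling; it assumes outlines' `samplers` module and that `generate.json` accepts `sampler` and `whitespace_pattern` arguments, and the model and schema are illustrative:

```python
# Sketch only: compare sampling algorithms under identical constraints.
# Assumes outlines exposes greedy/multinomial samplers and that
# generate.json accepts a sampler argument; model/schema are placeholders.
import outlines
from outlines import samplers

model = outlines.models.transformers("meta-llama/Llama-2-7b-chat-hf")
schema = '{"type": "object", "properties": {"color": {"type": "string"}}}'

for sampler in (samplers.greedy(), samplers.multinomial()):
    # The same schema and whitespace pattern can behave differently
    # depending on how the next token is picked from the masked logits.
    generator = outlines.generate.json(
        model, schema, sampler=sampler, whitespace_pattern=""
    )
    print(type(sampler).__name__, generator("Describe a car."))
```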
-
I've tried setting `whitespace_pattern` to `' '` or `'[ \n\t]'`, and the latter causes severe issues with generation. The space version also causes problems. What do you recommend, @rlouf?
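For what it's worth, here is a hedged set of patterns to experiment with between the two extremes, reusing the `json_schema` and `engine.engine` names from the processor call earlier in the thread; none of these are official recommendations:

```python
# Candidate whitespace patterns, from most to least restrictive.
# These are suggestions to experiment with, not verified defaults.
no_ws = ""                 # no whitespace between JSON tokens at all
one_space = r" ?"          # at most one optional space
bounded = r"[\n\t ]{0,2}"  # at most two whitespace characters

logits_processors = [
    JSONLogitsProcessor(json_schema, engine.engine, whitespace_pattern=one_space)
]
```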
-
Describe the issue as clearly as possible:

When running vLLM serve with outlines version `0.0.32` and `Llama-2-7b-chat-hf`, the output is not valid JSON. The `serve.py` I am using is below.

Steps/code to reproduce the bug:
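The original `serve.py` did not survive this export; as a stand-in, a rough reconstruction of the kind of setup described (assuming outlines' vLLM integration via `outlines.serve.vllm.JSONLogitsProcessor`; model path, schema, and prompt are placeholders, and constructor arguments varied across outlines versions) might look like:

```python
# Rough reconstruction, NOT the original script: assumes outlines'
# vLLM integration and an offline vLLM engine; model path, schema,
# and prompt are placeholders.
from vllm import LLM, SamplingParams
from outlines.serve.vllm import JSONLogitsProcessor

json_schema = {
    "type": "object",
    "properties": {
        "color": {"type": "string"},
        "maker": {"type": "string"},
    },
    "required": ["color", "maker"],
}

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

# The processor masks the logits at each step so that only tokens
# keeping the partial output consistent with the schema can be sampled.
logits_processors = [JSONLogitsProcessor(json_schema, llm.llm_engine)]

params = SamplingParams(max_tokens=128, logits_processors=logits_processors)
outputs = llm.generate(["Describe a car as JSON."], params)
print(outputs[0].outputs[0].text)  # expected: a valid JSON object
```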
Expected result:

Error message:

Outlines/Python version information:

Context for the issue:

I would like to know if this is a known issue for `Llama-2-7b-chat-hf`. Thanks!