Benchmarking model performance with and without Outlines #336

MayankAgarwal · 2023-11-06T22:42:41Z

Are there benchmark datasets we can use to assess model performance with and without Outlines? For example, a dataset whose output is JSON, or whose output can be modeled as relatively complex JSON.

The reason I am asking is that, while constrained generation conceptually makes sense as a way to help the model produce valid output, sometimes there is likelihood misalignment or likelihood collapse that results in degenerate output, such as empty strings or runs of spaces in the JSON until the maximum number of tokens is reached. In my opinion, benchmark results with and without constraints would give better insight into the gains Outlines provides.
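A minimal sketch of how the degenerate case described above could be flagged automatically, assuming "degenerate" means output that parses as JSON but whose string values are all empty or whitespace. The function name `is_degenerate` and the `min_chars` threshold are illustrative, not part of Outlines. Output truncated at the token limit usually fails to parse and would need to be counted separately.

```python
import json


def is_degenerate(raw: str, min_chars: int = 1) -> bool:
    """Return True if `raw` is valid JSON whose string leaves are all (near) empty."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        # Invalid or truncated JSON is a different failure mode; not counted here.
        return False

    def string_leaves(node):
        # Walk nested dicts and lists, yielding every string value.
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for child in node.values():
                yield from string_leaves(child)
        elif isinstance(node, list):
            for child in node:
                yield from string_leaves(child)

    leaves = list(string_leaves(value))
    # Outputs with no string values at all are not flagged.
    return bool(leaves) and all(len(s.strip()) < min_chars for s in leaves)
```

Counting how often this returns `True` for constrained versus unconstrained runs on the same prompts would give one concrete number to compare.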

Looking for everyone's thoughts on this topic and on whether there are datasets we can start with. Thanks!

rlouf · 2023-11-06T22:53:47Z

Maintainer

No such dataset as far as I know. Do you have examples of such failures?

7 replies

janvandervegt-db Nov 13, 2023

I'm running into similar issues. Without constraints I'm getting much better quality responses (sometimes with some random text before or after the JSON, sometimes it doesn't actually adhere to JSON but the content is good) than with the constraints. This happens with various OSS models (LLaMa-2 70b, MPT 7b/30b).

I'm going to spend some time today collecting data to fine-tune a model that is given a task and an output format and then produces the corresponding JSON. My expectation is that a model fine-tuned on these JSON instructions should do much better in both scenarios.

rlouf Nov 13, 2023
Maintainer

We have ideas for how we could make guided generation produce better-quality output, but first I need to understand what is meant by "much better quality" in this context. How is it measured, precisely?
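One possible starting point for a precise measure, sketched under the assumption that raw outputs from both setups have already been collected as lists of strings: the fraction of outputs that parse as JSON. The names `json_validity_rate` and `compare` are hypothetical. This captures structural validity only and says nothing about the quality of the content.

```python
import json


def json_validity_rate(outputs):
    """Fraction of raw model outputs that parse as JSON (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    valid = 0
    for raw in outputs:
        try:
            json.loads(raw)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)


def compare(constrained, unconstrained):
    """Report the validity rate for each setup side by side."""
    return {
        "constrained": json_validity_rate(constrained),
        "unconstrained": json_validity_rate(unconstrained),
    }
```

A content-level metric (for example, agreement with reference answers) would have to be added on top of this to address the "quality" part of the question.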

rlouf Nov 20, 2023
Maintainer

@MayankAgarwal Any chance you could share the general structure of the prompt you used for the constrained and unconstrained cases? Is the model you used an instruct model?

janvandervegt-db Nov 20, 2023

It looks like I made a mistake with my instruct model by not using the right instruction template. With the correct template it's working much better than before. I will do some non-sensitive, qualitative tests this week to compare constrained and unconstrained generation.

janvandervegt-db Nov 21, 2023

Using mpt-30b-instruct, I'm getting very similar semantic results with and without Outlines, but without Outlines the structure is not always accurate (even disregarding the extra instruction text, the produced JSON is not valid).

Here is the prompt used with and without Outlines:
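To separate "extra text around the JSON" from "the JSON itself is invalid" when scoring unconstrained output, a sketch like the following could be used. It relies on the standard library's `json.JSONDecoder.raw_decode`; the helper name `extract_first_json` is an assumption for illustration.

```python
import json


def extract_first_json(text):
    """Return the first JSON object or array embedded in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "[{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            # Not valid JSON from this position; try the next candidate.
            continue
    return None
```

If this returns `None` for an unconstrained output, the failure is in the JSON itself rather than in the surrounding instruction text.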

Generate a list of 5 non-fiction books. Make sure that there is a good amount of variety and that the descriptions are long.

# EXAMPLE

{"author":{"first_name":"Stephen","last_name":"King"},"title":"The Dark Half","genre":"fiction","year":1989,"pages":200,"descr":"Thad Beaumont is an author and recovering alcoholic who lives in the town of Ludlow, Maine. Thad's own books – cerebral literary fiction – are not very successful. However, under the pen name \"George Stark\", he writes highly successful crime novels about a psychopathic killer named Alexis Machine. When Thad's authorship of Stark's novels becomes public knowledge, Thad and his wife, Elizabeth, decide to stage a mock burial for his alter ego at the local cemetery, which is featured in a People magazine article. Stark's epitaph says it all: \"Not A Very Nice Guy.\"\n"}

# OUTPUT INSTRUCTIONS

Answer in valid JSON. Here are the different objects relevant for output, with their properties:

Author:
  first_name (str): First name author
  last_name (str): Last name author

Book:
  author (Author)
  title (str): Title of book
  genre (str): Book genre
  year (int): Year of release
  pages (int): Number of pages
  descr (str): Description of the book, at least 5 full sentences

Return a valid JSON list of type `Book`
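For scoring outputs against the `Author` and `Book` structure in the prompt above, a minimal standard-library check might look like the sketch below. The helper names are hypothetical, and the check covers field presence and types only; it does not verify the "at least 5 full sentences" requirement, and an empty list passes.

```python
import json

# Field names and types taken from the prompt's OUTPUT INSTRUCTIONS.
AUTHOR_FIELDS = {"first_name": str, "last_name": str}
BOOK_FIELDS = {"title": str, "genre": str, "year": int, "pages": int, "descr": str}


def _has_fields(obj, fields):
    """True if `obj` is a dict with every field present and of the expected type."""
    if not isinstance(obj, dict):
        return False
    for name, kind in fields.items():
        value = obj.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, kind):
            return False
    return True


def is_valid_book(obj):
    """Check one Book object, including its nested Author."""
    return _has_fields(obj, BOOK_FIELDS) and _has_fields(obj.get("author"), AUTHOR_FIELDS)


def is_valid_book_list(raw):
    """True if `raw` parses as a JSON list in which every item is a valid Book."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(value, list) and all(is_valid_book(book) for book in value)
```

Running `is_valid_book_list` over the outputs of both setups gives a schema-conformance rate that is stricter than plain JSON validity.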
