Translate JSON Schema to YAML regex #1022

Open
wants to merge 11 commits into
base: main

Conversation

patricebechard
Contributor

@patricebechard patricebechard commented Jul 6, 2024

This is a tentative implementation of a regex generator for arbitrary YAML given a JSON Schema. This PR relates to #923.
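To make the idea concrete, here is a minimal sketch of what translating a JSON Schema into a regex that accepts YAML could look like. It is not the code from this PR: the function name `yaml_regex_for_flat_object` and the `SCALAR_PATTERNS` table are hypothetical, and the sketch assumes a flat object whose properties are all required, scalar, unquoted, and emitted in schema order.

```python
import re

# Hypothetical per-type patterns for unquoted YAML scalars (illustrative only,
# much narrower than what real YAML allows).
SCALAR_PATTERNS = {
    "integer": r"-?(?:0|[1-9][0-9]*)",
    "boolean": r"(?:true|false)",
    "string": r"[A-Za-z0-9_][A-Za-z0-9_ ]*",
}


def yaml_regex_for_flat_object(schema: dict) -> str:
    """Build a regex for a flat YAML mapping from a JSON Schema object.

    Assumption: every property is required, scalar, and appears in schema order.
    """
    lines = []
    for name, prop in schema["properties"].items():
        # One "key: value" line per property.
        lines.append(re.escape(name) + r": " + SCALAR_PATTERNS[prop["type"]])
    # Lines are separated by a literal newline in the generated text.
    return r"\n".join(lines)


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "active": {"type": "boolean"},
    },
}

pattern = yaml_regex_for_flat_object(schema)
matches_valid = re.fullmatch(pattern, "name: Ada Lovelace\nage: 36\nactive: true") is not None
matches_wrong_type = re.fullmatch(pattern, "name: Ada\nage: thirty\nactive: true") is not None
```

Nested objects, arrays, optional properties, and indentation are left out here; those are the parts that make the general problem harder than this sketch suggests.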

There are still some issues:

@rlouf @lapp0

@rlouf rlouf marked this pull request as ready for review July 9, 2024 19:46
@rlouf
Member

rlouf commented Jul 12, 2024

Thank you for the PR! It looks like many tests related to these changes are failing…

Contributor

@lapp0 lapp0 left a comment

Great work on this! Thanks so much for implementing.

Could we harden this a bit with some test cases? I've written some test cases that could help

@patricebechard
Contributor Author

I have already copied some tests from test_json_schema.py and updated them for the YAML use case, but I am open to reusing the code as you have done it (or simply add more tests from the test_json file and edit them).

Given that you're refactoring some of the code on your branch, what is the easiest way to go forward to minimize the amount of duplication / deprecation of code?

@lapp0
Contributor

lapp0 commented Jul 17, 2024

> I have already copied some tests from test_json_schema.py and updated them for the YAML use case, but I am open to reusing the code as you have done it (or simply add more tests from the test_json file and edit them).
>
> Given that you're refactoring some of the code on your branch, what is the easiest way to go forward to minimize the amount of duplication / deprecation of code?

I've refactored json_schema.py, but unless you're interested in incorporating those changes, we can just focus on test_json_schema.py for now. Replacing test_json_schema.py on your branch with my branch's version and ensuring it works is sufficient. The new module simply ensures that the behavior tested in json_schema.py is matched by yaml_schema.py.

Please let me know if you have any other questions.

@lapp0
Contributor

lapp0 commented Jul 29, 2024

@patricebechard is there anything I can do to help with this?

@patricebechard
Contributor Author

Sorry, I was quite busy lately, but I can work on it this week. I will let you know if I need help with anything.

@lapp0
Contributor

lapp0 commented Jul 29, 2024

No worries at all, thanks for your continued work!

@patricebechard
Contributor Author

I was finally able to make some changes including support for indentation.

Some caveats:

  • I am currently skipping some tests because the behavior differs between YAML and JSON in some cases (e.g. datetimes without quotes)
  • the implementation differs from what @lapp0 has on his branch, which would mean we would have to do some refactoring at some point.

I am also making sure that we support both quoted and unquoted strings for YAML. One of the main advantages of using YAML for guided generation is that it needs fewer tokens. If we add double quotes around every string, we end up with a generation that is almost as large as the one obtained in JSON, so transitioning to YAML would not make sense.
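As a rough illustration of this point, the sketch below defines a string pattern that accepts double-quoted, single-quoted, and unquoted scalars, then compares the size of the same record in JSON, quoted YAML, and unquoted YAML. The patterns are simplified assumptions rather than the ones used in yaml_schema.py, and character count is only a proxy for token count, which depends on the tokenizer.

```python
import re

# Assumed, simplified patterns; real YAML plain scalars follow more rules.
DOUBLE_QUOTED = r'"[^"\\\n]*"'
SINGLE_QUOTED = r"'[^'\n]*'"
UNQUOTED = r"[A-Za-z_][A-Za-z0-9_ ]*[A-Za-z0-9_]|[A-Za-z_]"
YAML_STRING = f"(?:{DOUBLE_QUOTED}|{SINGLE_QUOTED}|{UNQUOTED})"

# All three spellings of the same value are accepted.
accepted = [
    re.fullmatch(YAML_STRING, s) is not None
    for s in ['"Montreal"', "'Montreal'", "Montreal"]
]
# An unbalanced quote is rejected.
rejects_unbalanced = re.fullmatch(YAML_STRING, '"Montreal') is None

# The same record in three serializations (hand-written for the comparison).
json_doc = '{"city": "Montreal", "country": "Canada"}'
yaml_quoted = 'city: "Montreal"\ncountry: "Canada"'
yaml_plain = "city: Montreal\ncountry: Canada"

# Character counts as a crude stand-in for token counts.
sizes = {
    "json": len(json_doc),
    "yaml_quoted": len(yaml_quoted),
    "yaml_plain": len(yaml_plain),
}
```

On this small record the unquoted YAML is the shortest and the JSON the longest, with quoted YAML in between. Whether that carries over to token counts would need checking with the actual tokenizer.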

@lapp0
Contributor

lapp0 commented Aug 12, 2024

> I am also making sure that we support both quoted and unquoted strings for YAML. One of the main advantages of using YAML for guided generation is that it needs fewer tokens. If we add double quotes around every string, we end up with a generation that is almost as large as the one obtained in JSON, so transitioning to YAML would not make sense.

We may want to smoke test qualitative generation performance when using YAML. Out of scope for this PR, but disallowing quotes may, in some cases, confuse the model.

> the implementation differs from what @lapp0 has on his branch, which would mean we would have to do some refactoring at some point.

This is fine for now.

Thanks for getting this working! Please let me know if this PR is ready for review.

@patricebechard
Contributor Author

Not exactly sure how the coverage is computed here. It says the coverage for yaml_schema.py is ~1% although it should be higher. Any idea how to remedy this? This did not happen previously from what I understand.

@lapp0
Contributor

lapp0 commented Aug 19, 2024

Change in coverage may be related to #1089

I've created a separate issue to address this problem: #1105

Contributor

@lapp0 lapp0 left a comment

Requesting json_schema.py types fix.

Good work integrating the additional yaml tests. I'd like to smoke test a bit before we merge as well to ensure there aren't some edge cases we're missing.

)
DATE = r'("(?:\d{4})-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2][0-9]|3[0-1])"|\'(?:\d{4})-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2][0-9]|3[0-1])\'|(?:\d{4})-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2][0-9]|3[0-1]))'
TIME = r'("(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(\\.[0-9]+)?(Z)?"|\'(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(\\.[0-9]+)?(Z)?\'|(2[0-3]|[01][0-9]):([0-5][0-9]):([0-5][0-9])(\\.[0-9]+)?(Z)?)'
UUID = r'("[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"|\'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\'|[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})'
Contributor

Shouldn't this only be in yaml_schema.py? It allows for single-quote and no-quote datetime, date, time, and UUID.
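The concern can be checked with a short sketch. It reuses the date body from the `DATE` pattern in the hunk above and compares a double-quote-only variant, which is what a JSON string would need, against the three-way alternation. The names `DATE_BODY`, `JSON_DATE`, and `YAML_DATE` are introduced here for illustration and are not taken from the PR.

```python
import re

# Date body copied from the DATE pattern in the diff hunk.
DATE_BODY = r"(?:\d{4})-(?:0[1-9]|1[0-2])-(?:0[1-9]|[1-2][0-9]|3[0-1])"

# JSON strings must be double-quoted.
JSON_DATE = f'"{DATE_BODY}"'
# Same shape as the pattern in the hunk: double-quoted, single-quoted, or bare.
YAML_DATE = f'("{DATE_BODY}"|\'{DATE_BODY}\'|{DATE_BODY})'

samples = ['"2024-07-06"', "'2024-07-06'", "2024-07-06"]

json_results = [re.fullmatch(JSON_DATE, s) is not None for s in samples]
yaml_results = [re.fullmatch(YAML_DATE, s) is not None for s in samples]
```

With these definitions `json_results` is `[True, False, False]` and `yaml_results` is `[True, True, True]`, so if the three-way pattern lives in json_schema.py it would let a JSON generator emit single-quoted or bare dates, which are not valid JSON strings.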

Labels
enhancement JSON structured generation Linked to structured generation
Development

Successfully merging this pull request may close these issues.

Translate Pydantic models into a regular expression that accept the corresponding YAML
3 participants