Translate Pydantic models into a regular expression that accepts the corresponding YAML #923

Open
rlouf opened this issue May 27, 2024 · 2 comments · May be fixed by #1022 or #1182

Comments

@rlouf
Member

rlouf commented May 27, 2024

No description provided.

@lapp0
Contributor

lapp0 commented May 29, 2024

I love the idea. YAML uses fewer syntactic tokens, so language models can generate it without having to keep track of as much "nesting" / context.

Here's the strategy I have in mind; I would love to hear your thoughts:

We should refactor fsm/json_schema.py so it uses a class-based approach with a handler method for each type. Then we can subclass it and override only the handlers whose behavior differs in YAML.

class JSONSchemaRegexGenerator:
    def __init__(self):
        # Map each JSON Schema type to the method that builds its regex.
        self.handlers = {
            "string": self.handle_string,
            "array": self.handle_array,
            ...
        }

    @classmethod
    def get_pattern(cls, schema):
        # Entry point: build a generator and convert the root schema node.
        return cls().handle_node(schema)

    def handle_node(self, node):
        # Dispatch on the node's type; fall back to a default handler.
        handler = self.handlers.get(node["type"], self.handle_default)
        return handler(node)

    def handle_string(self, node):
        return STRING

    def handle_array(self, node):
        ...
        return rf"\[{whitespace_pattern}(({'|'.join(regexes)})(,{whitespace_pattern}({'|'.join(regexes)})){num_repeats}){allow_empty}{whitespace_pattern}\]"


class YAMLSchemaRegexGenerator(JSONSchemaRegexGenerator):
    def handle_array(self, node):
        """Handle the format for YAML arrays:
            - elem0
            - elem1
        """
        ...

This would make the code more readable and extensible, reduce technical debt, and remove the need for conditional handling of a passed is_yaml flag across many rules within to_regex().
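
To make the proposal concrete, here is a minimal runnable sketch of the handler-based design. The STRING and INTEGER patterns, handle_integer, handle_default, and the simplified array regexes are assumptions made up for illustration; they are not the patterns in fsm/json_schema.py, and the YAML handler only covers flat, non-empty block sequences.

```python
import re

# Hypothetical, simplified patterns (not the ones in fsm/json_schema.py).
STRING = r'"[^"\n]*"'
INTEGER = r"-?(0|[1-9][0-9]*)"


class JSONSchemaRegexGenerator:
    def __init__(self):
        # Bound methods, so a subclass override is picked up automatically.
        self.handlers = {
            "string": self.handle_string,
            "integer": self.handle_integer,
            "array": self.handle_array,
        }

    @classmethod
    def get_pattern(cls, schema):
        return cls().handle_node(schema)

    def handle_node(self, node):
        handler = self.handlers.get(node["type"], self.handle_default)
        return handler(node)

    def handle_default(self, node):
        raise NotImplementedError(f"unsupported type: {node['type']!r}")

    def handle_string(self, node):
        return STRING

    def handle_integer(self, node):
        return INTEGER

    def handle_array(self, node):
        item = self.handle_node(node["items"])
        # JSON flow style: [a, b, c], or the empty list [].
        return rf"\[(({item})(, ?({item}))*)?\]"


class YAMLSchemaRegexGenerator(JSONSchemaRegexGenerator):
    def handle_array(self, node):
        item = self.handle_node(node["items"])
        # YAML block style: one "- item" line per element, at least one.
        return rf"(- ({item})\n)+"


schema = {"type": "array", "items": {"type": "integer"}}
json_regex = JSONSchemaRegexGenerator.get_pattern(schema)
yaml_regex = YAMLSchemaRegexGenerator.get_pattern(schema)
print(bool(re.fullmatch(json_regex, "[1, 2, 3]")))
print(bool(re.fullmatch(yaml_regex, "- 1\n- 2\n")))
```

Only handle_array is overridden in the subclass; the scalar handlers and the dispatch logic are inherited unchanged, which is the point of the refactor. Nested YAML arrays would additionally need indentation tracking, which this sketch leaves out.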

@rlouf
Member Author

rlouf commented Jun 5, 2024

I can get on board with this. To follow ast.NodeVisitor's naming scheme, we could name the handlers visit_X. I think we should implement a first version of the YAML converter, covering only a few primitives, before refactoring.
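
A rough sketch of what the visit_X naming could look like, using the same getattr-based lookup that ast.NodeVisitor.visit uses. The class names, the generic_visit fallback behavior, and the three primitive patterns are hypothetical and only meant to illustrate the dispatch; they are not taken from the library.

```python
import re


class SchemaVisitor:
    def visit(self, node):
        # Look up visit_<type>, falling back to generic_visit, mirroring
        # how ast.NodeVisitor.visit resolves visit_<ClassName>.
        method = getattr(self, f"visit_{node['type']}", self.generic_visit)
        return method(node)

    def generic_visit(self, node):
        raise NotImplementedError(f"no visitor for type {node['type']!r}")


class YAMLPrimitiveRegexVisitor(SchemaVisitor):
    # Hypothetical first version: only a few scalar types are handled.
    def visit_boolean(self, node):
        return r"(true|false)"

    def visit_integer(self, node):
        return r"-?(0|[1-9][0-9]*)"

    def visit_null(self, node):
        return r"null"


visitor = YAMLPrimitiveRegexVisitor()
print(bool(re.fullmatch(visitor.visit({"type": "integer"}), "-42")))
```

With this scheme there is no handlers dictionary to keep in sync: supporting a new type means adding a visit_<type> method, and unsupported types fail loudly in generic_visit.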

@rlouf rlouf mentioned this issue Jun 13, 2024
@patricebechard patricebechard linked a pull request Jul 6, 2024 that will close this issue
@rlouf rlouf linked a pull request Aug 21, 2024 that will close this issue
@lapp0 lapp0 linked a pull request Oct 1, 2024 that will close this issue