Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow transition from end states to self with EOS as the only token allowed #606

Closed
wants to merge 4 commits into from

Conversation

viktor-ferenczi
Copy link

@viktor-ferenczi viktor-ferenczi commented Feb 3, 2024

Fixes #605

@viktor-ferenczi viktor-ferenczi changed the title Fixes #605: Crash with KeyError on certain regex Fix for crash with KeyError on certain regex Feb 3, 2024
@viktor-ferenczi
Copy link
Author

@rlouf Simple crash fix

Copy link
Contributor

@lapp0 lapp0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, makes sense, haven't smoke tested. Would prefer a test case which fails before this change and passes after though.

@viktor-ferenczi
Copy link
Author

viktor-ferenczi commented Feb 4, 2024

A simpler regex to reproduce: r'`\n(\.\n)?`\n'

In this case state 5 is missing from states_to_token_maps, while 6 exists.

The regression test will verify the resulting states_to_token_maps.

@viktor-ferenczi
Copy link
Author

viktor-ferenczi commented Feb 4, 2024

Added a regression test case, which fails before the fix: test_regression_regex_missing_final_state

Apparently test_regex has to be fixed to expect the final state in states_to_token_maps. I think that test case was wrong, that's why the bug wasn't detected originally.

Please note that it is much cleaner if we use visibly different token IDs from the state numbers, like token IDs above 100 in the case of my test. This makes the values in states_to_token_maps much more readable in tests.

With the fix applied test_regex_vocabulary_error fails, because it does not get the expected exception. I think the following check is fishy in RegexFSM.__init__:

            # We make sure that it is possible to generate strings in the language
            # of the regular expression with the tokens present in the model's
            # vocabulary.
            if not any(
                regex_fsm.finals.intersection(v.values())
                for v in states_to_token_maps.values()
            ):
                raise ValueError(
                    "The vocabulary does not allow us to build a sequence that matches the input regex"
                )

I think such tests should not remain in the production code, but go into test cases.

Also, the cacheable_vocabulary argument of the create_states_mapping inner function is not used at all, which seems to point to unfinished code.

It reproduces the case where state 5 is missing from the generated `fsm.states_to_token_maps`.
This test case fails now, which is expected until the fix is applied.
# Allow transitions to EOS from all terminals FSM states that are
# reachable
# TODO: Do we really need this anymore?
# Allow transitions to EOS from all terminals FSM states that are reachable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be covered by this function. We should understand why it isn't in your case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can get on board with this, but if I understand well we should be able to remove the if state in self.final_states: return [eos_token_id] in RegexFSM?

@rlouf rlouf changed the title Fix for crash with KeyError on certain regex Allow transition from end states to self with EOS as the only allowed token allowed Feb 12, 2024
@rlouf rlouf changed the title Allow transition from end states to self with EOS as the only allowed token allowed Allow transition from end states to self with EOS as the only token allowed Feb 12, 2024
@rlouf
Copy link
Member

rlouf commented Mar 6, 2024

I think the core issue is that a recent change removed the actual FSM final states from RegexFSM's set of final states. This must have caused the issue that you observed. I will revert this change and add your test case to see if it runs. I will also add a test case for when a regex leads to a FSM with several final states.

@rlouf
Copy link
Member

rlouf commented Mar 7, 2024

Should be fixed by #734. Closing for now.

@rlouf rlouf closed this Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crash with KeyError due to missing key in states_to_token_maps
3 participants