-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integrate PCRE capture handling #12
Conversation
This is mainly used for fuzz testing -- we can use gen to walk a DFA to generate matching input strings up to a certain length, so then we can compare capture behavior against PCRE for those particular inputs. amend: gen tests
This is a big commit, unfortunately difficult to break apart further due to interface changes, metadata being passed through whole-FSM transformations, and so on. Sorry about that. - Delete code related to capture action metadata on edges. That approach made FSM transformations (determinisation, minimisation, etc.) considerably more expensive, and there were some corner cases that I wasn't able to get working correctly. - Switch to a somewhat simpler method, adapted from Russ Cox's "Regular Expression Matching: the Virtual Machine Approach". Since the capture resolution metadata (an opcode program for a virtual machine) is associated with individual end states, this combines cleanly when multiple regexes are unioned into a single large DFA that matches them all at once. - Add lots of capture regression tests, mostly from using libfsm's `fsm_generate_matches` and a fuzzer to compare behavior against PCRE. This brought many, many obscure cases to light. - Delete capture tests based on the old interface. The new one does not work with state machines built manually using libfsm's interafces, only via compilation from regex. - Some performance improvements to trimming and minimisation, mostly due to better utilizing bit-parallelism in the edge set data structure. - Switch to using new ADTs in several places. amend: interface changes
See katef#386 on katef/libfsm. This is a workaround for a bug in the parser -- once the fuzzer finds it, it tends to get in the way of finding deeper issues.
Previously it sorted the ALT case subtrees to find and discard unique ones, but capture results are affected by ALT case ordering, so we need to preserve ordering while eliminating duplicates.
To build, uncomment `PCRE_CMP=1` in fuzz/Makefile. This depends on libpcre2-8.
This was pretty much the minimal amount I needed for manual testing. - `-FC`: No captures, do not build capture metadata even when dialects support it. - Generating input: `build/bin/re -rpcre -G10 '^a*b*$'` - Capture resolution: `build/bin/re -rpcre -R '^a(b*)c$' abbbbbc` -- 0: 0,7 -- 1: 1,6 - Multiple regexes: `build/bin/re -rpcre -pgC -Y file_with_one_regex_per_line` Capture resolution is not implemented for -y or multi-regex yet, because `fsm_exec_with_captures` needs all input buffered ahead of time and because multi-regex isn't passing along the capture bases and matching those up based on the result.
There are several tests that have nothing to do with captures, capture behavior is tested directly with `tests/captures/`.
There are a couple CI failures:
|
@@ -64,6 +65,8 @@ enum re_errno { | |||
RE_EERRNO = 1 | RE_MISC, | |||
RE_EBADDIALECT = 2 | RE_MISC, | |||
RE_EBADGROUP = 3 | RE_MISC, | |||
RE_EUNSUPCAPTUR = 4 | RE_MISC, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hello Ken
Also, add a comment about how the iterators yield collated .to states, so we can avoid the overhead of sorting later.
Update several skipped tests. These should now be run, and expect libfsm to reject them as unsupported. These will be fixed by the next commit. Also, fix the test runner's handling of unsupported inputs.
Expand analysis to detect and reject this special case. It's not likely to be worth supporting, but was previously not identified correctly at compile-time. This has not been fuzzed yet, but with it all tests pass, including several that were previously set to SKIP.
Add regression test, found via fuzzing.
The new commits I just pushed address the capture tests that were marked as |
This is going to go in PRs for the public upstream repo (starting with katef#440), then will merge here as part of a sync from upstream. |
This integrates a new approach for handling captures in one or more PCRE regexes.
This a draft because:
capture_test_case_list.c
are currently marked asSHOULD_SKIP
, such as the one for^(?:($|x))+$
. They are all variants of the same underlying issue, which is that repetition with a+
surrounding multiple ALT cases where earlier subtrees can match without consuming input is currently handled incorrectly -- I have confirmed that my intended fix for the generated capture resolution opcodes works, but I'm still working on flagging that case without false positives.{count,}
where count is > 1. This isn't a deep issue,(?:abc){3,}
should desugar to(?:abc){3}(?:abc)*
, I just haven't set it up yet.re
aren't enabled, such as capture resolution for multiple files.Test Data
Getting test data from
capture_test_case_list.c
: Build and run the tests, then runbuild/tests/capture/run_test_case_list -vvvvvvvvv
(yes, nine '-v's). This will write out the generated programs and character classes for each test intoprog_output
, in a format that should be relatively easy to scrape. Sorry about the hackiness here -- I'm going to replace libfsm's output generation next, which will give a much better interface for converting regexes to standalone test data.The test case output looks like this:
The
fsm_capture_dump_program_end_mapping
andfsm_capture_dump_active_for_ends
can be ignored, if testing a port of the capture VM to another language you only need the following:regex "$(x?)(y?)(z?)", input "a", match SHOULD_MATCH, no_nl 0, count 4: 0:[1, 1] 1:[1, 1] 2:[1, 1] 3:[1, 1]
capture_vm_program.h
)and then check that executing the program with the same input sets the capture buffers correctly, here to:
The capture VM is in
src/libfsm/capture_vm*
.