Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate PCRE capture handling #12

Closed
wants to merge 24 commits into from

Conversation

silentbicycle
Copy link

@silentbicycle silentbicycle commented Feb 17, 2023

This integrates a new approach for handling captures in one or more PCRE regexes.

This a draft because:

  • A few tests in capture_test_case_list.c are currently marked as SHOULD_SKIP, such as the one for ^(?:($|x))+$. They are all variants of the same underlying issue, which is that repetition with a + surrounding multiple ALT cases where earlier subtrees can match without consuming input is currently handled incorrectly -- I have confirmed that my intended fix for the generated capture resolution opcodes works, but I'm still working on flagging that case without false positives.
  • Similarly, code generation is not implemented for {count,} where count is > 1. This isn't a deep issue, (?:abc){3,} should desugar to (?:abc){3}(?:abc)*, I just haven't set it up yet.
  • Some of the modes for re aren't enabled, such as capture resolution for multiple files.
  • The capture VM's path node sharing needs more testing, particularly in combination with the collapsed zero prefix.
  • We should be able to generate worst case memory usage info at regex compile time, which could be used by the caller to stack-allocate / reuse a preallocated buffer, but I haven't set that up yet.
  • Furthermore, we should be able to do capture resolution in a fixed preallocated amount of memory, if replay of the "correct" match prefix is interleaved with search (once all surviving threads agree on the prefix), with incremental resetting when resuming from the all-zero prefix. I haven't implemented this yet either, because it's likely to be fairly complicated, but it may be worth doing later once we have more information about typical memory usage in the fleet.

Test Data

Getting test data from capture_test_case_list.c: Build and run the tests, then run build/tests/capture/run_test_case_list -vvvvvvvvv (yes, nine '-v's). This will write out the generated programs and character classes for each test into prog_output, in a format that should be relatively easy to scrape. Sorry about the hackiness here -- I'm going to replace libfsm's output generation next, which will give a much better interface for converting regexes to standalone test data.

The test case output looks like this:

==== test_case 38
regex "$(x?)(y?)(z?)", input "a", match SHOULD_MATCH, no_nl 0, count 4: 0:[1, 1] 1:[1, 1] 2:[1, 1] 3:[1, 1]
fsm_capture_dump_active_for_ends:
 -- state 0: value 0
 -- state 0: value 1
 -- state 0: value 2
 -- state 0: value 3

==== fsm_capture_dump_programs:
# program 0, capture_count 4, base 0
0: split cont 3 new 1
1: charclass 0 -> [ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff]
2: jmp 0
3: save 0 (cap 0, start)
4: anchor end
5: save 2 (cap 1, start)
6: split cont 7 new 8
7: char 0x78 (x)
8: save 3 (cap 1, end)
9: save 4 (cap 2, start)
10: split cont 11 new 12
11: char 0x79 (y)
12: save 5 (cap 2, end)
13: save 6 (cap 3, start)
14: split cont 15 new 16
15: char 0x7a (z)
16: save 7 (cap 3, end)
17: save 1 (cap 0, end)
18: match
char_class 0: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff

fsm_capture_dump_program_end_mapping:
 -- state 0: value 0

The fsm_capture_dump_program_end_mapping and fsm_capture_dump_active_for_ends can be ignored, if testing a port of the capture VM to another language you only need the following:

  • regex "$(x?)(y?)(z?)", input "a", match SHOULD_MATCH, no_nl 0, count 4: 0:[1, 1] 1:[1, 1] 2:[1, 1] 3:[1, 1]
  • the program (opcodes are described in capture_vm_program.h)
  • its char_class list (there may be zero)

and then check that executing the program with the same input sets the capture buffers correctly, here to:

- 0: 1,1
- 1: 1,1
- 2: 1,1
- 3: 1,1

The capture VM is in src/libfsm/capture_vm*.

This is mainly used for fuzz testing -- we can use gen to
walk a DFA to generate matching input strings up to a certain
length, so then we can compare capture behavior against PCRE
for those particular inputs.

amend: gen tests
This is a big commit, unfortunately difficult to break apart
further due to interface changes, metadata being passed through
whole-FSM transformations, and so on. Sorry about that.

- Delete code related to capture action metadata on edges.
  That approach made FSM transformations (determinisation,
  minimisation, etc.) considerably more expensive, and there
  were some corner cases that I wasn't able to get working
  correctly.

- Switch to a somewhat simpler method, adapted from Russ Cox's
  "Regular Expression Matching: the Virtual Machine Approach".
  Since the capture resolution metadata (an opcode program
  for a virtual machine) is associated with individual end
  states, this combines cleanly when multiple regexes are
  unioned into a single large DFA that matches them all at once.

- Add lots of capture regression tests, mostly from using libfsm's
  `fsm_generate_matches` and a fuzzer to compare behavior against
  PCRE. This brought many, many obscure cases to light.

- Delete capture tests based on the old interface. The new one
  does not work with state machines built manually using libfsm's
  interafces, only via compilation from regex.

- Some performance improvements to trimming and minimisation,
  mostly due to better utilizing bit-parallelism in the edge
  set data structure.

- Switch to using new ADTs in several places.

amend: interface changes
See katef#386 on katef/libfsm.

This is a workaround for a bug in the parser -- once the fuzzer
finds it, it tends to get in the way of finding deeper issues.
Previously it sorted the ALT case subtrees to find and discard unique
ones, but capture results are affected by ALT case ordering, so we
need to preserve ordering while eliminating duplicates.
To build, uncomment `PCRE_CMP=1` in fuzz/Makefile. This depends on
libpcre2-8.
This was pretty much the minimal amount I needed for manual testing.

- `-FC`: No captures, do not build capture metadata even when dialects
  support it.

- Generating input: `build/bin/re -rpcre -G10 '^a*b*$'`

- Capture resolution: `build/bin/re -rpcre -R '^a(b*)c$' abbbbbc`

    -- 0: 0,7
    -- 1: 1,6

- Multiple regexes: `build/bin/re -rpcre -pgC -Y file_with_one_regex_per_line`

Capture resolution is not implemented for -y or multi-regex yet, because
`fsm_exec_with_captures` needs all input buffered ahead of time and
because multi-regex isn't passing along the capture bases and matching
those up based on the result.
There are several tests that have nothing to do with captures,
capture behavior is tested directly with `tests/captures/`.
@silentbicycle
Copy link
Author

silentbicycle commented Feb 17, 2023

There are a couple CI failures:

  • some related to the {3,} case I mentioned above returning a "not yet implemented" error.
  • some related to EFENCE rejecting allocating zero bytes. That's probably not a bug, but I'll confirm.
    Those will get fixed before I change this from a draft PR.

@@ -64,6 +65,8 @@ enum re_errno {
RE_EERRNO = 1 | RE_MISC,
RE_EBADDIALECT = 2 | RE_MISC,
RE_EBADGROUP = 3 | RE_MISC,
RE_EUNSUPCAPTUR = 4 | RE_MISC,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hello Ken

Base automatically changed from sv/sync-fastly-main-with-upstream to main February 28, 2023 21:28
Also, add a comment about how the iterators yield collated .to states,
so we can avoid the overhead of sorting later.
Update several skipped tests. These should now be run, and
expect libfsm to reject them as unsupported. These will be
fixed by the next commit.

Also, fix the test runner's handling of unsupported inputs.
Expand analysis to detect and reject this special case. It's not
likely to be worth supporting, but was previously not identified
correctly at compile-time.

This has not been fuzzed yet, but with it all tests pass, including
several that were previously set to SKIP.
@silentbicycle
Copy link
Author

The new commits I just pushed address the capture tests that were marked as SKIP. All tests are passing locally, but this needs further fuzzer burn-in time.

@silentbicycle
Copy link
Author

This is going to go in PRs for the public upstream repo (starting with katef#440), then will merge here as part of a sync from upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants