Integrate PCRE capture handling #12

silentbicycle · 2023-02-17T00:01:17Z

This integrates a new approach for handling captures in one or more PCRE regexes.

This a draft because:

A few tests in capture_test_case_list.c are currently marked as SHOULD_SKIP, such as the one for ^(?:($|x))+$. They are all variants of the same underlying issue, which is that repetition with a + surrounding multiple ALT cases where earlier subtrees can match without consuming input is currently handled incorrectly -- I have confirmed that my intended fix for the generated capture resolution opcodes works, but I'm still working on flagging that case without false positives.
Similarly, code generation is not implemented for {count,} where count is > 1. This isn't a deep issue, (?:abc){3,} should desugar to (?:abc){3}(?:abc)*, I just haven't set it up yet.
Some of the modes for re aren't enabled, such as capture resolution for multiple files.
The capture VM's path node sharing needs more testing, particularly in combination with the collapsed zero prefix.
We should be able to generate worst case memory usage info at regex compile time, which could be used by the caller to stack-allocate / reuse a preallocated buffer, but I haven't set that up yet.
Furthermore, we should be able to do capture resolution in a fixed preallocated amount of memory, if replay of the "correct" match prefix is interleaved with search (once all surviving threads agree on the prefix), with incremental resetting when resuming from the all-zero prefix. I haven't implemented this yet either, because it's likely to be fairly complicated, but it may be worth doing later once we have more information about typical memory usage in the fleet.

Test Data

Getting test data from capture_test_case_list.c: Build and run the tests, then run build/tests/capture/run_test_case_list -vvvvvvvvv (yes, nine '-v's). This will write out the generated programs and character classes for each test into prog_output, in a format that should be relatively easy to scrape. Sorry about the hackiness here -- I'm going to replace libfsm's output generation next, which will give a much better interface for converting regexes to standalone test data.

The test case output looks like this:

==== test_case 38
regex "$(x?)(y?)(z?)", input "a", match SHOULD_MATCH, no_nl 0, count 4: 0:[1, 1] 1:[1, 1] 2:[1, 1] 3:[1, 1]
fsm_capture_dump_active_for_ends:
 -- state 0: value 0
 -- state 0: value 1
 -- state 0: value 2
 -- state 0: value 3

==== fsm_capture_dump_programs:
# program 0, capture_count 4, base 0
0: split cont 3 new 1
1: charclass 0 -> [ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff]
2: jmp 0
3: save 0 (cap 0, start)
4: anchor end
5: save 2 (cap 1, start)
6: split cont 7 new 8
7: char 0x78 (x)
8: save 3 (cap 1, end)
9: save 4 (cap 2, start)
10: split cont 11 new 12
11: char 0x79 (y)
12: save 5 (cap 2, end)
13: save 6 (cap 3, start)
14: split cont 15 new 16
15: char 0x7a (z)
16: save 7 (cap 3, end)
17: save 1 (cap 0, end)
18: match
char_class 0: 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff

fsm_capture_dump_program_end_mapping:
 -- state 0: value 0

The fsm_capture_dump_program_end_mapping and fsm_capture_dump_active_for_ends can be ignored, if testing a port of the capture VM to another language you only need the following:

regex "$(x?)(y?)(z?)", input "a", match SHOULD_MATCH, no_nl 0, count 4: 0:[1, 1] 1:[1, 1] 2:[1, 1] 3:[1, 1]
the program (opcodes are described in capture_vm_program.h)
its char_class list (there may be zero)

and then check that executing the program with the same input sets the capture buffers correctly, here to:

- 0: 1,1
- 1: 1,1
- 2: 1,1
- 3: 1,1

The capture VM is in src/libfsm/capture_vm*.

This is mainly used for fuzz testing -- we can use gen to walk a DFA to generate matching input strings up to a certain length, so then we can compare capture behavior against PCRE for those particular inputs. amend: gen tests

This is a big commit, unfortunately difficult to break apart further due to interface changes, metadata being passed through whole-FSM transformations, and so on. Sorry about that. - Delete code related to capture action metadata on edges. That approach made FSM transformations (determinisation, minimisation, etc.) considerably more expensive, and there were some corner cases that I wasn't able to get working correctly. - Switch to a somewhat simpler method, adapted from Russ Cox's "Regular Expression Matching: the Virtual Machine Approach". Since the capture resolution metadata (an opcode program for a virtual machine) is associated with individual end states, this combines cleanly when multiple regexes are unioned into a single large DFA that matches them all at once. - Add lots of capture regression tests, mostly from using libfsm's `fsm_generate_matches` and a fuzzer to compare behavior against PCRE. This brought many, many obscure cases to light. - Delete capture tests based on the old interface. The new one does not work with state machines built manually using libfsm's interafces, only via compilation from regex. - Some performance improvements to trimming and minimisation, mostly due to better utilizing bit-parallelism in the edge set data structure. - Switch to using new ADTs in several places. amend: interface changes

See katef#386 on katef/libfsm. This is a workaround for a bug in the parser -- once the fuzzer finds it, it tends to get in the way of finding deeper issues.

Previously it sorted the ALT case subtrees to find and discard unique ones, but capture results are affected by ALT case ordering, so we need to preserve ordering while eliminating duplicates.

To build, uncomment `PCRE_CMP=1` in fuzz/Makefile. This depends on libpcre2-8.

This was pretty much the minimal amount I needed for manual testing. - `-FC`: No captures, do not build capture metadata even when dialects support it. - Generating input: `build/bin/re -rpcre -G10 '^a*b*$'` - Capture resolution: `build/bin/re -rpcre -R '^a(b*)c$' abbbbbc` -- 0: 0,7 -- 1: 1,6 - Multiple regexes: `build/bin/re -rpcre -pgC -Y file_with_one_regex_per_line` Capture resolution is not implemented for -y or multi-regex yet, because `fsm_exec_with_captures` needs all input buffered ahead of time and because multi-regex isn't passing along the capture bases and matching those up based on the result.

There are several tests that have nothing to do with captures, capture behavior is tested directly with `tests/captures/`.

silentbicycle · 2023-02-17T00:14:35Z

There are a couple CI failures:

some related to the {3,} case I mentioned above returning a "not yet implemented" error.
some related to EFENCE rejecting allocating zero bytes. That's probably not a bug, but I'll confirm.
Those will get fixed before I change this from a draft PR.

katef · 2023-02-28T02:47:54Z

include/re/re.h

@@ -64,6 +65,8 @@ enum re_errno {
 	RE_EERRNO       =  1 | RE_MISC,
 	RE_EBADDIALECT  =  2 | RE_MISC,
 	RE_EBADGROUP    =  3 | RE_MISC,
+	RE_EUNSUPCAPTUR =  4 | RE_MISC,


Also, add a comment about how the iterators yield collated .to states, so we can avoid the overhead of sorting later.

Update several skipped tests. These should now be run, and expect libfsm to reject them as unsupported. These will be fixed by the next commit. Also, fix the test runner's handling of unsupported inputs.

Expand analysis to detect and reject this special case. It's not likely to be worth supporting, but was previously not identified correctly at compile-time. This has not been fuzzed yet, but with it all tests pass, including several that were previously set to SKIP.

Add regression test, found via fuzzing.

silentbicycle · 2023-03-08T18:45:46Z

The new commits I just pushed address the capture tests that were marked as SKIP. All tests are passing locally, but this needs further fuzzer burn-in time.

silentbicycle · 2023-09-11T17:45:18Z

This is going to go in PRs for the public upstream repo (starting with katef#440), then will merge here as part of a sync from upstream.

silentbicycle added 17 commits February 16, 2023 13:44

include/adt/hash.h: Centralize hashing functions.

8269df1

include/adt/pv.h: Permutation vector building.

36ed4eb

Add src/adt/idmap.c, a state -> ID set map.

377c67d

Add fsm_generate_matches (src/libfsm/gen.c).

41c8e54

This is mainly used for fuzz testing -- we can use gen to walk a DFA to generate matching input strings up to a certain length, so then we can compare capture behavior against PCRE for those particular inputs. amend: gen tests

Bugfix: fsm.c: Set dst->hasstart in fsm_move.

ef08bab

UBSAN: Add char signedness casts.

7ef4a6c

parser.act: Avoid crash in parser from '(*:'.

9ed77c7

See katef#386 on katef/libfsm. This is a workaround for a bug in the parser -- once the fuzzer finds it, it tends to get in the way of finding deeper issues.

print/fsm.c: Use u64bitset ADT.

a729833

ast_rewrite: Make ast_rewrite's ALT case deduplication preserve order.

9089520

Previously it sorted the ALT case subtrees to find and discard unique ones, but capture results are affected by ALT case ordering, so we need to preserve ordering while eliminating duplicates.

Makefiles: use -std=c99

f584a9c

lx/parser.act: Disable captures.

8b6ce44

libre/print/tree.c: Add analysis metadata to AST tree printout.

6428f5e

fuzz/target.c: Update fuzz harness with PCRE comparison modes.

16676e3

To build, uncomment `PCRE_CMP=1` in fuzz/Makefile. This depends on libpcre2-8.

tests/*/Makefile: Add -FC (no captures) for some calls to RE.

719acc6

There are several tests that have nothing to do with captures, capture behavior is tested directly with `tests/captures/`.

tests/pcre-anchor/in81.re: Add test for anchoring edge case.

efeae03

silentbicycle requested review from dgryski and katef February 17, 2023 00:01

katef reviewed Feb 28, 2023

View reviewed changes

Base automatically changed from sv/sync-fastly-main-with-upstream to main February 28, 2023 21:28

silentbicycle added 6 commits March 2, 2023 09:20

internedstateset: If EXPENSIVE_CHECKS, add check for sorted input.

1f9d673

Also, add a comment about how the iterators yield collated .to states, so we can avoid the overhead of sorting later.

re_capvm_compile: Extract compile_kleen_star, also use for e.g. x{3,}.

e9c59bc

capture tests: SHOULD_SKIP -> SHOULD_REJECT_AS_UNSUPPORTED

f3e69c3

Update several skipped tests. These should now be run, and expect libfsm to reject them as unsupported. These will be fixed by the next commit. Also, fix the test runner's handling of unsupported inputs.

minimise: Add missing call to INIT_TIMERS().

0fd0f12

ast_analysis: Ensure unuspported code path is flagged as unsatisfiable.

fddcf6f

Add regression test, found via fuzzing.

ast_analysis: Correct overly strict assertion.

f39528a

silentbicycle closed this Sep 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Integrate PCRE capture handling #12

Integrate PCRE capture handling #12

silentbicycle commented Feb 17, 2023 •

edited

Loading

silentbicycle commented Feb 17, 2023 •

edited

Loading

katef Feb 28, 2023

silentbicycle commented Mar 8, 2023

silentbicycle commented Sep 11, 2023

Integrate PCRE capture handling #12

Integrate PCRE capture handling #12

Conversation

silentbicycle commented Feb 17, 2023 • edited Loading

Test Data

silentbicycle commented Feb 17, 2023 • edited Loading

katef Feb 28, 2023

Choose a reason for hiding this comment

silentbicycle commented Mar 8, 2023

silentbicycle commented Sep 11, 2023

silentbicycle commented Feb 17, 2023 •

edited

Loading

silentbicycle commented Feb 17, 2023 •

edited

Loading