Integrate PCRE capture handling #12

Closed
wants to merge 24 commits into from

Commits on Feb 16, 2023

  1. 8269df1
  2. 36ed4eb
  3. 377c67d
  4. Add fsm_generate_matches (src/libfsm/gen.c).

    This is mainly used for fuzz testing -- we can use gen to
    walk a DFA to generate matching input strings up to a certain
    length, so then we can compare capture behavior against PCRE
    for those particular inputs.
    
    amend: gen tests
    silentbicycle committed Feb 16, 2023
    41c8e54
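
The walk described in the commit above can be illustrated with a small standalone sketch. This is not libfsm's `fsm_generate_matches`; the transition table, the `walk` and `count_matches` names, and the callback type are all hypothetical, hard-coded here for the regex `^a*b*$`. It only shows the general idea of a depth-first walk over a DFA that reports every accepted string up to a length limit.

```c
#include <assert.h>
#include <stddef.h>

/* Toy DFA for ^a*b*$ (hypothetical table, not libfsm's representation).
 * State 0: only a's seen so far; state 1: at least one b seen.
 * Both states are end states. */
enum { NUM_STATES = 2, NUM_SYMBOLS = 2, NO_EDGE = -1, MAX_LEN = 16 };

static const char alphabet[NUM_SYMBOLS] = { 'a', 'b' };
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    { 0, 1 },        /* state 0: 'a' -> 0, 'b' -> 1 */
    { NO_EDGE, 1 },  /* state 1: 'b' -> 1 */
};
static const int is_end[NUM_STATES] = { 1, 1 };

typedef void match_cb(const char *input, size_t len, void *opaque);

/* Depth-first walk: report the current prefix if the state is an end
 * state, then extend it along each outgoing edge until max_len. */
static void
walk(int state, char *buf, size_t len, size_t max_len,
    match_cb *cb, void *opaque)
{
    if (is_end[state]) {
        buf[len] = '\0';
        cb(buf, len, opaque);
    }
    if (len == max_len) { return; }
    for (size_t i = 0; i < NUM_SYMBOLS; i++) {
        int next = delta[state][i];
        if (next == NO_EDGE) { continue; }
        buf[len] = alphabet[i];
        walk(next, buf, len + 1, max_len, cb, opaque);
    }
}

/* Callback that just counts the generated matches. */
static void
count_cb(const char *input, size_t len, void *opaque)
{
    (void)input;
    (void)len;
    (*(size_t *)opaque)++;
}

/* Count accepted strings of length <= max_len (max_len < MAX_LEN). */
static size_t
count_matches(size_t max_len)
{
    char buf[MAX_LEN];
    size_t count = 0;
    assert(max_len < MAX_LEN);
    walk(0, buf, 0, max_len, count_cb, &count);
    return count;
}
```

A fuzzer in this style would replace `count_cb` with a callback that feeds each generated string to both matchers and compares the capture offsets.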
  5. Completely rework capture resolution.

    This is a big commit, unfortunately difficult to break apart
    further due to interface changes, metadata being passed through
    whole-FSM transformations, and so on. Sorry about that.
    
    - Delete code related to capture action metadata on edges.
      That approach made FSM transformations (determinisation,
      minimisation, etc.) considerably more expensive, and there
      were some corner cases that I wasn't able to get working
      correctly.
    
    - Switch to a somewhat simpler method, adapted from Russ Cox's
      "Regular Expression Matching: the Virtual Machine Approach".
      Since the capture resolution metadata (an opcode program
      for a virtual machine) is associated with individual end
      states, this combines cleanly when multiple regexes are
      unioned into a single large DFA that matches them all at once.
    
    - Add lots of capture regression tests, mostly from using libfsm's
      `fsm_generate_matches` and a fuzzer to compare behavior against
      PCRE. This brought many, many obscure cases to light.
    
    - Delete capture tests based on the old interface. The new one
      does not work with state machines built manually using libfsm's
      interfaces, only via compilation from regex.
    
    - Some performance improvements to trimming and minimisation,
      mostly due to better utilizing bit-parallelism in the edge
      set data structure.
    
    - Switch to using new ADTs in several places.
    
    amend: interface changes
    silentbicycle committed Feb 16, 2023
    d997649
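
To make the "opcode program" idea from the commit above concrete, here is a minimal sketch in the spirit of Russ Cox's article. It uses the article's simplest strategy (recursive backtracking) rather than whatever libfsm does internally, and the opcode names, `struct inst` layout, and `match_a_bstar_c` helper are assumptions made for illustration. The program is hand-compiled for `^a(b*)c$`; `SAVE` instructions record the input position into capture slots.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode set, loosely following Cox's article. */
enum op { OP_CHAR, OP_JMP, OP_SPLIT, OP_SAVE, OP_MATCH };

struct inst {
    enum op op;
    char c;       /* OP_CHAR: character to match */
    size_t x, y;  /* OP_JMP/OP_SPLIT: targets; OP_SAVE: slot in x */
};

/* Hand-compiled program for ^a(b*)c$.
 * Slots 0,1 hold capture 0; slots 2,3 hold capture 1. */
static const struct inst prog[] = {
    /* 0 */ { .op = OP_SAVE,  .x = 0 },
    /* 1 */ { .op = OP_CHAR,  .c = 'a' },
    /* 2 */ { .op = OP_SAVE,  .x = 2 },
    /* 3 */ { .op = OP_SPLIT, .x = 4, .y = 6 },  /* prefer looping (greedy) */
    /* 4 */ { .op = OP_CHAR,  .c = 'b' },
    /* 5 */ { .op = OP_JMP,   .x = 3 },
    /* 6 */ { .op = OP_SAVE,  .x = 3 },
    /* 7 */ { .op = OP_CHAR,  .c = 'c' },
    /* 8 */ { .op = OP_SAVE,  .x = 1 },
    /* 9 */ { .op = OP_MATCH },
};

/* Run the program from pc at input position pos. Returns 1 on match,
 * leaving capture offsets in saved[]; restores saved[] on failure. */
static int
run(size_t pc, const char *input, size_t pos, size_t *saved)
{
    const struct inst *in = &prog[pc];
    switch (in->op) {
    case OP_CHAR:
        if (input[pos] != in->c) { return 0; }  /* also fails at '\0' */
        return run(pc + 1, input, pos + 1, saved);
    case OP_JMP:
        return run(in->x, input, pos, saved);
    case OP_SPLIT:
        if (run(in->x, input, pos, saved)) { return 1; }
        return run(in->y, input, pos, saved);
    case OP_SAVE: {
        size_t old = saved[in->x];
        saved[in->x] = pos;
        if (run(pc + 1, input, pos, saved)) { return 1; }
        saved[in->x] = old;  /* undo on backtrack */
        return 0;
    }
    case OP_MATCH:
        return input[pos] == '\0';  /* anchored at end, like $ */
    }
    return 0;
}

static int
match_a_bstar_c(const char *input, size_t saved[4])
{
    assert(input != NULL);
    return run(0, input, 0, saved);
}
```

For the input `abbbbbc` this yields capture 0 at offsets 0,7 and capture 1 at offsets 1,6. Because such a program is self-contained, one can be attached to each end state independently, which is consistent with the commit's point about unioned DFAs.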
  6. ef08bab
  7. 7ef4a6c
  8. parser.act: Avoid crash in parser from '(*:'.

    See katef#386 on katef/libfsm.
    
    This is a workaround for a bug in the parser -- once the fuzzer
    finds it, it tends to get in the way of finding deeper issues.
    silentbicycle committed Feb 16, 2023
    9ed77c7
  9. a729833
  10. ast_rewrite: Make ast_rewrite's ALT case deduplication preserve order.

    Previously it sorted the ALT case subtrees to find and discard
    duplicates, but capture results are affected by ALT case ordering, so
    we need to preserve ordering while eliminating duplicates.
    silentbicycle committed Feb 16, 2023
    9089520
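
The difference between sort-based and order-preserving deduplication can be sketched as follows. This is not the `ast_rewrite` code: it works on plain strings instead of AST subtrees, and the `dedup_preserving_order` name is made up for this example. In PCRE earlier alternatives take priority, so an alternation such as `(b|a|b)` must keep `b` ahead of `a` after the duplicate is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Remove duplicate strings in place, keeping the first occurrence of
 * each and the relative order of the survivors. Returns the new count.
 * Quadratic, which is acceptable for a handful of alternatives. */
static size_t
dedup_preserving_order(const char **items, size_t count)
{
    size_t kept = 0;
    assert(items != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        int seen = 0;
        /* Compare only against entries already kept. */
        for (size_t j = 0; j < kept; j++) {
            if (strcmp(items[j], items[i]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            items[kept++] = items[i];
        }
    }
    return kept;
}
```

Sorting first would have produced `a, b, c` for the input `b, a, b, c, a`; the version above yields `b, a, c`.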
  11. Makefiles: use -std=c99

    silentbicycle committed Feb 16, 2023
    f584a9c
  12. 8b6ce44
  13. 6428f5e
  14. fuzz/target.c: Update fuzz harness with PCRE comparison modes.

    To build, uncomment `PCRE_CMP=1` in fuzz/Makefile. This depends on
    libpcre2-8.
    silentbicycle committed Feb 16, 2023
    16676e3
  15. re: Add flags for generating input, capture resolution, multi-regexes.

    This was pretty much the minimal amount I needed for manual testing.
    
    - `-FC`: No captures, do not build capture metadata even when dialects
      support it.
    
    - Generating input: `build/bin/re -rpcre -G10 '^a*b*$'`
    
    - Capture resolution: `build/bin/re -rpcre -R '^a(b*)c$' abbbbbc`
    
        -- 0: 0,7
        -- 1: 1,6
    
    - Multiple regexes: `build/bin/re -rpcre -pgC -Y file_with_one_regex_per_line`
    
    Capture resolution is not implemented for -y or multi-regex yet, because
    `fsm_exec_with_captures` needs all input buffered ahead of time and
    because multi-regex isn't passing along the capture bases and matching
    those up based on the result.
    silentbicycle committed Feb 16, 2023
    f753c72
  16. tests/*/Makefile: Add -FC (no captures) for some calls to RE.

    Several tests have nothing to do with captures; capture behavior
    is tested directly in `tests/captures/`.
    silentbicycle committed Feb 16, 2023
    719acc6
  17. efeae03

Commits on Mar 2, 2023

  1. internedstateset: If EXPENSIVE_CHECKS, add check for sorted input.

    Also, add a comment about how the iterators yield collated .to states,
    so we can avoid the overhead of sorting later.
    silentbicycle committed Mar 2, 2023
    1f9d673
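
A compile-time-gated check of this kind might look like the sketch below. Only the `EXPENSIVE_CHECKS` macro name comes from the commit; the `state_id` typedef, `is_sorted`, and `contains_state` are hypothetical stand-ins, not the internedstateset interface. The point is that a routine relying on sorted input can verify that assumption in debug builds without paying for it otherwise.

```c
#include <assert.h>
#include <stddef.h>

#ifndef EXPENSIVE_CHECKS
#define EXPENSIVE_CHECKS 0
#endif

typedef unsigned state_id;  /* hypothetical stand-in for a state ID type */

/* Return 1 if ids[] is in non-decreasing order, else 0. */
static int
is_sorted(const state_id *ids, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (ids[i - 1] > ids[i]) { return 0; }
    }
    return 1;
}

/* Binary search, which is only correct when ids[] is sorted. The
 * linear-time precondition check is compiled in only on request. */
static int
contains_state(const state_id *ids, size_t count, state_id id)
{
#if EXPENSIVE_CHECKS
    assert(is_sorted(ids, count));
#endif
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ids[mid] == id) { return 1; }
        if (ids[mid] < id) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return 0;
}
```

Building with `-DEXPENSIVE_CHECKS=1` enables the assertion; the default build skips it.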

Commits on Mar 6, 2023

  1. e9c59bc
  2. capture tests: SHOULD_SKIP -> SHOULD_REJECT_AS_UNSUPPORTED

    Update several skipped tests. These should now be run, and
    expect libfsm to reject them as unsupported. These will be
    fixed by the next commit.
    
    Also, fix the test runner's handling of unsupported inputs.
    silentbicycle committed Mar 6, 2023
    f3e69c3
  3. ast_analysis: Implement rejection for e.g. ^(($)|x)+$.

    Expand analysis to detect and reject this special case. It's not
    likely to be worth supporting, but was previously not identified
    correctly at compile-time.
    
    This has not been fuzzed yet, but with it all tests pass, including
    several that were previously set to SKIP.
    silentbicycle committed Mar 6, 2023
    5badf27

Commits on Mar 8, 2023

  1. 0fd0f12
  2. ast_analysis: Ensure unsupported code path is flagged as unsatisfiable.

    Add regression test, found via fuzzing.
    silentbicycle committed Mar 8, 2023
    fddcf6f

Commits on Mar 21, 2023

  1. f39528a