Store named capture groups in field table #43

Open
darrylabbate opened this issue Dec 4, 2022 · 3 comments
Labels
feature New feature or request language Language features/requests regex Regular expressions

@darrylabbate
Member

darrylabbate commented Dec 4, 2022

Numbered groups are already stored in the field table; named capture groups were never implemented.

Also, audit the behavior of $abc. Currently, abc would be treated as an expression (variable). To dereference the field table with a named group, you'd need to use a string literal (e.g. $'abc').

If the field table were named/aliased (like arg), you could cleanly dereference using match.group or match[n].

@darrylabbate darrylabbate added feature New feature or request language Language features/requests regex Regular expressions labels Dec 4, 2022
@darrylabbate darrylabbate modified the milestones: Riff 0.4, Riff 0.5 Dec 4, 2022
This was referenced Dec 26, 2022
@darrylabbate
Member Author

Also, audit the behavior of $abc. Currently, abc would be treated as an expression (variable). To dereference the field table with a named group, you'd need to use a string literal (e.g. $'abc').

This would be a breaking change, but logically it makes sense for $foo to correspond to the capture group foo.

@darrylabbate
Member Author

darrylabbate commented Dec 3, 2023

The named capture groups can be extracted from a compiled pattern (pcre2_code *) via pcre2_pattern_info().

  • PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the "name table" (PCRE2_SPTR)
  • PCRE2_INFO_NAMECOUNT returns the number of named capture groups (uint32_t)
  • PCRE2_INFO_NAMEENTRYSIZE returns the size of each entry in the name table (uint32_t); in the 8-bit library, this is the length of the longest capture group name + 3
    • First 2 bytes are the corresponding number (big endian) for the capture group
    • Each string is null-terminated

Example pattern and corresponding name table layout:

  (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
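The layout above can be walked with plain pointer arithmetic. The following is a minimal sketch, not Riff's implementation: the `group_name` struct and both function names are hypothetical, and the table is hard-coded from the example (with the `??` padding bytes zeroed) so the snippet does not depend on PCRE2. In real use, `table`, `count` and `entry_size` would be the values obtained from pcre2_pattern_info() with PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT and PCRE2_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one name-table entry (not part of PCRE2 or Riff). */
typedef struct {
    uint16_t    number; /* capture group number */
    const char *name;   /* points into the table; null-terminated */
} group_name;

/* Decode entry i of an 8-bit name table. */
static group_name name_table_entry(const uint8_t *table, uint32_t count,
                                   uint32_t entry_size, uint32_t i)
{
    assert(i < count && entry_size >= 3);
    const uint8_t *e = table + (size_t)i * entry_size;
    /* The name must be terminated inside its own entry. */
    assert(memchr(e + 2, 0, entry_size - 2) != NULL);

    group_name g;
    g.number = (uint16_t)((e[0] << 8) | e[1]); /* first 2 bytes, big endian */
    g.name   = (const char *)(e + 2);
    return g;
}

/* The table from the example: 4 entries of 8 bytes ("month" + 3). */
static const uint8_t example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0,
};

/* Print every entry as "name -> number". */
static void print_name_table(const uint8_t *table, uint32_t count,
                             uint32_t entry_size)
{
    for (uint32_t i = 0; i < count; i++) {
        group_name g = name_table_entry(table, count, entry_size, i);
        printf("%s -> %u\n", g.name, (unsigned)g.number);
    }
}
```

Calling `print_name_table(example_table, 4, 8)` prints `date -> 1`, `day -> 5`, `month -> 4` and `year -> 2`, one per line, matching the table's alphabetical order.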

Obvious approach:

  • Collect the capture group names upon pattern compilation
  • Extract captured substrings from the match data via pcre2_substring_copy_byname() upon pattern matching

Should look closely at the PCRE2 spec for duplicated group names before doing any optimizations with the number <-> name mapping.

@darrylabbate
Member Author

Should look closely at the PCRE2 spec for duplicated group names before doing any optimizations with the number <-> name mapping.


In an attempt to reduce confusion, PCRE2 does not allow the same group number to be associated with more than one name. [...] However, there is still scope for confusion. Consider this pattern:

(?|(?<AA>aa)|(bb))

Although the second group number 1 is not explicitly named, the name AA is still an alias for any group 1. Whether the pattern matches "aa" or "bb", a reference by name to group AA yields the matched string.

(source)


I.e., number -> name mapping should be safe if needed, even with PCRE2_DUPNAMES. Name -> number mapping isn't safe, since a name can correspond to multiple numbered groups.
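A rough sketch of that asymmetry, under stated assumptions: the function names and the `MAX_GROUPS` limit are hypothetical, and `dup_table` is a hand-built name table modelling what a pattern such as `(?<n>a)|(?<n>b)` might produce under PCRE2_DUPNAMES (two entries sharing the name `n`, entry size 1 + 3). It is not taken from Riff or from actual PCRE2 output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_GROUPS 16 /* arbitrary limit for this sketch */

/* Fill names[n] with the name of group n, or NULL if group n is unnamed.
 * Returns the number of table entries stored. */
static uint32_t map_numbers_to_names(const uint8_t *table, uint32_t count,
                                     uint32_t entry_size, const char **names)
{
    for (uint32_t n = 0; n < MAX_GROUPS; n++)
        names[n] = NULL;

    uint32_t stored = 0;
    for (uint32_t i = 0; i < count; i++) {
        const uint8_t *e = table + (size_t)i * entry_size;
        uint32_t n = ((uint32_t)e[0] << 8) | e[1]; /* big-endian number */
        if (n >= MAX_GROUPS)
            continue; /* out of range for this sketch */
        /* A group number is never associated with more than one name. */
        assert(names[n] == NULL);
        names[n] = (const char *)(e + 2);
        stored++;
    }
    return stored;
}

/* Count the group numbers that carry a given name.
 * A result above 1 means name -> number is ambiguous. */
static uint32_t numbers_for_name(const char **names, const char *name)
{
    uint32_t hits = 0;
    for (uint32_t n = 0; n < MAX_GROUPS; n++)
        if (names[n] != NULL && strcmp(names[n], name) == 0)
            hits++;
    return hits;
}

/* Assumed table for a duplicate-name pattern: groups 1 and 2 both named "n". */
static const uint8_t dup_table[2 * 4] = {
    0, 1, 'n', 0,
    0, 2, 'n', 0,
};
```

With this table, every group number resolves to at most one name, while `numbers_for_name(names, "n")` reports two candidates, which is why a cached name -> number lookup would need extra care.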

@darrylabbate darrylabbate pinned this issue Aug 26, 2024