
No more segfault during regex parsing #451

Merged: 2 commits merged into devel from regex_segfault on Nov 4, 2024
Conversation

@koniksedy (Collaborator) commented Oct 30, 2024

This PR:

  • Eliminates a segmentation fault that occurred during regex parsing.
  • The fault was caused by a state renaming function that also renamed initial states which were neither final nor had any outgoing edges. Such states were never assigned a new name and kept the default placeholder (mata::nfa::Limits::max_state).
  • Fixes the segmentation fault reported in Segmentation fault when parsing regexes #437.

@koniksedy koniksedy requested review from jurajsic and Adda0 October 30, 2024 19:11
@Adda0 (Collaborator) left a comment:

If this were the .mata files parser, I would say implicitly trimming every automaton might not be the intention: you could not, for example, parse a template of an automaton which you would later modify. But for the regex parser this should not be the case, and you always want the most succinct representation of the regex. Under this assumption, the logic should be fine.

However, since we have effectively removed the renumber-states function, I would just call nfa::Nfa::trim() directly in place of re2parser::renumber_states(), but maybe even keep the original re2parser::renumber_states() (and fix it), or remove it entirely.

Have you checked the performance impact of this change? Was there any significant change in the performance of the regex parser?

@Adda0 (Collaborator) commented Oct 31, 2024

After discussing this with @koniksedy in person, we decided to remove the renumber_states() function and use only trim(). Trimming is slightly slower, but the performance decrease is not significant (see the GitHub Actions checks). Testing on larger regexes shows no performance degradation whatsoever.

@Adda0 (Collaborator) commented Oct 31, 2024

By the way, does it really fix the linked issue? It might be good to add one (or more) of the regexes from #437 to our unit test suite to ensure this issue is not encountered again.

@koniksedy (Collaborator, Author) replied:

It does solve the segmentation fault problem mentioned in #437. However, all regular expressions provided in #437 result in an empty automaton. This behavior has already been reported in issue #450.

@Adda0 (Collaborator) commented Oct 31, 2024

Right. Therefore, I would not close issue #437; let us keep it open until we resolve #450.

@koniksedy (Collaborator, Author) commented Oct 31, 2024

Performance results

single_param_jobs

metric                  Devel (avg)    Devel (med)      This (avg)      This (med)
--------------------  --------------  --------------  ----------------  ----------------
nfa-mintermization             0.011           0.002             0.012             0.002
nfa-parsing                    0.035           0.007             0.039             0.008
nfa-runtime                    0.036           0.000             0.037             0.000
nfa-trim                       0.000           0.000             0.000             0.000
unary-mintermization           0.010           0.002             0.010             0.002
unary-parsing                  0.033           0.007             0.034             0.007
unary-runtime                  0.294           0.010             0.309             0.010

double_param_jobs

metric                        Devel (avg)    Devel (med)      This (avg)      This (med)
--------------------------  --------------  --------------  ----------------  ----------------
binary-antichain_inclusion           0.002           0.000             0.002             0.000
binary-concatenation                 0.000           0.000             0.000             0.000
binary-intersection                  0.002           0.000             0.002             0.000
binary-mintermization                0.009           0.004             0.009             0.004
binary-naive_inclusion               0.100           0.001             0.106             0.001
binary-parsing                       0.030           0.013             0.034             0.016
binary-runtime                       0.127           0.010             0.136             0.010
binary-union                         0.000           0.000             0.000             0.000

The performance drop is negligible.

The set of larger regular expressions used for the performance test:

.*[sS][yY][sS][tT][eE][mM][pP][aA][tT][hH]\\=(([hH][tT]{2}[pP][sS]?)|([fF][tT][pP]))
[^\\f\\n\\r\\t\\v]{65}|[^\\f\\n\\r\\t\\v]+[\\f\\n\\r\\t\\v]+[^\\f\\n\\r\\t\\v]{65}|[^\\f\\n\\r\\t\\v]+[\\f\\n\\r\\t\\v]+[^\\f\\n\\r\\t\\v]+[\\f\\n\\r\\t\\v]+[^\\f\\n\\r\\t\\v]{65}
get (X.downloadX[ -~]*|X.supernode[ -~]|X.status[ -~]|X.network[ -~]*|X.files|X.hash\\=[0-9a-f]*X[ -~]*) httpX1.1|user-agent: kazaa|x-kazaa(-username|-network|-ip|-supernodeip|-xferid|-xferuid|tag)|^give [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]?[0-9]?[0-9]?
(\\*[\\x01\\x02].*\\x03\\x0b|\\*\\x01.?.?.?.?\\x01)|flapon|toc_signon.*0x
(\\x11\\x20\\x01...?\\x11|\\xfe\\xfd.?.?.?.?.?.?(\\x14\\x01\\x06|\\xff\\xff\\xff))|[\\]\\x01].?battlefield2
(\\x13bittorrent protocol|azver\\x01$|get Xscrape\\?info_hash\\=get Xannounce\\?info_hash\\=|get XclientXbitcometX|GET Xdata\\?fid\\=)|d1:ad2:id20:|\\x08'7P\\)[RP]
[a-z][a-z0-9\\-_]+|login: [\\x09-\\x0d -~]* name: [\\x09-\\x0d -~]* Directory:
get X.*icy-metadata:1|icy [1-5][0-9][0-9] [\\x09-\\x0d -~]*(content-type:audio|icy-)
([()]|get)(...?.?.?(reg|get|query)|.+User-Agent: (MozillaX4\\.0 \\(compatible; (MSIE 6\\.0; Windows NT 5\\.1;? ?\\)|MSIE 5\\.00; Windows 98\\))))|Keep-Alive\\x0d\\x0a\\x0d\\x0a[26]
\\x2f([defghilmnoqrstwz])(m(ookflolfctm\\x2fnmot\\.fmu|clvompycem\\x2fcen\\.vcn)|e(etbuviaebe\\x2feqv\\.bvv|mcndvwoemn\\x2flvv\\.jde)|s(fhfksjzsfu\\x2fahm\\.uqs|rvziadzvzr\\x2fsaei\\.vvt)|n(kxlvcob\\x2fkmpk\\.ibl|pgwtjgxwthx\\x2fbyb\\.xky|hirmvtg\\x2fggqh\\.kqh)|(wlpgskmv\\x2flwzo\\.qv|gdvsotuqwsg\\x2fdxt\\.hd)g|(twfofrfzlugq\\x2feve\\.qd|doarauzeraqf\\x2fvvv\\.ul|qisezhin\\x2fiqor\\.ym)v|fowclxccdxn\\x2fuxwn\\.ddy|lnzzlnbk\\x2fpkrm\\.fin|iufilfwulmfi\\x2friuf\\.lio|(hsdszhmoshh\\x2flhr\\.cn|oakmanympnw\\x2flnkd\\.pk)h|riggiymd\\x2fwdhi\\.vhi|zmnjgmomgbdz\\x2fzzmw\\.gzt)

Cumulative parsing time for this PR: ~37 ms
Cumulative parsing time for Devel: ~38 ms

@Adda0 (Collaborator) left a comment:
LGTM.

@Adda0 Adda0 merged commit 037ecfe into devel Nov 4, 2024
18 checks passed
@Adda0 Adda0 deleted the regex_segfault branch November 4, 2024 10:13