
No more segfault during regex parsing #451

Merged: 2 commits merged into devel from regex_segfault on Nov 4, 2024
Conversation

@koniksedy (Collaborator) commented Oct 30, 2024

This PR:

  • Eliminates a segmentation fault that occurred during regex parsing.
  • The fault was caused by a state renaming function that also renamed initial states which were neither final nor had any outgoing edges. Such states were never assigned a new name and kept the default placeholder (mata::nfa::Limits::max_state).
  • Fixes the segmentation fault reported in Segmentation fault when parsing regexes #437.

@koniksedy koniksedy requested review from jurajsic and Adda0 October 30, 2024 19:11
@Adda0 (Collaborator) left a comment:

If this were the .mata files parser, I would say implicitly trimming every automaton might not be the intention: you could not, for example, parse a template of an automaton which you would later modify. But for the regex parser this should not be the case, and you always want the most succinct representation of the regex. Under this assumption, the logic should be fine.

However, since we have effectively removed the renumber-states function, I would just call nfa::Nfa::trim() directly in place of re2parser::renumber_states(), but maybe even keep the original re2parser::renumber_states() (and fix it), or remove it entirely.

Have you checked the performance impact of this change? Was there any significant change in the performance of the regex parser?

@Adda0 (Collaborator) commented Oct 31, 2024

After discussing this with @koniksedy in person, we decided to remove the renumber_states() function and use only trim(). Trimming is slightly slower, but the performance decrease is not significant (see the GitHub Actions checks). Testing on larger regexes shows no performance degradation whatsoever.

@Adda0 (Collaborator) commented Oct 31, 2024

By the way, does it really fix the linked issue? It might be good to add one (or more) of the regexes from #437 to our unit test suite to ensure this issue is not encountered again.

@koniksedy (Collaborator, Author) replied:

It does solve the segmentation fault problem mentioned in #437. However, all regular expressions provided in #437 result in an empty automaton. This behavior has already been reported in issue #450.

@Adda0 (Collaborator) commented Oct 31, 2024

Right. Therefore, I would not close issue #437; let us keep it open until we resolve #450.

@koniksedy (Collaborator, Author) commented Oct 31, 2024

Performance results

single_param_jobs

metric                  Devel (avg)    Devel (med)      This (avg)      This (med)
--------------------  --------------  --------------  ----------------  ----------------
nfa-mintermization             0.011           0.002             0.012             0.002
nfa-parsing                    0.035           0.007             0.039             0.008
nfa-runtime                    0.036           0.000             0.037             0.000
nfa-trim                       0.000           0.000             0.000             0.000
unary-mintermization           0.010           0.002             0.010             0.002
unary-parsing                  0.033           0.007             0.034             0.007
unary-runtime                  0.294           0.010             0.309             0.010

double_param_jobs

metric                        Devel (avg)    Devel (med)      This (avg)      This (med)
--------------------------  --------------  --------------  ----------------  ----------------
binary-antichain_inclusion           0.002           0.000             0.002             0.000
binary-concatenation                 0.000           0.000             0.000             0.000
binary-intersection                  0.002           0.000             0.002             0.000
binary-mintermization                0.009           0.004             0.009             0.004
binary-naive_inclusion               0.100           0.001             0.106             0.001
binary-parsing                       0.030           0.013             0.034             0.016
binary-runtime                       0.127           0.010             0.136             0.010
binary-union                         0.000           0.000             0.000             0.000

The performance drop is negligible.

The set of larger regular expressions used for the performance test:

.*[sS][yY][sS][tT][eE][mM][pP][aA][tT][hH]\\=(([hH][tT]{2}[pP][sS]?)|([fF][tT][pP]))
[^\\f\\n\\r\\t\\v]{65}|[^\\f\\n\\r\\t\\v]+[\\f\\n\\r\\t\\v]+[^\\f\\n\\r\\t\\v]{65}|[^\\f\\n\\r\\t\\v]+[\\f\\n\\r\\t\\v]+[^\\f\\n\\r\\t\\v]+[\\f\\n\\r\\t\\v]+[^\\f\\n\\r\\t\\v]{65}
get (X.downloadX[ -~]*|X.supernode[ -~]|X.status[ -~]|X.network[ -~]*|X.files|X.hash\\=[0-9a-f]*X[ -~]*) httpX1.1|user-agent: kazaa|x-kazaa(-username|-network|-ip|-supernodeip|-xferid|-xferuid|tag)|^give [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]?[0-9]?[0-9]?
(\\*[\\x01\\x02].*\\x03\\x0b|\\*\\x01.?.?.?.?\\x01)|flapon|toc_signon.*0x
(\\x11\\x20\\x01...?\\x11|\\xfe\\xfd.?.?.?.?.?.?(\\x14\\x01\\x06|\\xff\\xff\\xff))|[\\]\\x01].?battlefield2
(\\x13bittorrent protocol|azver\\x01$|get Xscrape\\?info_hash\\=get Xannounce\\?info_hash\\=|get XclientXbitcometX|GET Xdata\\?fid\\=)|d1:ad2:id20:|\\x08'7P\\)[RP]
[a-z][a-z0-9\\-_]+|login: [\\x09-\\x0d -~]* name: [\\x09-\\x0d -~]* Directory:
get X.*icy-metadata:1|icy [1-5][0-9][0-9] [\\x09-\\x0d -~]*(content-type:audio|icy-)
([()]|get)(...?.?.?(reg|get|query)|.+User-Agent: (MozillaX4\\.0 \\(compatible; (MSIE 6\\.0; Windows NT 5\\.1;? ?\\)|MSIE 5\\.00; Windows 98\\))))|Keep-Alive\\x0d\\x0a\\x0d\\x0a[26]
\\x2f([defghilmnoqrstwz])(m(ookflolfctm\\x2fnmot\\.fmu|clvompycem\\x2fcen\\.vcn)|e(etbuviaebe\\x2feqv\\.bvv|mcndvwoemn\\x2flvv\\.jde)|s(fhfksjzsfu\\x2fahm\\.uqs|rvziadzvzr\\x2fsaei\\.vvt)|n(kxlvcob\\x2fkmpk\\.ibl|pgwtjgxwthx\\x2fbyb\\.xky|hirmvtg\\x2fggqh\\.kqh)|(wlpgskmv\\x2flwzo\\.qv|gdvsotuqwsg\\x2fdxt\\.hd)g|(twfofrfzlugq\\x2feve\\.qd|doarauzeraqf\\x2fvvv\\.ul|qisezhin\\x2fiqor\\.ym)v|fowclxccdxn\\x2fuxwn\\.ddy|lnzzlnbk\\x2fpkrm\\.fin|iufilfwulmfi\\x2friuf\\.lio|(hsdszhmoshh\\x2flhr\\.cn|oakmanympnw\\x2flnkd\\.pk)h|riggiymd\\x2fwdhi\\.vhi|zmnjgmomgbdz\\x2fzzmw\\.gzt)

Cumulative parsing time for this PR: ~37 ms
Cumulative parsing time for Devel: ~38 ms

@Adda0 (Collaborator) left a comment:
LGTM.

@Adda0 Adda0 merged commit 037ecfe into devel Nov 4, 2024
18 checks passed
@Adda0 Adda0 deleted the regex_segfault branch November 4, 2024 10:13