Add cdata codegen, with eager output support #30
fuzz/target.c
@@ -446,6 +446,8 @@ fuzz_eager_output(const uint8_t *data, size_t size)

size_t max_pattern_length = 0;

const unsigned seed = size == 0 ? 0 : data[0];
so I guess we'll srand() here
Yes, I'll add that before I switch from a draft PR.
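For reference, a minimal sketch of what seeding from the fuzzer input could look like. The helper names `seed_from_input` and `first_random_for_input` are hypothetical and not part of this PR; only the `size == 0 ? 0 : data[0]` expression comes from the diff above. The point is that the same input bytes always produce the same `rand()` sequence, which keeps fuzzer failures reproducible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: derive a PRNG seed from the first input byte,
 * falling back to 0 for empty input (same expression as in the diff). */
unsigned
seed_from_input(const uint8_t *data, size_t size)
{
	assert(data != NULL || size == 0);
	return size == 0 ? 0 : data[0];
}

/* Hypothetical helper: seed rand() from the input, then draw one value.
 * Inputs that share a first byte yield the same value. */
int
first_random_for_input(const uint8_t *data, size_t size)
{
	srand(seed_from_input(data, size));
	return rand();
}
```

Calling `srand()` once at the top of the fuzz entry point, before any code that uses `rand()`, would be enough; the second helper only exists to make the determinism easy to check.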
@@ -124,6 +124,7 @@ lang_name(const char *name, enum fsm_print_lang *fsm_lang, enum ast_print_lang *
{ "rust", FSM_PRINT_RUST },
{ "sh", FSM_PRINT_SH },
{ "vmc", FSM_PRINT_VMC },
{ "cdata", FSM_PRINT_CDATA },
and rx, and retest too please!
fsm_generate_matches is no longer seeding `rand()` directly.
I confirmed that the callers actually check the return. These should probably use the alloc interface (with `f_realloc`), but the existing print callback typedefs don't seem to pass along the alloc handle anymore, since it was removed from fsm_options, so if that changes it will be in a different commit.
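As an illustration of the pattern being discussed (plain `realloc` with the failure reported to the caller), here is a minimal sketch. The `struct id_buf` type and `id_buf_append` function are hypothetical names for this example, not identifiers from the PR, and the sketch uses libc `realloc` rather than the `f_realloc` alloc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable buffer of IDs. */
struct id_buf {
	unsigned *ids;
	size_t used;
	size_t ceil;
};

/* Append one ID, doubling the capacity when full.
 * Returns 1 on success and 0 on allocation failure; on failure the
 * buffer is left unchanged, so the caller can clean up and bail out. */
int
id_buf_append(struct id_buf *buf, unsigned id)
{
	assert(buf != NULL);

	if (buf->used == buf->ceil) {
		const size_t nceil = buf->ceil == 0 ? 8 : 2 * buf->ceil;
		unsigned *nids = realloc(buf->ids, nceil * sizeof *nids);
		if (nids == NULL) {
			return 0;
		}
		buf->ids = nids;
		buf->ceil = nceil;
	}

	buf->ids[buf->used++] = id;
	return 1;
}
```

A caller would write `if (!id_buf_append(&buf, id)) { goto cleanup; }`, which is the kind of return check the commit message refers to.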
If the dst_table buffer is empty, add a 0.
Addressed merge conflicts.
While technically equivalent, this looks confusing. I'm not sure if it was a typo (thinking of `eager_output_buf.used > 0`) or some kind of search/replace artifact.
This is an alternative to -lc and -lvmc that avoids very expensive compilation when the resulting C output is quite large. In this mode, most of the output is C data literals (a couple of struct tables), followed by a very small (~50 LOC) interpreter for the data. This is much faster to compile: for a data set I'm working with now, it takes 30 seconds to build, compared to several hours and/or gcc exhausting memory.
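To give a rough idea of the "data tables plus small interpreter" shape, here is a hand-written sketch. It is not the actual -lcdata output: the struct layout, the field names, and the `match_abc` function are assumptions made up for illustration, and the DFA simply accepts the exact string "abc".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table layout: each state owns a slice of the edge table. */
struct edge { unsigned char label; uint32_t dst; };
struct state { uint32_t edge_offset; uint32_t edge_count; int is_end; };

static const struct edge edges[] = {
	{ 'a', 1 }, /* state 0 */
	{ 'b', 2 }, /* state 1 */
	{ 'c', 3 }, /* state 2 */
};

static const struct state states[] = {
	{ 0, 1, 0 },
	{ 1, 1, 0 },
	{ 2, 1, 0 },
	{ 3, 0, 1 }, /* end state, no outgoing edges */
};

/* The interpreter: walk the tables one input byte at a time.
 * Returns 1 if the input ends in an end state, 0 otherwise. */
int
match_abc(const char *input, size_t len)
{
	uint32_t cur = 0;
	assert(input != NULL || len == 0);

	for (size_t i = 0; i < len; i++) {
		const struct state *s = &states[cur];
		uint32_t e;
		for (e = 0; e < s->edge_count; e++) {
			const struct edge *ed = &edges[s->edge_offset + e];
			if (ed->label == (unsigned char)input[i]) {
				cur = ed->dst;
				break;
			}
		}
		if (e == s->edge_count) {
			return 0; /* no edge for this byte */
		}
	}

	return states[cur].is_end;
}
```

The compiler only has to parse array initializers and one short loop, rather than a large generated switch or goto-based function, which is where the build time savings would come from.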
Generating output with comments enabled will include inline comments about the format, along with per-state comments showing labels, endids, and eager outputs. It will only generate code for endids and eager outputs if the DFA has them.
This is experimental. I expect the interfaces will change a bit in the near future, and I am still working on performance tuning.
There is some code to detect and reuse repeated runs of IDs in the output tables, but a bug leaves the reused runs improperly terminated (possibly causing false positives), so it is currently disabled.
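For context, a minimal sketch of how run reuse with explicit termination could work. This is not the disabled code from the PR: `struct id_pool`, `pool_intern`, the fixed `POOL_MAX` capacity, and the choice of 0 as the terminator (assuming real IDs are nonzero) are all assumptions for illustration. An existing run is only reused when the candidate is followed by the terminator; without that check, a reader starting at the reused offset would run on into unrelated IDs, which is the kind of false positive described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_MAX 64
#define RUN_END 0u /* hypothetical terminator; real IDs assumed nonzero */

struct id_pool {
	unsigned ids[POOL_MAX];
	size_t used;
};

/* Return the offset of a terminated copy of run[0..n), appending a new
 * copy if none exists. Returns (size_t)-1 if the pool is full. */
size_t
pool_intern(struct id_pool *pool, const unsigned *run, size_t n)
{
	assert(pool != NULL && run != NULL);

	/* A candidate matches only if all n IDs agree AND the terminator
	 * follows immediately, so suffixes of earlier runs can be shared
	 * but prefixes cannot. */
	for (size_t i = 0; i + n < pool->used; i++) {
		if (memcmp(&pool->ids[i], run, n * sizeof run[0]) == 0
		    && pool->ids[i + n] == RUN_END) {
			return i;
		}
	}

	if (pool->used + n + 1 > POOL_MAX) {
		return (size_t)-1;
	}

	const size_t offset = pool->used;
	memcpy(&pool->ids[offset], run, n * sizeof run[0]);
	pool->ids[offset + n] = RUN_END;
	pool->used += n + 1;
	return offset;
}
```

With the pool holding `1 2 3 END`, interning `2 3` returns offset 1 (a shared suffix), while interning `1 2` appends a new run, because the existing `1 2` is followed by `3` rather than the terminator.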
To see a good example of the format, with comments, run:
build/bin//re -rpcre -lcdata -u '^abc'
Draft: This is currently targeting the sv/eager-outputs branch, because it depends on changes added there. Once that has been reviewed and merged I will incorporate any changes from its review and retarget this to main.