Add cdata codegen, with eager output support #30

silentbicycle · 2024-10-10T19:45:43Z

This is an alternative to -lc and -lvmc that avoids very expensive compilation when the resulting C output is very large. In this mode, most of the output is C data literals (a couple of struct tables), followed by a very small (~50 LOC) interpreter for the data. This is much faster to compile: for a data set I'm working with now, it takes 30 seconds to build, compared to several hours and/or gcc exhausting memory.
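As a rough illustration of the "data tables plus a small interpreter" shape, here is a minimal sketch. It is not the actual cdata output: the struct layout, the names (`sketch_match`, `states`, `edges`), and the example DFA (inputs beginning with "abc") are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the actual cdata format: one record per
 * state, with outgoing edges stored as a slice of a shared edge table. */
struct edge {
	uint8_t label; /* input byte that takes this edge */
	uint8_t dst;   /* destination state */
};

struct state {
	bool end;            /* true if this is an accepting state */
	uint8_t edge_offset; /* index of this state's first edge in edges[] */
	uint8_t edge_count;  /* number of edges for this state */
};

/* DFA for inputs that begin with "abc": 0 -a-> 1 -b-> 2 -c-> 3 (end). */
static const struct edge edges[] = {
	{ 'a', 1 },
	{ 'b', 2 },
	{ 'c', 3 },
};

static const struct state states[] = {
	{ false, 0, 1 },
	{ false, 1, 1 },
	{ false, 2, 1 },
	{ true,  3, 0 },
};

/* The interpreter stays the same size however large the tables grow,
 * which is why the compiler has far less work to do. */
static bool
sketch_match(const char *input, size_t len)
{
	size_t cur = 0;

	assert(input != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		const struct state *s = &states[cur];
		bool found = false;

		if (s->end) {
			return true; /* prefix matched; the rest is irrelevant */
		}
		for (size_t e = 0; e < s->edge_count; e++) {
			const struct edge *ed = &edges[s->edge_offset + e];
			if (ed->label == (uint8_t)input[i]) {
				cur = ed->dst;
				found = true;
				break;
			}
		}
		if (!found) {
			return false; /* no edge for this byte */
		}
	}
	return states[cur].end;
}
```

The generated code for -lc grows with the automaton; in a table-driven design only the initializers grow, and those are much cheaper for a C compiler to process.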

Generating output with comments enabled will include inline comments about the format, along with per-state comments showing labels, endids, and eager outputs. It will only generate code for endids and eager outputs if the DFA has them.

This is experimental. I expect the interfaces will change a bit in the near future, and I am still working on performance tuning.

There is some code to detect and reuse repeated runs of IDs in the output tables, but there is a bug leading to them not being terminated properly (possibly causing false positives), so it's currently disabled.
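To make the termination concern concrete, here is a small sketch of run reuse. It assumes, purely for illustration, that each run of IDs in a table is closed by a sentinel value; the names `SKETCH_END` and `sketch_find_run` are hypothetical and this is not the disabled code in the PR.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define SKETCH_END UINT_MAX /* hypothetical terminator value */

/* Return the offset of an existing copy of run[] that is immediately
 * followed by SKETCH_END, or (size_t)-1 if there is none.
 *
 * Checking the terminator matters: if table holds { 1, 2, 3, END } and
 * the new run is { 1, 2 }, offset 0 matches the IDs, but a reader that
 * scans from offset 0 to END would also report ID 3. */
static size_t
sketch_find_run(const unsigned *table, size_t used,
	const unsigned *run, size_t run_len)
{
	assert(run_len > 0);
	for (size_t i = 0; i + run_len < used; i++) {
		size_t j = 0;

		while (j < run_len && table[i + j] == run[j]) {
			j++;
		}
		if (j == run_len && table[i + run_len] == SKETCH_END) {
			return i; /* safe to reuse: same IDs, same end */
		}
	}
	return (size_t)-1;
}
```

Under this assumption only a run that matches a suffix of an existing run (terminator included) can be shared; a prefix match would yield extra IDs.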

To see a good example of the format, with comments, run:
build/bin/re -rpcre -lcdata -u '^abc'

Draft: This is currently targeting the sv/eager-outputs branch, because it depends on changes added there. Once that has been reviewed and merged I will incorporate any changes from its review and retarget this to main.


katef · 2024-10-12T15:09:56Z

fuzz/target.c

@@ -446,6 +446,8 @@ fuzz_eager_output(const uint8_t *data, size_t size)

 	size_t max_pattern_length = 0;

+	const unsigned seed = size == 0 ? 0 : data[0];


so I guess we'll srand() here

Yes, I'll add that before I switch from a draft PR.

This is now done in e133a74 on #29.
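A minimal sketch of the pattern discussed in this thread: the seed comes from the first byte of the fuzzer input (as in the diff), and the harness, not the library, calls `srand()`. The helper names `sketch_seed_from_input` and `sketch_first_rand` are hypothetical and are not part of the fuzzer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Derive a seed from the first input byte, or 0 for empty input,
 * mirroring the expression in the diff above. */
static unsigned
sketch_seed_from_input(const uint8_t *data, size_t size)
{
	assert(data != NULL || size == 0);
	return size == 0 ? 0 : data[0];
}

/* Hypothetical harness step: seed rand() once, then draw from it.
 * Seeding here keeps each fuzzer input reproducible, because the
 * same input always yields the same rand() sequence. */
static int
sketch_first_rand(const uint8_t *data, size_t size)
{
	srand(sketch_seed_from_input(data, size));
	return rand();
}
```

Keeping the `srand()` call in the harness means library code that consumes `rand()` does not reset global state behind the caller's back.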

katef · 2024-10-12T15:11:26Z

src/re/main.c

@@ -124,6 +124,7 @@ lang_name(const char *name, enum fsm_print_lang *fsm_lang, enum ast_print_lang *
 		{ "rust",       FSM_PRINT_RUST       },
 		{ "sh",         FSM_PRINT_SH         },
 		{ "vmc",        FSM_PRINT_VMC        },
+		{ "cdata",      FSM_PRINT_CDATA      },


and rx, and retest too please!

rx: Added in 3ea0898.

retest: Added in 7a935b4, and that found a couple bugs because it exercised cdata output more widely.

src/libfsm/print/cdata.c

fsm_generate_matches is no longer seeding `rand()` directly.

I confirmed that the callers actually check the return. These should probably use the alloc interface (with `f_realloc`), but the existing print callback typedefs don't seem to pass along the alloc handle anymore, since it was removed from fsm_options, so if that changes it will be in a different commit.

If the dst_table buffer is empty, add a 0.

Addressed merge conflicts.

While technically equivalent, this looks confusing. I'm not sure if it was a typo (thinking of `eager_output_buf.used > 0`) or some kind of search/replace artifact.

silentbicycle added 2 commits October 10, 2024 15:43

fuzzer: Add seed argument for fsm_generate_matches (interface change).

23e6142

katef reviewed Oct 12, 2024


katef requested changes Oct 12, 2024



silentbicycle added 3 commits October 12, 2024 11:38

Add srand(seed) to fuzzer harness.

4ea2588

fsm_generate_matches is no longer seeding `rand()` directly.

Add CDATA to rx.

3ea0898

silentbicycle mentioned this pull request Oct 12, 2024

Upstream sync: re_is_anchored and a few more misc. changes #31

Merged

silentbicycle added 6 commits October 12, 2024 12:50

cdata: Avoid "zero size arrays are an extension" warning.

4b61192

If the dst_table buffer is empty, add a 0.
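An array with no elements is not valid ISO C (compilers accept it only as an extension), so an emitter has to write something when a table is empty. The following is a minimal sketch of that padding step; `sketch_emit_table` and `sketch_append` are hypothetical helpers that write into a caller-supplied buffer, not the code in cdata.c.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Append formatted text at buf[*used]; returns 0 on error or truncation. */
static int
sketch_append(char *buf, size_t buf_size, size_t *used, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(*used < buf_size);
	va_start(ap, fmt);
	n = vsnprintf(buf + *used, buf_size - *used, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= buf_size - *used) {
		return 0;
	}
	*used += (size_t)n;
	return 1;
}

/* Write `static const unsigned NAME[] = { ... };` into buf.
 * Returns 1 on success, 0 if buf is too small. */
static int
sketch_emit_table(char *buf, size_t buf_size,
	const char *name, const unsigned *ids, size_t count)
{
	size_t used = 0;

	if (!sketch_append(buf, buf_size, &used,
		"static const unsigned %s[] = {", name)) {
		return 0;
	}
	if (count == 0) {
		/* Placeholder entry so the array is never zero-sized. */
		if (!sketch_append(buf, buf_size, &used, " 0")) {
			return 0;
		}
	}
	for (size_t i = 0; i < count; i++) {
		if (!sketch_append(buf, buf_size, &used, "%s %u",
			i == 0 ? "" : ",", ids[i])) {
			return 0;
		}
	}
	return sketch_append(buf, buf_size, &used, " };");
}
```

The placeholder is never read as long as no state refers to an offset in the empty table, so it costs one element and keeps the output portable.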

cdata: Fix output type for single endid in AMBIG_EARLIEST.

1d5c0a0

cdata: Fix rollover on (255,255).

32121f3

cdata: Treat '\' as non-printable, since it leads to line continuation.

4544091
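Background for this fix: in C, a backslash immediately followed by a newline splices the two lines, and that happens before comments are recognized, so a `//` comment ending in `\` swallows the following line. The sketch below shows one way to avoid echoing a literal backslash, assuming labels are printed into generated comments; `sketch_comment_printable` and `sketch_format_label` are hypothetical names, and the hex fallback is an assumption rather than the format cdata uses.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Treat '\\' like a non-printable byte so it is never echoed literally. */
static int
sketch_comment_printable(unsigned char c)
{
	return isprint(c) && c != '\\';
}

/* Render a label byte for use in a comment: quoted if safe, hex if not. */
static void
sketch_format_label(char *buf, size_t buf_size, unsigned char c)
{
	assert(buf_size >= 5); /* room for "0x5c" plus the terminator */
	if (sketch_comment_printable(c)) {
		snprintf(buf, buf_size, "'%c'", c);
	} else {
		snprintf(buf, buf_size, "0x%02x", (unsigned)c);
	}
}
```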

cdata: Ensure config->state_info is zeroed.

5f08778

Add cdata to retest. (This found a couple issues.)

7a935b4

Base automatically changed from sv/eager-outputs to main October 12, 2024 19:00

Merge branch 'f/main' into sv/add-cdata-codegen-with-eager-outputs

f0e58bc

Addressed merge conflicts.

silentbicycle marked this pull request as ready for review October 12, 2024 19:05

silentbicycle added 3 commits October 13, 2024 09:24

print: Add allocator handle to ir_print_f typedef.

9f90b01

cdata: Use passed-in alloc handle and its interface.

4fe0beb

Fix strange comparison on a bool.

6e9b7f1

While technically equivalent, this looks confusing. I'm not sure if it was a typo (thinking of `eager_output_buf.used > 0`) or some kind of search/replace artifact.

katef approved these changes Oct 13, 2024


silentbicycle merged commit 7cb37be into main Oct 13, 2024
346 checks passed

silentbicycle deleted the sv/add-cdata-codegen-with-eager-outputs branch October 13, 2024 15:55
