I have been looking for a fast regular expression library in JavaScript that runs on Apache Arrow for a few years. Arrow uses UTF-8 encoded strings, and the standard ways of decoding UTF-8 strings to use native JS regexes are excruciatingly slow. Instead, for the pubmed explorer, I wrote some weird custom code that matches a byte at a time and doesn't support full regular expressions.
Finally I got fed up, and just spent 90 minutes going back and forth with ChatGPT to get something that works and just does the whole thing with basic Rust libraries. It's just about as fast as the custom code.
The central goal here is speed, so it takes everything as four arguments:
- A Uint8Array that represents a large number of strings, concatenated together.
- A Uint32Array that represents the offsets of the strings in the first array.
- The regex to match, as a string.
- A boolean that indicates whether to use case-insensitive matching.
The first two arguments are Arrow-formatted data; passing them this way gets things into Rust with minimal copying or conversion.
The return value is a Float32Array that represents the number of matches in each string for the regex.
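To make the calling convention concrete, here is a minimal sketch in plain JavaScript of the four inputs and the returned counts. It is not the Rust implementation: it is the slow decode-then-match baseline, written only to show the data layout. The names `encodeStrings` and `countMatchesReference` are hypothetical, and the sketch assumes the offsets array follows the Arrow layout of n + 1 entries for n strings.

```javascript
// Hypothetical helper: pack JS strings into Arrow-style buffers,
// i.e. one concatenated UTF-8 byte buffer plus n + 1 offsets (assumed layout).
function encodeStrings(strings) {
  const encoder = new TextEncoder();
  const chunks = strings.map((s) => encoder.encode(s));
  const offsets = new Uint32Array(strings.length + 1);
  chunks.forEach((chunk, i) => {
    offsets[i + 1] = offsets[i] + chunk.length;
  });
  const bytes = new Uint8Array(offsets[strings.length]);
  chunks.forEach((chunk, i) => bytes.set(chunk, offsets[i]));
  return { bytes, offsets };
}

// Hypothetical reference implementation of the four-argument interface.
// It decodes every string and uses a native RegExp, which is the slow path
// that the Rust version is meant to avoid.
function countMatchesReference(bytes, offsets, pattern, caseInsensitive) {
  const decoder = new TextDecoder("utf-8");
  const regex = new RegExp(pattern, caseInsensitive ? "giu" : "gu");
  const counts = new Float32Array(offsets.length - 1);
  for (let i = 0; i < counts.length; i++) {
    // subarray creates a view, not a copy, of the bytes for string i.
    const text = decoder.decode(bytes.subarray(offsets[i], offsets[i + 1]));
    let n = 0;
    for (const _match of text.matchAll(regex)) n++;
    counts[i] = n;
  }
  return counts;
}

const { bytes, offsets } = encodeStrings(["Cat cat", "dog", "", "café CAT"]);
const counts = countMatchesReference(bytes, offsets, "cat", true);
console.log(Array.from(offsets)); // [0, 7, 10, 10, 19]
console.log(Array.from(counts)); // [2, 0, 0, 1]
```

Note that the offsets count bytes, not characters: "café CAT" takes 9 bytes because "é" is two bytes in UTF-8. A baseline like this is also handy for checking that a faster implementation returns the same counts.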