Fast regex in Rust for Apache Arrow, compiled to WASM

nomic-ai/wasm-arrow

Warmbat

WebAssembly Rust module for batchwise Arrow transformations

I have been looking for a fast regular expression library in JavaScript that runs on Apache Arrow for a few years. Arrow stores strings as UTF-8, and the standard ways of decoding UTF-8 strings so that native JS regexes can be used on them are excruciatingly slow. Instead, for the pubmed explorer, I wrote some weird custom code that matches a byte at a time and doesn't support full regular expressions.

Finally I got fed up, and just spent 90 minutes going back and forth with ChatGPT to get something that just does the whole thing with basic Rust libraries. It's just about as fast.

The central goal here is speed, so the module takes its input as four arguments:

  1. A Uint8Array that represents a large number of strings, concatenated together.
  2. A Uint32Array that represents the offsets of the strings in the first array.
  3. The regex to match, as a string.
  4. A boolean that indicates whether to use case-insensitive matching.

(1) and (2) are Arrow-formatted data; we can get things into Rust with minimal copying/conversion this way.

The return value is a Float32Array that represents the number of matches in each string for the regex.
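
To make the input and output layout concrete, here is a minimal sketch in plain TypeScript. It is not this module's API and not its implementation: `packStrings` and `countMatchesReference` are hypothetical names, and the reference function uses exactly the slow decode-then-match path that the WASM module is meant to replace. It also assumes the usual Arrow convention that the offsets array holds one more entry than there are strings, so that string `i` occupies bytes `offsets[i]` up to `offsets[i + 1]`.

```typescript
// Hypothetical helper (not part of this module): pack JS strings into the
// layout described above, i.e. one concatenated UTF-8 buffer plus offsets.
// Assumes the Arrow convention of n + 1 offsets for n strings.
function packStrings(strings: string[]): { data: Uint8Array; offsets: Uint32Array } {
  const encoder = new TextEncoder();
  const encoded = strings.map((s) => encoder.encode(s));
  const offsets = new Uint32Array(strings.length + 1);
  let total = 0;
  encoded.forEach((bytes, i) => {
    offsets[i] = total; // byte position where string i starts
    total += bytes.length;
  });
  offsets[strings.length] = total; // end of the last string
  const data = new Uint8Array(total);
  encoded.forEach((bytes, i) => data.set(bytes, offsets[i]));
  return { data, offsets };
}

// Hypothetical slow reference with the same four arguments and the same
// return shape: decode each slice and count matches with a native RegExp.
// Note that JS regex syntax is not identical to the Rust regex syntax.
function countMatchesReference(
  data: Uint8Array,
  offsets: Uint32Array,
  pattern: string,
  caseInsensitive: boolean
): Float32Array {
  const decoder = new TextDecoder("utf-8");
  const re = new RegExp(pattern, caseInsensitive ? "gi" : "g");
  const counts = new Float32Array(Math.max(0, offsets.length - 1));
  for (let i = 0; i < counts.length; i++) {
    // subarray creates a view, not a copy; decode() is the expensive step.
    const text = decoder.decode(data.subarray(offsets[i], offsets[i + 1]));
    const matches = text.match(re);
    counts[i] = matches === null ? 0 : matches.length;
  }
  return counts;
}
```

For example, packing `["Arrow arrow", "no match", "ARROW"]` gives offsets `[0, 11, 19, 24]`, and counting matches of `arrow` yields `[2, 0, 1]` with case-insensitive matching and `[1, 0, 0]` without it.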
