
Commit e4f8bc8

Major literal optimization refactoring.
The principal change in this commit is a complete rewrite of how literals are detected from a regular expression. In particular, we now traverse the abstract syntax to discover literals instead of the compiled byte code. This permits more tuneable control over which and how many literals are extracted, and is now exposed in the `regex-syntax` crate so that others can benefit from it.

Other changes in this commit:

* The Boyer-Moore algorithm was rewritten to use my own concoction based on frequency analysis. We end up regressing slightly on a couple of benchmarks because of this, but gain in some others and in general should be faster in a broader range of cases. (Principally because we try to run `memchr` on the rarest byte in a literal.) This should also greatly improve handling of non-Western text.
* A "reverse suffix" literal optimization was added. That is, if suffix literals exist but no prefix literals exist, then we can quickly scan for suffix matches and then run the DFA in reverse to find matches. (I'm not aware of any other regex engine that does this.)
* The mutex-based pool has been replaced with a spinlock-based pool (from the new `mempool` crate). This reduces some amount of constant overhead and improves several benchmarks that either search short haystacks or find many matches in long haystacks.
* Search parameters have been refactored.
* RegexSet can now contain 0 or more regular expressions (previously, it could only contain 2 or more). The InvalidSet error variant is now deprecated.
* A bug in computing start states was fixed. Namely, the DFA assumed the start state was always the first instruction, which is trivially wrong for an expression like `^☃$`. This bug persisted because it typically occurred when a literal optimization would otherwise run.
* A new CLI tool, regex-debug, has been added as a non-published sub-crate. The CLI tool can report various facts about a regular expression, such as its AST, its compiled byte code or its detected literals.

Closes #96, #188, #189
1 parent 1f3454e commit e4f8bc8
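The "reverse suffix" idea from the commit message can be illustrated with a toy. This is only a sketch under stated assumptions, not the code in this commit: the pattern is hard-wired to `[a-z]+ing`, the function name `find_start_reverse_suffix` is hypothetical, `str::find` stands in for the fast suffix scan, and a hand-written backward loop stands in for running the DFA in reverse. It reports only where the leftmost match starts.

```rust
/// Hypothetical helper (not part of the regex crate): returns the start
/// offset of the leftmost match of the fixed pattern `[a-z]+ing`.
fn find_start_reverse_suffix(haystack: &str) -> Option<usize> {
    // Assumption: the only suffix literal extracted from the pattern.
    const SUFFIX: &str = "ing";
    let bytes = haystack.as_bytes();
    let mut from = 0;
    // Step 1: quickly scan forward for the next occurrence of the suffix.
    while let Some(rel) = haystack[from..].find(SUFFIX) {
        let suffix_at = from + rel;
        // Step 2: walk backwards from the suffix over `[a-z]` bytes. This
        // loop plays the role of the reverse DFA for this one pattern.
        let mut start = suffix_at;
        while start > 0 && bytes[start - 1].is_ascii_lowercase() {
            start -= 1;
        }
        // `[a-z]+` needs at least one byte before the suffix.
        if start < suffix_at {
            return Some(start);
        }
        // No match ends at this occurrence; resume after its first byte.
        from = suffix_at + 1;
    }
    None
}
```

For the haystack `"xx singing"` the suffix is first seen at offset 4 and the backward walk stops at the space, so the reported start is 3. Finding the end of the match would still need a forward pass, which this sketch leaves out.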


48 files changed, with 3538 additions and 1227 deletions.

Cargo.toml

Lines changed: 2 additions & 0 deletions
@@ -17,6 +17,8 @@ finite automata and guarantees linear time matching on all inputs.
 aho-corasick = "0.5"
 # For skipping along search text quickly when a leading byte is known.
 memchr = "0.1"
+# For managing regex caches quickly across multiple threads.
+mempool = "0.2"
 # For parsing regular expressions.
 regex-syntax = { path = "regex-syntax", version = "0.3.0" }
 # For compiling UTF-8 decoding into automata.

HACKING.md

Lines changed: 38 additions & 26 deletions
@@ -32,7 +32,8 @@ The code to find prefixes and search for prefixes is in src/literals.rs. When
 more than one literal prefix is found, we fall back to an Aho-Corasick DFA
 using the aho-corasick crate. For one literal, we use a variant of the
 Boyer-Moore algorithm. Both Aho-Corasick and Boyer-Moore use `memchr` when
-appropriate.
+appropriate. The Boyer-Moore variant in this library also uses elementary
+frequency analysis to choose the right byte to run `memchr` with.

 Of course, detecting prefix literals can only take us so far. Not all regular
 expressions have literal prefixes. To remedy this, we try another approach to
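As an aside on the hunk above, the "frequency analysis" heuristic can be sketched in a few lines. Everything here is an assumption made for illustration: the `rank` table is made up (the library's real table is not shown in this diff), `rarest_offset` and `find` are hypothetical names, and `Iterator::position` stands in for `memchr`. The idea is to scan for the needle byte expected to be rarest, then verify the whole needle around each candidate.

```rust
/// Made-up frequency rank: a lower value means the byte is assumed rarer.
fn rank(b: u8) -> u8 {
    match b {
        b' ' | b'e' | b't' | b'a' | b'o' => 255,
        b'a'..=b'z' => 200,
        b'A'..=b'Z' => 100,
        _ => 50,
    }
}

/// Offset in `needle` of the byte assumed to be rarest (first one on ties).
fn rarest_offset(needle: &[u8]) -> usize {
    needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| rank(b))
        .map(|(i, _)| i)
        .unwrap_or(0)
}

/// Leftmost occurrence of `needle` in `haystack`, driven by the rare byte.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    let off = rarest_offset(needle);
    let rare = needle[off];
    let last_start = haystack.len() - needle.len();
    // The rare byte of a match at `start` sits at `start + off`.
    let mut pos = off;
    while pos <= last_start + off {
        // Stand-in for memchr: next occurrence of the rare byte.
        let i = pos + haystack[pos..].iter().position(|&b| b == rare)?;
        let start = i - off;
        if start > last_start {
            return None;
        }
        // Verify the full needle around the candidate position.
        if &haystack[start..start + needle.len()] == needle {
            return Some(start);
        }
        pos = i + 1;
    }
    None
}
```

With the needle `fox Z`, the scan looks for `Z` rather than `f`, so common leading bytes in the haystack never trigger a verification step.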
@@ -53,10 +54,12 @@ text results in at most one new DFA state. It is made fast by caching states.
 DFAs are susceptible to exponential state blow up (where the worst case is
 computing a new state for every input byte, regardless of what's in the state
 cache). To avoid using a lot of memory, the lazy DFA uses a bounded cache. Once
-the cache is full, it is wiped and state computation starts over again.
+the cache is full, it is wiped and state computation starts over again. If the
+cache is wiped too frequently, then the DFA gives up and searching falls back
+to one of the aforementioned algorithms.

-All of the above matching engines expose precisely the matching semantics. This
-is indeed tested. (See the section below about testing.)
+All of the above matching engines expose precisely the same matching semantics.
+This is indeed tested. (See the section below about testing.)

 The following sub-sections describe the rest of the library and how each of the
 matching engines are actually used.
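The wipe-then-give-up behaviour described in the hunk above can be modelled with a small bounded map. This is a sketch under assumptions, not the lazy DFA's real data structure: `StateCache`, `capacity` and `max_wipes` are hypothetical names, a state is keyed by a plain `Vec<usize>`, and "wiped too frequently" is simplified to a fixed limit on the total number of wipes.

```rust
use std::collections::HashMap;

/// Hypothetical bounded cache mapping a state key to a small state id.
struct StateCache {
    states: HashMap<Vec<usize>, usize>,
    capacity: usize,
    wipes: usize,
    /// Assumption: the give-up rule is modelled as a fixed wipe budget.
    max_wipes: usize,
}

impl StateCache {
    fn new(capacity: usize, max_wipes: usize) -> StateCache {
        StateCache { states: HashMap::new(), capacity, wipes: 0, max_wipes }
    }

    /// Returns the id for `key`, computing a new one on a miss. `None`
    /// means the cache has given up and the caller should fall back to
    /// another matching engine.
    fn get_or_insert(&mut self, key: Vec<usize>) -> Option<usize> {
        if let Some(&id) = self.states.get(&key) {
            return Some(id);
        }
        if self.states.len() >= self.capacity {
            if self.wipes >= self.max_wipes {
                return None;
            }
            // The cache is full: wipe it and start state computation over.
            self.states.clear();
            self.wipes += 1;
        }
        let id = self.states.len();
        self.states.insert(key, id);
        Some(id)
    }
}
```

A caller would treat `None` as the signal to rerun the search with one of the other engines, which is the fallback the new HACKING.md text mentions.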
@@ -70,6 +73,9 @@ encountered. Parsing is done in a separate crate so that others may benefit
 from its existence, and because it is relatively divorced from the rest of the
 regex library.

+The regex-syntax crate also provides sophisticated support for extracting
+prefix and suffix literals from regular expressions.
+
 ### Compilation

 The compiler is in src/compile.rs. The input to the compiler is some abstract
@@ -162,7 +168,7 @@ knows what the caller wants. Using this information, we can determine which
 engine (or engines) to use.

 The logic for choosing which engine to execute is in src/exec.rs and is
-documented on the Exec type. Exec values collection regular expression
+documented on the Exec type. Exec values contain regular expression
 Programs (defined in src/prog.rs), which contain all the necessary tidbits
 for actually executing a regular expression on search text.

@@ -172,6 +178,14 @@ of src/exec.rs by far is the execution of the lazy DFA, since it requires a
 forwards and backwards search, and then falls back to either the NFA algorithm
 or backtracking if the caller requested capture locations.

+The parameterization of every search is defined in src/params.rs. Among other
+things, search parameters provide storage for recording capture locations and
+matches (for regex sets). The existence and nature of storage is itself a
+configuration for how each matching engine behaves. For example, if no storage
+for capture locations is provided, then the matching engines can give up as
+soon as a match is witnessed (which may occur well before the leftmost-first
+match).
+
 ### Programs

 A regular expression program is essentially a sequence of opcodes produced by
@@ -268,48 +282,46 @@ N.B. To run tests for the `regex!` macro, use:

 The benchmarking in this crate is made up of many micro-benchmarks. Currently,
 there are two primary sets of benchmarks: the benchmarks that were adopted at
-this library's inception (in `benches/bench.rs`) and a newer set of benchmarks
+this library's inception (in `benches/src/misc.rs`) and a newer set of benchmarks
 meant to test various optimizations. Specifically, the latter set contain some
-analysis and are in `benches/bench_sherlock.rs`. Also, the latter set are all
+analysis and are in `benches/src/sherlock.rs`. Also, the latter set are all
 executed on the same lengthy input whereas the former benchmarks are executed
 on strings of varying length.

 There is also a smattering of benchmarks for parsing and compilation.

+Benchmarks are in a separate crate so that its dependencies can be managed
+separately from the main regex crate.
+
 Benchmarking follows a similarly wonky setup as tests. There are multiple
 entry points:

-* `bench_native.rs` - benchmarks the `regex!` macro
-* `bench_dynamic.rs` - benchmarks `Regex::new`
-* `bench_dynamic_nfa.rs` benchmarks `Regex::new`, forced to use the NFA
-algorithm on every regex. (N.B. This can take a few minutes to run.)
+* `bench_rust_plugin.rs` - benchmarks the `regex!` macro
+* `bench_rust.rs` - benchmarks `Regex::new`
+* `bench_rust_bytes.rs` benchmarks `bytes::Regex::new`
 * `bench_pcre.rs` - benchmarks PCRE
+* `bench_onig.rs` - benchmarks Oniguruma

-The PCRE benchmarks exist as a comparison point to a mature regular expression
-library. In general, this regex library compares favorably (there are even a
-few benchmarks that PCRE simply runs too slowly on or outright can't execute at
-all). I would love to add other regular expression library benchmarks
-(especially RE2), but PCRE is the only one with reasonable bindings.
+The PCRE and Oniguruma benchmarks exist as a comparison point to a mature
+regular expression library. In general, this regex library compares favorably
+(there are even a few benchmarks that PCRE simply runs too slowly on or
+outright can't execute at all). I would love to add other regular expression
+library benchmarks (especially RE2).

 If you're hacking on one of the matching engines and just want to see
 benchmarks, then all you need to run is:

-$ cargo bench --bench dynamic
+$ ./run-bench rust

 If you want to compare your results with older benchmarks, then try:

-$ cargo bench --bench dynamic | tee old
+$ ./run-bench rust | tee old
 $ ... make it faster
-$ cargo bench --bench dynamic | tee new
+$ ./run-bench rust | tee new
 $ cargo-benchcmp old new --improvements

 The `cargo-benchcmp` utility is available here:
 https://github.com/BurntSushi/cargo-benchcmp

-To run the same benchmarks on PCRE, you'll need to use the sub-crate in
-`regex-pcre-benchmark` like so:
-
-$ cargo bench --manifest-path regex-pcre-benchmark/Cargo.toml
-
-The PCRE benchmarks are separated from the main regex crate so that its
-dependency doesn't break builds in environments without PCRE.
+The `run-bench` utility can run benchmarks for PCRE and Oniguruma too. See
+`./run-bench --help`.
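The search-parameter paragraph added to HACKING.md in this diff says that the presence or absence of capture storage changes how an engine behaves. A toy illustration of that idea follows, under stated assumptions: `search` is a hypothetical function, the "pattern" is just an ordered list of literal alternatives, and the capture storage is a two-slot array holding the start and end of the match. None of this mirrors the types in src/params.rs.

```rust
/// Hypothetical toy engine. Without capture storage it returns as soon as
/// any alternative is witnessed; with storage it keeps going until the
/// leftmost match (earlier alternatives win ties) is known.
fn search(
    haystack: &str,
    needles: &[&str],
    slots: Option<&mut [Option<usize>; 2]>,
) -> bool {
    let mut best: Option<(usize, usize)> = None;
    for needle in needles {
        if let Some(start) = haystack.find(needle) {
            if slots.is_none() {
                // No storage: a witnessed match is enough, even if it is
                // not the leftmost one.
                return true;
            }
            let cand = (start, start + needle.len());
            if best.map_or(true, |b| cand.0 < b.0) {
                best = Some(cand);
            }
        }
    }
    if let (Some((start, end)), Some(out)) = (best, slots) {
        out[0] = Some(start);
        out[1] = Some(end);
        return true;
    }
    false
}
```

Calling `search("xxab", &["b", "a"], None)` answers after examining only the first alternative, while passing `Some(&mut slots)` does the extra work needed to report the span `2..3`.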

benches/Cargo.toml

Lines changed: 0 additions & 1 deletion
@@ -14,7 +14,6 @@ enum-set = "0.0.6"
 lazy_static = "0.1"
 onig = { version = "0.4", optional = true }
 pcre = { version = "0.2", optional = true }
-rand = "0.3"
 regex = { version = "0.1", path = ".." }
 regex_macros = { version = "0.1", path = "../regex_macros", optional = true }
 regex-syntax = { version = "0.3", path = "../regex-syntax" }

benches/src/bench_onig.rs

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@

 #[macro_use] extern crate lazy_static;
 extern crate onig;
-extern crate rand;
 extern crate test;

 use std::ops::Deref;

benches/src/bench_pcre.rs

Lines changed: 0 additions & 1 deletion
@@ -24,7 +24,6 @@
 extern crate enum_set;
 #[macro_use] extern crate lazy_static;
 extern crate pcre;
-extern crate rand;
 extern crate test;

 /// A nominal wrapper around pcre::Pcre to expose an interface similar to

benches/src/bench_rust.rs

Lines changed: 0 additions & 1 deletion
@@ -11,7 +11,6 @@
 #![feature(test)]

 #[macro_use] extern crate lazy_static;
-extern crate rand;
 extern crate regex;
 extern crate regex_syntax;
 extern crate test;

benches/src/bench_rust_bytes.rs

Lines changed: 0 additions & 1 deletion
@@ -11,7 +11,6 @@
 #![feature(test)]

 #[macro_use] extern crate lazy_static;
-extern crate rand;
 extern crate regex;
 extern crate regex_syntax;
 extern crate test;

benches/src/bench_rust_plugin.rs

Lines changed: 0 additions & 1 deletion
@@ -12,7 +12,6 @@
 #![plugin(regex_macros)]

 #[macro_use] extern crate lazy_static;
-extern crate rand;
 extern crate regex;
 extern crate regex_syntax;
 extern crate test;

benches/src/misc.rs

Lines changed: 8 additions & 0 deletions
@@ -130,6 +130,14 @@ bench_match!(one_pass_long_prefix_not, regex!("^.bcdefghijklmnopqrstuvwxyz.*$"),
     "abcdefghijklmnopqrstuvwxyz".to_owned()
 });

+bench_match!(long_needle1, regex!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaab"), {
+    repeat("a").take(100_000).collect::<String>() + "b"
+});
+
+bench_match!(long_needle2, regex!("bbbbbbbbbbbbbbbbbbbbbbbbbbbbbba"), {
+    repeat("b").take(100_000).collect::<String>() + "a"
+});
+
 #[cfg(feature = "re-rust")]
 #[bench]
 fn replace_all(b: &mut Bencher) {

benches/src/rust_compile.rs

Lines changed: 21 additions & 0 deletions
@@ -29,6 +29,13 @@ fn compile_simple_bytes(b: &mut Bencher) {
     });
 }

+#[bench]
+fn compile_simple_full(b: &mut Bencher) {
+    b.iter(|| {
+        regex!(r"^bc(d|e)*$")
+    });
+}
+
 #[bench]
 fn compile_small(b: &mut Bencher) {
     b.iter(|| {
@@ -45,6 +52,13 @@ fn compile_small_bytes(b: &mut Bencher) {
     });
 }

+#[bench]
+fn compile_small_full(b: &mut Bencher) {
+    b.iter(|| {
+        regex!(r"\p{L}|\p{N}|\s|.|\d")
+    });
+}
+
 #[bench]
 fn compile_huge(b: &mut Bencher) {
     b.iter(|| {
@@ -60,3 +74,10 @@ fn compile_huge_bytes(b: &mut Bencher) {
         Compiler::new().bytes(true).compile(&[re]).unwrap()
     });
 }
+
+#[bench]
+fn compile_huge_full(b: &mut Bencher) {
+    b.iter(|| {
+        regex!(r"\p{L}{100}")
+    });
+}

regex-debug/Cargo.toml

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+[package]
+publish = false
+name = "regex-debug"
+version = "0.1.0"
+authors = ["The Rust Project Developers"]
+license = "MIT/Apache-2.0"
+repository = "https://github.com/rust-lang/regex"
+documentation = "http://doc.rust-lang.org/regex"
+homepage = "https://github.com/rust-lang/regex"
+description = "A tool useful for debugging regular expressions."
+
+[dependencies]
+docopt = "0.6"
+regex = { version = "0.1", path = ".." }
+regex-syntax = { version = "0.3", path = "../regex-syntax" }
+rustc-serialize = "0.3"
+
+[profile.release]
+debug = true
