impl core::str::Pattern for [&str; N]
#395
This seems like an enormous drawback and performance footgun. And not just "oh it isn't SIMD optimized," but by imposing the non-allocating restriction, you probably have a serious time complexity footgun here too. I'm not even sure what's the best you can do for multi-substring search without either any allocation or some kind of limit on the size/number of patterns.
Worst case would be if the haystack is highly repetitive and most of the pattern strings are very similar. Something like:

```rust
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab".find(&[
    "aaaaaaaaaaaaaab",
    "aaaaaaaaaaaaaac",
    "aaaaaaaaaaaaaad",
    "aaaaaaaaaaaaaae",
    "z", // kills common prefix optimization
])
```
The time complexity would be O(# strings in needle) * (time complexity of single-substring search).
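For concreteness, the naïve non-allocating approach being discussed amounts to one independent full scan per needle (an illustrative sketch, not the implementation proposed here; `find_any` is a made-up helper name):

```rust
// Naïve multi-substring search: one independent full scan per needle,
// i.e. O(# of needles) single-substring searches over the haystack.
fn find_any(haystack: &str, needles: &[&str]) -> Option<usize> {
    needles
        .iter()
        .filter_map(|needle| haystack.find(needle))
        .min() // leftmost match across all needles
}

fn main() {
    // "lo" matches at byte 3, "world" at byte 6; the leftmost wins.
    assert_eq!(find_any("hello world", &["world", "lo"]), Some(3));
    assert_eq!(find_any("hello world", &["xyz"]), None);
    println!("ok");
}
```

On a 1GB haystack with 100 needles, this is 100 full scans, which is the footgun described above.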
I'm pretty sure I would never approve of this coming into the standard library. You will just never convince me that this proposal won't lead to folks being constantly surprised by how slow it is. The fact that you keep qualifying this with "assuming everything is small" isn't a mitigation from my perspective, because there is nothing in your proposal preventing folks from using it with inputs that aren't small.
A note in the docs would do the job. This wouldn't be the only API in std that prioritizes broad compatibility and small code size over maximum performance for every possible input.
I don't agree it would do the job. And I don't know of any precedent that is even remotely in the same league as the perf footgun this would introduce.
Well, there's the provider API where I'm unhappy about the perf implications. But that's still unstable.
One alternative would be to not have any implementation for slices, only arrays; that way, we can store the per-needle search state on the stack, since the array length is a compile-time constant.
I wouldn't accept this as a precedent until I could see how the perf footgun actually compares. The perf footgun being proposed here is not just a small thing. It's enormous. Consider searching 100 strings in a 1GB haystack. That's 100 full scans of that haystack. That's 99 more than is actually necessary. And if you clear that precedent hurdle, I'd also need to see what the alternatives are. Even if the perf footgun is enormous, it has to be considered in the context of what else is actually possible. In this case, the API being proposed has a very simple alternative: use a third party crate.
And that is not the only hurdle to clear here. The other hurdle is the construction cost. We have that problem with single substring search today, but I speculate it would be worse with multi-substring search.
If we can find a workable algorithm, one option would be to only use it if the number or length of the needles is above some threshold, and otherwise use the naïve solution (which would presumably be faster to construct).
I think our best bet would likely be Rabin-Karp, with (# of needles) rolling hashes. Average-case complexity would be O(length of haystack * # of needles), but worst-case time complexity would be O(length of haystack * sum of needle lengths), the same as the naïve implementation. Because we have no access to randomness in core, the hash function would have to be deterministic, so the worst case could be hit by adversarial inputs.

We could try to only require one hash per set of needles with the same length, perhaps; but that will only be a benefit when many of the needles have the same length, so it would have to be weighed against the impact on implementation complexity and construction cost. One other potential optimization would be to use the naïve approach instead of Rabin-Karp for very small needles.

I've updated the OP to reflect this.
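A rough sketch of what multi-needle Rabin-Karp with only O(# of needles) stack space might look like, using a const-generic array so all per-needle state lives in fixed-size arrays (the base and hash scheme here are illustrative choices, and `find_multi` is a hypothetical helper, not what std would necessarily do):

```rust
// Multi-needle Rabin-Karp, no heap allocation: all per-needle state is
// kept in [_; N] stack arrays, matching the arrays-only idea above.
fn find_multi<const N: usize>(haystack: &str, needles: [&str; N]) -> Option<usize> {
    const BASE: u64 = 257;
    let hay = haystack.as_bytes();

    let mut needle_hash = [0u64; N];
    let mut window_hash = [0u64; N];
    let mut pow = [1u64; N]; // BASE^(len - 1), used to drop the outgoing byte
    let mut active = [false; N];

    for i in 0..N {
        let nb = needles[i].as_bytes();
        if nb.is_empty() || nb.len() > hay.len() {
            continue; // sketch simplification: ignore needles that can't match
        }
        active[i] = true;
        for &b in nb {
            needle_hash[i] = needle_hash[i].wrapping_mul(BASE).wrapping_add(b as u64);
        }
        for &b in &hay[..nb.len()] {
            window_hash[i] = window_hash[i].wrapping_mul(BASE).wrapping_add(b as u64);
        }
        for _ in 1..nb.len() {
            pow[i] = pow[i].wrapping_mul(BASE);
        }
    }

    for p in 0..hay.len() {
        for i in 0..N {
            let len = needles[i].len();
            if !active[i] || p + len > hay.len() {
                continue;
            }
            // A hash hit must be confirmed byte-wise (collisions are possible).
            if window_hash[i] == needle_hash[i] && &hay[p..p + len] == needles[i].as_bytes() {
                return Some(p);
            }
            // Roll this needle's window forward if it still fits.
            if p + len < hay.len() {
                window_hash[i] = window_hash[i]
                    .wrapping_sub((hay[p] as u64).wrapping_mul(pow[i]))
                    .wrapping_mul(BASE)
                    .wrapping_add(hay[p + len] as u64);
            }
        }
    }
    None
}

fn main() {
    assert_eq!(find_multi("hello world", ["world", "xyz"]), Some(6));
    assert_eq!(find_multi("abc", ["zz", "b"]), Some(1));
    println!("ok");
}
```

Note the worst case mentioned above: a hash collision (or a deliberately crafted input, given the fixed hash) forces a full byte comparison per needle per position.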
impl core::str::Pattern for &[&str]
impl core::str::Pattern for [&str; N]
I would need to see an implementation, because I'm not sure how you implement Rabin-Karp with O(1) memory. I want to be clear, though, that even if you solve the time complexity problem, I've pointed out other hurdles that IMO still make this proposal dead in the water. (In other words, you might not want to spend your time trying to come up with the implementation I'm asking for unless you're intrinsically motivated to discover it.)
First of all, I realize I made a mistake/misinterpreted your comment above, sorry. I really did mean O(needles) of memory, not O(1). And secondly, I am realizing that we may not need to use Rabin-Karp (with its bad worst case): the two-way search algorithm that the single-string search functions use should also be viable, with some adaptation. I'll need to think about it more though.
If you are referring to long initialization time: as I mentioned earlier, we can dynamically choose, based on the length of each needle (and possibly also the haystack), which search algorithm to use for that needle. So for short needles, we can use the naïve implementation with its fast construction time. (This is something the single-string search could be doing also.)
I'm well aware of the strategy of dynamically choosing an algorithm.
A hit that we are forced to take anyway, due to the design of the existing Pattern API.
You're only forced to pay it because of how the current Pattern API works.

My objection here is on two completely different grounds. The first is that the API itself imposes a ceiling on performance, because every single search (other than in an iterator) is forced to rebuild whatever is needed for the search. The second is that even if we didn't care about that construction cost, the implementation being restricted to zero allocation is significant and severely limits what you're able to do.

The way to resolve the first objection is to champion a new substring search API that permits amortizing searcher construction. I do not see any other possible choice here. I do not see myself approving multi-substring search in std that doesn't allow amortizing searcher construction. And this is a bit of a catch-22, because amortizing searcher construction will likely make the API surface area much bigger and more complicated, which will in turn make it harder to drive consensus. And now we're getting into "why not just use the aho-corasick crate?" territory.

The way to resolve the second objection is to find a way around the no-allocation restriction. You are chasing a path that involves using stack space proportional to the number of needles (which seems dubious to me) and dynamic switching of the algorithms employed, but I personally don't see a way to make that work in a way that would be appropriate for std.
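To illustrate the "amortized searcher construction" shape being asked for here: build the searcher once, reuse it across many searches. A minimal hand-rolled sketch of that API shape (the `MultiFinder` name is made up, and the inner search is deliberately naïve; the point is where the construction cost lands, not the algorithm):

```rust
// Sketch of an API that amortizes searcher construction: `MultiFinder`
// is built once (a real implementation would do its expensive
// preprocessing here) and then reused across many haystacks.
struct MultiFinder<'a> {
    needles: Vec<&'a str>,
    // A real searcher would precompute hashes, an automaton, etc. here.
}

impl<'a> MultiFinder<'a> {
    fn new(needles: &[&'a str]) -> Self {
        MultiFinder { needles: needles.to_vec() }
    }

    fn find(&self, haystack: &str) -> Option<usize> {
        self.needles.iter().filter_map(|n| haystack.find(n)).min()
    }
}

fn main() {
    let finder = MultiFinder::new(&["world", "lo"]);
    // Construction happened once; both searches below reuse it.
    assert_eq!(finder.find("hello world"), Some(3));
    assert_eq!(finder.find("worldwide"), Some(0));
    println!("ok");
}
```

This is essentially the shape third-party crates such as aho-corasick already expose, and it is incompatible with the current `Pattern` trait, where the searcher is rebuilt on every call like `haystack.find(pattern)`.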
I agree, but I also don't see why it should be a blocker to making any progress at all. As you note, the searcher construction problem already exists for the single-substring case. And the extra cost per needle is no greater for the API I am proposing than for the existing single-substring search.

Also, the initial heuristic for whether to use the naïve algorithm would likely be just a check of the needle length, so there would only be a risk of major slowdown when using many long needles. Which is perfectly acceptable, just like it's perfectly acceptable that the existing single-substring search is slower than optimized third-party implementations for some inputs.
No, I want an implementation that doesn't allocate. I want the implementation of the proposed Pattern impls to not allocate.
With an enormous perf footgun.
Because I don't want to make it worse. And I speculate that it will be exacerbated for the multiple needle case. Otherwise, it looks like we're going in circles.
I don't see anything you've proposed here that can't be done by a third-party crate. I would suggest that you go and prototype what you want in a crate. Once that's done, feel free to ping my handle for a review. Otherwise, I'm unsubscribing from this thread.
Agreed, will do |
Proposal
Problem statement
I want to:
Motivating examples or use cases
Solution sketch
Implement `core::str::Pattern` for `[&str; N]` and `&[&str; N]`. The corresponding `Searcher`s will implement `ReverseSearcher`, but not `DoubleEndedSearcher`.

In the future, once negative impls are stable, implement `!Borrow<str>` for `char`, and extend the `Pattern` impls to apply to arrays/slices of any `impl Borrow<str>`.

The implementation would use a non-allocating algorithm with `O(len(haystack) * |needles|)` time complexity (hopefully even in the worst case, though I will need to consider further whether this is achievable), as opposed to something more involved like `aho-corasick`. The exact algorithm may also change dynamically based on the length of the haystack and/or each needle.

Alternatives
This functionality could be implemented as a crate on crates.io, but this would not allow usage with all the `std::str` APIs that use `Pattern`, like `find` and `split`.

What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capacity becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: