Skip to content

add BytePattern for searching &[u8] #134350

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

folkertdev
Copy link
Contributor

tracking issue: #134149

Right now, this duplicates a fair amount of code from core::str::pattern, to make sure that there isn't some sneaky interaction with utf8 boundaries. I think it is possible to share TwoWaySearcher, I suppose it would make most sense to have str use the slice code in this case? But then there has also been talk about generalizing Pattern to slices more broadly (in some limited cases?). So before doing a lot of tedious work, I'd like to work out what the best code sharing strategy is.

Also, only some of the basic API functions are implemented right now. All the iterator variations can be added incrementally, looking at that all in one go (making sure that docs are all correct ect) was a bit too much for me, and probably for reviewers too. So that'll come later in smaller pieces.

some other general questions

  • naming: how hard should we try to not clash with names (e.g. could ByteSearcher just be Searcher given its home module)?
  • are the right things annotated with unstable?

@rustbot
Copy link
Collaborator

rustbot commented Dec 15, 2024

r? @ibraheemdev

rustbot has assigned @ibraheemdev.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Dec 15, 2024
@rust-log-analyzer

This comment has been minimized.

@BurntSushi
Copy link
Member

I think it is possible to share TwoWaySearcher, I suppose it would make most sense to have str use the slice code in this case?

Just a drive-by comment for now, but I believe there is and ought to be only one behavior difference, and it only applies when the needle is the empty string. In the str case, the empty string should only match at UTF-8 boundaries. But in the [u8] case, the empty string should match at every byte offset.

Apologies if you already knew this, but it's subtle and wanted to flag it. And also say that there shouldn't be any other behavior differences.

@ibraheemdev
Copy link
Member

ibraheemdev commented Jan 16, 2025

This is a lot of code that I'm not very familiar with so I'm going to pass this one on (feel free to take this one @BurntSushi if you are up for it). r? libs

@rustbot rustbot assigned Noratrieb and unassigned ibraheemdev Jan 16, 2025
@folkertdev
Copy link
Contributor Author

Yeah makes sense, If there is a decent way of making this PR smaller we can look at that too.

Just a drive-by comment for now, but I believe there is and ought to be only one behavior difference, and it only applies when the needle is the empty string. In the str case, the empty string should only match at UTF-8 boundaries. But in the [u8] case, the empty string should match at every byte offset.

Yes, I hope I've found all the places where that matters. From what I can see, the TwoWaySearcher (a big part of the complexity of the pattern searching) does not actually handle UTF-8 boundaries. That is done in the StrSearcher.

So maybe a better first PR is to move TwoWaySearcher over to slice::byte_pattern, and use it from str::pattern?

@folkertdev
Copy link
Contributor Author

I'll close this in favor of #135931 and other followups that break this change up into more manageable pieces

@folkertdev folkertdev closed this Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants