ACP: str::chunks with chunks being &str #592

Closed
tkr-sh opened this issue May 25, 2025 · 5 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@tkr-sh

tkr-sh commented May 25, 2025

Proposal

Problem statement

The std currently provides various methods for chunking slices (chunks, chunks_exact, rchunks, array_chunks, utf8_chunks, ...) and iterators (array_chunks).
However, there is no equivalent method for string slices, and the "developer experience" of chunking a &str could be improved.

Motivating examples or use cases

Chunking is an action that may often be needed when working with data that can be seen as an iterator.
This is why there are methods for this with slices and iterators.
But there are none for &str, even though it would often be useful!
Here are some examples:

  • Converting binary or hexadecimal strings into an iterator of integers.
    Currently we would do
let hex = "0xABCDEF";
let values = hex[2..]
    .bytes()
    .array_chunks::<2>()  // unstable (nightly only)
    .map(|arr| u8::from_str_radix(str::from_utf8(&arr).unwrap(), 16));  // needs .unwrap()

// Instead of possibly doing

let values = hex[2..]
    .chunks(2)  // proposed API
    .map(|s| u8::from_str_radix(s, 16));
  • Processing some padded data like hello---only----8-------chars---
  • Wrapping some text safely
let user_text = "...";
user_text.chunks(width).intersperse("\n").collect::<String>()

Overall, anything that handles data with a repetitive pattern, or that involves wrapping or formatting, would benefit from this function.

Another problem is that array_chunks doesn't behave the same way as slice::chunks: the last chunk is discarded if it is shorter than chunk_size, which isn't always wanted.
But if you want the slice::chunks behaviour today, you have to create an unnecessary vector:

let vec = "hello world".chars().collect::<Vec<_>>(); // Really inefficient
vec.as_slice().chunks(4) // [['h', 'e', 'l', 'l'], ['o', ' ', 'w', 'o'], ['r', 'l', 'd']]
// instead of just
"hello world".chunks(4) // ["hell", "o wo", "rld"]

This approach:

  1. takes more code
  2. is less readable
  3. owns some unnecessary data
  4. loses the borrow lifetime of the initial string slice
fn example_when_owning(s: &str) -> Vec<&str> {
    let vec = s.bytes().collect::<Vec<_>>();
    vec.as_slice()
        .chunks(4)
        .map(|bytes| str::from_utf8(bytes).unwrap())
        .collect() // Error! The returned &str values borrow from `vec`, which is local to this function
}

fn example_when_borrowing(s: &str) -> Vec<&str> {
    s.chunks(4).collect() // works fine! (with the proposed API)
}

Also, str::chunks() is faster than Chars::array_chunks() (without even considering str::from_utf8().unwrap())

Solution sketch

  • Create a new str::Chunks in core/src/str/iter.rs and implement Iterator & DoubleEndedIterator on it
  • Create a new method on str:
pub fn chunks(&self, chunk_size: usize) -> str::Chunks<'_> {
    str::Chunks::new(self, chunk_size)
}

Implementation at https://github.com/tkr-sh/rust/tree/str-chunks
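
For readers who want to try the idea on stable Rust, here is a minimal sketch of what such an iterator could look like. It is not taken from the linked branch: the names `CharChunks` and `char_chunks` are hypothetical, only the forward `Iterator` is implemented, and a zero chunk size is assumed to panic, as `slice::chunks` does.

```rust
/// Hypothetical stand-in for the proposed `str::Chunks`.
/// Yields `&str` pieces of at most `chunk_size` chars each.
struct CharChunks<'a> {
    rest: &'a str,
    chunk_size: usize,
}

impl<'a> CharChunks<'a> {
    fn new(s: &'a str, chunk_size: usize) -> Self {
        // Assumption: mirror `slice::chunks`, which panics on a zero size.
        assert!(chunk_size != 0, "chunk size must be non-zero");
        CharChunks { rest: s, chunk_size }
    }
}

impl<'a> Iterator for CharChunks<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.rest.is_empty() {
            return None;
        }
        // Byte offset of the char that starts the next chunk,
        // or the end of the string if fewer chars remain.
        let split = self
            .rest
            .char_indices()
            .nth(self.chunk_size)
            .map_or(self.rest.len(), |(idx, _)| idx);
        let (head, tail) = self.rest.split_at(split);
        self.rest = tail;
        Some(head)
    }
}

/// Convenience wrapper; the returned slices borrow from `s`.
fn char_chunks(s: &str, chunk_size: usize) -> Vec<&str> {
    CharChunks::new(s, chunk_size).collect()
}
```

Because every item is a sub-slice of the input, the chunks keep the lifetime of the original `&str`, which is the borrowing property described in the motivation.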

Drawbacks

.chunks() on &str isn't necessarily clear about whether it operates on u8 or char. Though, if the chunks are &str, it makes sense that it operates on chars.

Alternatives

  • .chars().collect() then vec.as_slice().chunks(), but it's significantly longer and owns data unnecessarily. See motivation.
  • .chars().array_chunks() but it's unstable, slower and doesn't behave in the same way. See motivation.

Links and related work


From rust-lang/rfcs#3818

@tkr-sh added the T-libs-api and api-change-proposal labels May 25, 2025
@tkr-sh changed the title from "str::chunks with chunks being &str" to "ACP: str::chunks with chunks being &str" May 25, 2025
@scottmcm
Member

Why is a consistent number of USVs a useful operation to do?

To me this seems like something for unicode-segmentation rather than std.

@clarfonthey

I agree that "number of characters" is generally not a desired operation, and that it's much better to defer to the various Unicode segmentation algorithms instead.

I would argue that most of the issues here stem from the absence of methods like from_str_radix on bytes, although that method in particular is tracked as from_ascii_radix and is currently unstable.
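
As a rough illustration of parsing directly from bytes on stable Rust, the hex example can be written with `slice::chunks` and `char::to_digit`, without going back through `&str`. The helper name `parse_hex_pairs` is hypothetical, and rejecting an odd number of digits is an assumption of this sketch.

```rust
/// Parses pairs of ASCII hex digits into bytes, working on `&[u8]` directly.
/// Returns `None` on a non-hex byte or a trailing lone digit.
fn parse_hex_pairs(hex: &str) -> Option<Vec<u8>> {
    hex.as_bytes()
        .chunks(2) // stable `slice::chunks`
        .map(|pair| {
            // A final chunk of length 1 does not match and is rejected.
            let &[hi, lo] = pair else { return None };
            let hi = (hi as char).to_digit(16)?;
            let lo = (lo as char).to_digit(16)?;
            // hi and lo are both below 16, so the sum fits in a u8.
            Some((hi * 16 + lo) as u8)
        })
        .collect()
}
```

With the `0x` prefix stripped, `parse_hex_pairs("ABCDEF")` yields the three bytes 0xAB, 0xCD and 0xEF.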

@bluebear94

If you wanted to chunk over chars, you could do str.chars().array_chunks::<N>() (on nightly), though this gives arrays of char instead of string slices. Also, a lot of uses for this seem to be focused on ASCII-based formats, in which case it makes more sense to take in a slice of (the still unstable) ascii::Char (or try to convert your &str to one).

@tkr-sh
Author

tkr-sh commented May 25, 2025

I'm OK with opening a PR for unicode_segmentation if you think that's a better idea!

Though, I think that wrapping some text to fit a specific format can also be a common use case


If you wanted to chunk over chars, you could do str.chars().array_chunks::<N>()

=>

Another problem is that array_chunks doesn't behave the same way as slice::chunks: the last chunk is discarded if it is shorter than chunk_size, which isn't always wanted.

.chars().array_chunks() but it's unstable, slower and doesn't behave in the same way. See motivation.


Also, a lot of uses for this seem to be focused on ASCII-based formats

I think that only the first one is about ASCII

@the8472
Member

the8472 commented May 27, 2025

We discussed this during today's libs-api meeting. We interpreted some of the motivating examples as exercises in manipulation of ASCII data. For those we suggest using the unstable ascii::Char APIs, e.g.

text.as_ascii().unwrap().chunks(N).map(<[AsciiChar]>::as_str)

There is also a recent PR to add a FromIterator impl, which would simplify joining them back into a String.
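
Until the ascii::Char APIs stabilize, a similar ASCII-only approach can be sketched on stable Rust with byte slices. This is an assumption-laden stand-in rather than the API suggested above: `ascii_chunks` is a hypothetical helper, and returning `None` for non-ASCII input or a zero chunk size is a choice made for this sketch.

```rust
/// Splits an ASCII-only string into `&str` chunks of at most `n` bytes.
/// Returns `None` if `text` is not pure ASCII or if `n` is zero.
fn ascii_chunks(text: &str, n: usize) -> Option<Vec<&str>> {
    if n == 0 || !text.is_ascii() {
        return None;
    }
    Some(
        text.as_bytes()
            .chunks(n)
            // Every byte is ASCII, so any sub-slice is valid UTF-8
            // and this conversion cannot fail.
            .map(|chunk| std::str::from_utf8(chunk).expect("ASCII is valid UTF-8"))
            .collect(),
    )
}
```

For the padded-data example from the proposal, `ascii_chunks("hello---only----8-------chars---", 8)` yields the four 8-byte fields, each still borrowing from the input.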

  • Wrapping some text safely

let user_text = "...";
user_text.chunks(width).intersperse("\n").collect::<String>()

This is an example of why it's not a good idea to add this API. As previous comments mentioned, splitting at Unicode scalar boundaries can have undesirable results. In this case it would split grapheme clusters, not rejoin them, and instead insert new chars in the middle, which would for example lead to torn emoji or diacritics.
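
The tearing can be reproduced on stable Rust with a small sketch. `first_n_chars` is a hypothetical helper that returns the prefix a char-based chunk of size `n` would contain; the inputs are a decomposed accented letter and an emoji sequence joined with zero-width joiners.

```rust
/// Returns the prefix of `s` holding its first `n` chars (Unicode scalar values).
/// Hypothetical helper, used only to show where a char-based chunk would end.
fn first_n_chars(s: &str, n: usize) -> &str {
    let end = s.char_indices().nth(n).map_or(s.len(), |(i, _)| i);
    &s[..end]
}

/// 'e' followed by U+0301 COMBINING ACUTE ACCENT: one grapheme, two chars.
const ACCENTED_E: &str = "e\u{301}";

/// MAN, ZWJ, WOMAN, ZWJ, GIRL: one rendered emoji, five chars.
const FAMILY: &str = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
```

Here `first_n_chars(ACCENTED_E, 1)` is a bare "e" with the accent left for the next chunk, and `first_n_chars(FAMILY, 2)` ends on a dangling zero-width joiner.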

Plus, fixed-width wrapping is very tricky in Unicode. E.g. the following quotes, rendered in monospace, each contain one char

let _: [char; 4] = [
    '​',
    'W',
    'W',
    '﷽',
];

Considering these motivating examples, we're going to reject the proposal and recommend using more specialized APIs instead. Chunking over chars would appear to be a deceptively simple tool, but would in many cases lead users down the wrong path.


And a small note: "but it's unstable" is not an argument in favor of a new API, since the new API would also start as unstable.

@the8472 closed this as not planned May 27, 2025

5 participants