ACP: `str::chunks` with chunks being `&str` #592

tkr-sh · 2025-05-25T00:10:32Z

Proposal

Problem statement

The standard library currently provides various methods for chunking slices (chunks, chunks_exact, rchunks, array_chunks, utf8_chunks, ...) and iterators (array_chunks).
However, there is no equivalent method for string slices, and the developer experience of chunking a &str could be improved.

Motivating examples or use cases

Chunking is often needed when working with data that can be viewed as a sequence.
This is why slices and iterators have methods for it.
There are none for &str, even though they would often be useful.
Here are some examples:

Converting binary or hexadecimal strings into an iterator of integers.
Currently we would write

let hex = "0xABCDEF";
let values = hex[2..]
    .bytes()
    .array_chunks::<2>()  // unstable
    .map(|arr| u8::from_str_radix(str::from_utf8(&arr).unwrap(), 16));  // needs .unwrap()

// Instead of possibly doing

let values = hex[2..]
    .chunks(2)
    .map(|chunk| u8::from_str_radix(chunk, 16));
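
For comparison, here is a minimal sketch of the same conversion using only stable APIs, assuming the input is ASCII. The helper name `parse_hex_pairs` is hypothetical and not part of the proposal; it chunks the underlying bytes rather than chars, which is only sound because ASCII input is checked first.

```rust
/// Parses consecutive pairs of hex digits into bytes.
/// Returns `None` for non-ASCII input or for any chunk that is not valid hex.
/// Note: with an odd number of digits, the final single digit is parsed on its own.
fn parse_hex_pairs(hex: &str) -> Option<Vec<u8>> {
    // Byte chunking is only safe on ASCII: every byte is then a full char.
    if !hex.is_ascii() {
        return None;
    }
    hex.as_bytes()
        .chunks(2)
        .map(|pair| {
            // Cannot fail for ASCII input, but handled without panicking anyway.
            let digits = std::str::from_utf8(pair).ok()?;
            u8::from_str_radix(digits, 16).ok()
        })
        .collect()
}

fn main() {
    let hex = "0xABCDEF";
    let values = parse_hex_pairs(&hex[2..]);
    println!("{values:?}"); // prints Some([171, 205, 239])
}
```

Collecting into `Option<Vec<u8>>` stops at the first chunk that fails to parse, so the caller gets either every byte or nothing.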

Processing some padded data like hello---only----8-------chars---
Wrapping some text safely

let user_text = "...";
user_text.chunks(width).intersperse("\n").collect::<String>()

Overall, anything that handles data with a repetitive pattern, or that involves wrapping or formatting, would benefit from this function.

Another problem is that array_chunks doesn't behave like slice::chunks: the last chunk is discarded if it is shorter than chunk_size, which isn't always wanted.
To keep that last chunk today, you have to create an unnecessary vector:

let vec = "hello world".chars().collect::<Vec<_>>(); // Really inefficient
vec.as_slice().chunks(4) // &[char] chunks spelling "hell", "o wo", "rld"
// instead of just
"hello world".chunks(4) // ["hell", "o wo", "rld"]

This workaround is

more code
less readable
owning some unnecessary data
losing the borrowing lifetime of the initial string slice

fn example_when_owning(s: &str) -> Vec<&str> {
    let vec = s.bytes().collect::<Vec<_>>();
    vec.as_slice()
        .chunks(4)
        .map(|bytes| str::from_utf8(bytes).unwrap())
        .collect() // Error! The returned &str values borrow from `vec`, which is local to this function
}

fn example_when_borrowing(s: &str) -> Vec<&str> {
    s.chunks(4).collect() // works fine: the chunks borrow from `s`
}
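
The remainder behaviour mentioned earlier can be seen on stable Rust with slices. Since Iterator::array_chunks is unstable, this sketch uses slice::chunks_exact as a stand-in for the "drop the short tail" behaviour; the helper names are hypothetical.

```rust
/// `chunks` keeps a shorter final chunk.
fn keep_remainder(data: &[i32], n: usize) -> Vec<Vec<i32>> {
    data.chunks(n).map(|c| c.to_vec()).collect()
}

/// `chunks_exact` yields only full chunks; the tail is available via `remainder()`.
fn drop_remainder(data: &[i32], n: usize) -> (Vec<Vec<i32>>, Vec<i32>) {
    let iter = data.chunks_exact(n);
    let rest = iter.remainder().to_vec();
    (iter.map(|c| c.to_vec()).collect(), rest)
}

fn main() {
    let data = [1, 2, 3, 4, 5];
    println!("{:?}", keep_remainder(&data, 2));
    println!("{:?}", drop_remainder(&data, 2));
}
```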

Also, str::chunks() is faster than Chars::array_chunks() (without even considering str::from_utf8().unwrap())

Solution sketch

Create a new str::Chunks in core/src/str/iter.rs and implement Iterator & DoubleEndedIterator on it
Create a new method on str:

pub fn chunks(&self, chunk_size: usize) -> str::Chunks<'_> {
    str::Chunks::new(self, chunk_size)
}
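
As a rough illustration of the idea, here is a minimal standalone sketch of an iterator that yields `&str` chunks of at most `chunk_size` chars. The type name `CharChunks` and its fields are hypothetical; this is not the code from the linked branch, and it only implements forward iteration.

```rust
/// Hypothetical iterator over `&str` chunks containing at most `chunk_size` chars each.
struct CharChunks<'a> {
    rest: &'a str,
    chunk_size: usize,
}

impl<'a> CharChunks<'a> {
    fn new(s: &'a str, chunk_size: usize) -> Self {
        assert!(chunk_size != 0, "chunk size must be non-zero");
        Self { rest: s, chunk_size }
    }
}

impl<'a> Iterator for CharChunks<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        if self.rest.is_empty() {
            return None;
        }
        // Byte offset at which the next chunk starts, or the end of the
        // string if fewer than `chunk_size + 1` chars remain.
        let split = self
            .rest
            .char_indices()
            .nth(self.chunk_size)
            .map_or(self.rest.len(), |(idx, _)| idx);
        let (head, tail) = self.rest.split_at(split);
        self.rest = tail;
        Some(head)
    }
}

/// The chunks borrow from `s`, so they can be returned from the function.
fn chunk_borrowed(s: &str) -> Vec<&str> {
    CharChunks::new(s, 4).collect()
}

fn main() {
    println!("{:?}", chunk_borrowed("hello world")); // prints ["hell", "o wo", "rld"]
}
```

Because the split point always comes from `char_indices`, the sketch never slices inside a multi-byte char, though it can still split a grapheme cluster.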

Implementation at https://github.com/tkr-sh/rust/tree/str-chunks

Drawbacks

.chunks() on &str isn't necessarily clear about whether it operates on u8 or char. However, since the chunks are &str, it makes sense that it operates on chars.

Alternatives

.chars().collect() then vec.as_slice().chunks(), but it's significantly longer and owns data unnecessarily. See motivation.
.chars().array_chunks() but it's unstable, slower and doesn't behave in the same way. See motivation.

Links and related work

slice::chunks(usize)
str::chars()
Iterator::array_chunks(usize)
ACP: add str::chunks, str::chunks_exact, and str::windows #590
- It was rejected in part because
  
  Given the issues related to UTF-8 boundaries causing potential foot-guns [...]
  
  which shouldn't affect this ACP.

From rust-lang/rfcs#3818


scottmcm · 2025-05-25T01:34:02Z

Why is a consistent number of USVs a useful operation to do?

To me this seems like something for unicode-segmentation rather than std.

clarfonthey · 2025-05-25T02:18:22Z

I agree that "number of characters" is generally not a desired operation, and that it's much better to defer to the various Unicode segmentation algorithms instead.

I would argue that most of the issues here stem from methods like from_str_radix not being available on bytes, although that method in particular is tracked as from_ascii_radix and is currently unstable.

bluebear94 · 2025-05-25T02:36:44Z

If you wanted to chunk over chars, you could do str.chars().array_chunks::<N>() (on nightly), though this gives arrays of char instead of string slices. Also, a lot of uses for this seem to be focused on ASCII-based formats, in which case it makes more sense to take in a slice of (the still unstable) ascii::Char (or try to convert your &str to one).

tkr-sh · 2025-05-25T12:55:18Z

I'm happy to open a PR for unicode_segmentation if you think that is a better idea!

Though, I think that wrapping some text to fit a specific format can also be a common use case.

If you wanted to chunk over chars, you could do str.chars().array_chunks::<N>()

=>

Another problem is that array_chunks doesn't behave like slice::chunks: the last chunk is discarded if it is shorter than chunk_size, which isn't always wanted.

.chars().array_chunks() but it's unstable, slower and doesn't behave in the same way. See motivation.

Also, a lot of uses for this seem to be focused on ASCII-based formats

I think that only the first one is about ASCII

the8472 · 2025-05-27T20:14:31Z

We discussed this during today's libs-api meeting. We interpreted some of the motivating examples as exercises in manipulation of ASCII data. For those we suggest using the unstable ascii::Char APIs, e.g.

text.as_ascii().unwrap().chunks(N).map(<[AsciiChar]>::as_str)

There also is a recent PR to add a FromIterator impl which would simplify joining them back into a String.
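
Until the ascii::Char APIs stabilize, a similar effect is possible on stable Rust by checking for ASCII up front and chunking the bytes. This is only a sketch under that assumption; the helper name `ascii_chunks` is hypothetical.

```rust
/// Splits ASCII text into `&str` chunks of at most `n` bytes.
/// Returns `None` if the text contains any non-ASCII char.
/// Panics if `n` is zero, like `slice::chunks`.
fn ascii_chunks(text: &str, n: usize) -> Option<Vec<&str>> {
    if !text.is_ascii() {
        return None;
    }
    Some(
        text.as_bytes()
            .chunks(n)
            // Cannot fail: in ASCII text every byte offset is a char boundary.
            .map(|chunk| std::str::from_utf8(chunk).expect("ASCII is valid UTF-8"))
            .collect(),
    )
}

fn main() {
    let padded = "hello---only----8-------chars---";
    println!("{:?}", ascii_chunks(padded, 8));
}
```

The returned chunks borrow from the input, so no intermediate `Vec<char>` or `String` is needed.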

Wrapping some text safely

let user_text = "...";
user_text.chunks(width).intersperse("\n").collect::<String>()

This is an example of why it's not a good idea to add this API. As previous comments mentioned, splitting at Unicode scalar boundaries can have undesirable results. In this case it would split grapheme clusters, not rejoin them, and instead insert new chars in the middle, which would for example lead to torn emoji or diacritics.
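
A small sketch of the tearing problem using only stable std. The helper `wrap_by_chars` is a hypothetical stand-in for the proposed chunk-then-join wrapping, not the proposed API itself.

```rust
/// Wraps text by inserting a newline after every `width` chars.
fn wrap_by_chars(text: &str, width: usize) -> String {
    let chars: Vec<char> = text.chars().collect();
    chars
        .chunks(width)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // two chars, but a single grapheme cluster when rendered.
    let text = "cafe\u{301}";
    assert_eq!(text.chars().count(), 5);

    // The newline lands between the base letter and its accent,
    // so the accent is separated from the 'e' it belongs to.
    let wrapped = wrap_by_chars(text, 4);
    assert_eq!(wrapped, "cafe\n\u{301}");
}
```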

Plus, fixed-width wrapping is very tricky in Unicode. E.g. each of the following quotes, rendered in monospace, contains one char:

let _: [char; 4] = [
    '',
    'Ｗ',
    'W',
    '﷽',
];

Considering these motivating examples, we're going to reject the proposal and recommend using more specialized APIs instead. Chunking over chars would appear to be a deceptively simple tool, but in many cases it would lead users down the wrong path.

And a small note: "but it's unstable" is not an argument in favor of a new API, since the new API would also start as unstable.

tkr-sh added the T-libs-api and api-change-proposal labels May 25, 2025

tkr-sh mentioned this issue May 25, 2025

RFC: New method for str: str::chunks(usize) rust-lang/rfcs#3818

Closed

tkr-sh changed the title ~~str::chunks with chunks being &str~~ ACP: str::chunks with chunks being &str May 25, 2025

the8472 closed this as not planned May 27, 2025
