ACP: implement `char_slice` for `&str` #481

tisonkun · 2024-11-12T07:41:58Z

Proposal

Problem statement

Although &str has chars and char_indices, obtaining a substring by char positions is still wordy.

Motivating examples or use cases

fn string_slice(arg: &str, start: usize, len: usize) -> &str {
    let char_len = arg.chars().count();
    let end = start.saturating_add(len);
    let start = start.clamp(0, char_len);
    let end = end.clamp(0, char_len);

    // since len >= 0, start <= end
    if start < end {
        let str_len = arg.len();
        let mut indices = arg.char_indices().map(|(i, _)| i);
        let lo = indices.nth(start).unwrap_or(str_len);
        let hi = indices.nth(end - start - 1).unwrap_or(str_len);
        &arg[lo..hi]
    } else {
        ""
    }
}

Solution sketch

Implement a char_slice method for &str:

impl str {
  pub fn char_slice(&self, begin: usize, end: usize) -> &str { ... }
  // or accept `RangeBounds`, wrap a new type, etc.
}
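As a rough illustration of the `RangeBounds` variant mentioned in the comment above, here is a minimal sketch written as a free function rather than a method on `str`. The names `char_boundary` and `char_slice`, and the choice to return `Option` instead of panicking, are assumptions made for this example and not part of the proposal.

```rust
use std::ops::{Bound, RangeBounds};

/// Hypothetical helper: byte offset of the `n`-th char boundary.
/// `n == s.chars().count()` maps to `s.len()`; anything larger yields `None`.
fn char_boundary(s: &str, n: usize) -> Option<usize> {
    s.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
        .nth(n)
}

/// Hypothetical free-function form of the proposed `char_slice`.
/// Returns `None` when the range is out of bounds or inverted.
fn char_slice<R: RangeBounds<usize>>(s: &str, range: R) -> Option<&str> {
    let begin = match range.start_bound() {
        Bound::Included(&n) => n,
        Bound::Excluded(&n) => n.checked_add(1)?,
        Bound::Unbounded => 0,
    };
    let lo = char_boundary(s, begin)?;
    let hi = match range.end_bound() {
        Bound::Included(&n) => char_boundary(s, n.checked_add(1)?)?,
        Bound::Excluded(&n) => char_boundary(s, n)?,
        Bound::Unbounded => s.len(),
    };
    // `get` also rejects inverted ranges (lo > hi).
    s.get(lo..hi)
}
```

With this shape, `char_slice("abc", ..=1)` would give `Some("ab")` and `char_slice("abc", 2..5)` would give `None`. Each call walks the string from the start, so the cost is linear in the string length.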

See also Alternatives below for other possible APIs. I think we can discuss the implementation details on a PR once we agree on the overall direction.

Alternatives

It may be more intuitive to use the slice syntax "my lovely string"[lo..hi], but that syntax is already taken by byte-level slicing.

It would also be possible to add a wrapper type like struct CharStr<'a>(&'a str) and implement the slice syntax on the new type, but I'm not sure whether that is idiomatic Rust.

There are also third-party crates such as stringslice, but IMO it is a bit overly generic and less maintainable. Given that this is an essential part of string manipulation, perhaps we can add it to std.

Links and related work

https://crates.io/crates/stringslice


programmerjake · 2024-11-12T09:22:41Z

I would argue that if you're trying to slice by indexes of chars, you're designing your code wrong, so this method is intentionally left out. You should strongly consider using byte indexes instead, or use something like splitting into grapheme clusters, which are what you should actually be using if you're trying to split text into user-visible characters. Even then there are caveats surrounding special characters like "ﬃ" (which is only 1 char) and font rendering oddities. Unicode text is really complicated!
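To make the distinction concrete, here is a small std-only sketch showing that a `char` (a Unicode scalar value) need not match a user-visible character. The helper names `byte_and_char_counts` and `first_char` are made up for this illustration.

```rust
/// Number of bytes and number of `char`s (Unicode scalar values) in `s`.
fn byte_and_char_counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

/// The prefix of `s` holding exactly its first `char`
/// (or all of `s` if it has at most one).
fn first_char(s: &str) -> &str {
    match s.char_indices().nth(1) {
        Some((i, _)) => &s[..i],
        None => s,
    }
}
```

The precomposed "\u{e9}" (é) is 2 bytes and 1 char, while the decomposed "e\u{301}" renders the same but is 3 bytes and 2 chars, so `first_char("e\u{301}")` returns "e" and silently drops the accent. The ligature "\u{fb03}" (ﬃ) goes the other way: 3 bytes, 1 char, three visible letters.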

tisonkun · 2024-11-12T09:38:06Z

@programmerjake I'm writing a database to provide string functions. Database users typically slice strings by indices of chars (e.g., SUBSTR).

I know that a Unicode character is not always "one intuitive human-readable character," which may be why this lives in third-party crates rather than in std (the current state). So I opened this issue to gather feedback from other users.

> splitting into grapheme clusters

Are there some references or implementations to refer to?

kennytm · 2024-11-12T10:52:21Z

#![feature(iter_advance_by)]

fn char_slice(a: &str, begin: usize, end: usize) -> &str {
    let mut chars = a.chars();
    chars.advance_by(begin).expect("begin index in range");
    let slice_1 = chars.as_str();
    chars.advance_by(end - begin).expect("end index in range");
    let slice_2 = chars.as_str();
    &slice_1[..slice_1.len() - slice_2.len()]
}

fn main() {
    assert_eq!(char_slice("零1二3四5六7八9十", 2, 7), "二3四5六");
}

programmerjake · 2024-11-12T10:55:41Z

> @programmerjake I'm writing a database to provide string functions. Database users typically slice strings by indices of chars (e.g., SUBSTR).

ok, so SQL was mis-designed (though in their defense they probably designed it back when everyone thought unicode characters were the one true character, like Win32's wchar_t).

> splitting into grapheme clusters
>
> Are there some references or implementations to refer to?

reference: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
the most commonly used implementation:
https://crates.io/crates/unicode-segmentation

scottmcm · 2024-11-12T23:17:39Z

> #![feature(iter_advance_by)]

A stable version that almost works:

fn char_slice(a: &str, begin: usize, end: usize) -> &str {
    let (begin, _) = a.char_indices().nth(begin).unwrap();
    let (end, _) = a.char_indices().nth(end).unwrap();
    &a[begin..end]
}

fn main() {
    assert_eq!(char_slice("零1二3四5六7八9十", 2, 7), "二3四5六");
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3206517e3e563eec98d8ad153ba4d402

Though that really makes me want a .char_fenceposts() -> impl Iterator<Item = usize> method, so it wouldn't have the end-of-string gotcha.
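One way to picture the fencepost idea on stable Rust is the sketch below. `char_fenceposts` here is a hypothetical free function standing in for the wished-for method, built by chaining `s.len()` onto the `char_indices` offsets. It is an assumption about what such an iterator could look like, not an existing std API.

```rust
/// Hypothetical stand-in for the `.char_fenceposts()` idea: yields the byte
/// offset of every char boundary, including the final one at `s.len()`.
fn char_fenceposts(s: &str) -> impl Iterator<Item = usize> + '_ {
    s.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
}

/// Same shape as the snippet above, but `end` may equal the number of chars.
fn char_slice(s: &str, begin: usize, end: usize) -> &str {
    let lo = char_fenceposts(s).nth(begin).expect("begin index in range");
    let hi = char_fenceposts(s).nth(end).expect("end index in range");
    &s[lo..hi]
}
```

With the extra fencepost, `char_slice("零1二3四5六7八9十", 9, 11)` returns "9十", whereas the `char_indices().nth(end).unwrap()` version panics when `end` equals the char count.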

programmerjake · 2024-11-13T04:59:25Z

a stable version that should work:

fn remove_leading_chars(s: &str, count: usize) -> &str {
    count.checked_sub(1).map_or(s, |n| {
        let mut chars = s.chars();
        chars.nth(n).expect("index out of range");
        chars.as_str()
    })
}

fn char_slice(s: &str, begin: usize, end: usize) -> &str {
    let slice_1 = remove_leading_chars(s, begin);
    let slice_2 = remove_leading_chars(slice_1, end - begin);
    &slice_1[..slice_1.len() - slice_2.len()]
}

> Though that really makes me want a .char_fenceposts() -> impl Iterator<Item = usize> method, so it wouldn't have the end-of-string gotcha.

imo we need to stabilize Iterator::advance_by, that way you're not tempted to add new iterator types merely to work around all the nice Iterator methods being unstable.

scottmcm · 2024-11-13T16:18:13Z

> imo we need to stabilize Iterator::advance_by, that way you're not tempted to add new iterator types merely to work around all the nice Iterator methods being unstable.

While I would like to see that stabilized, I don't think that's the reason I'm wanting it. The problem is the existing indices ones are asymmetric -- .char_indices().nth(0).0 does something very different from .char_indices().rev().nth(0).0.

It's like we allowed my_slice.split_at(0) but not my_slice.split_at(my_slice.len()). When there are n USVs, there are n+1 fenceposts, and we should have n+1-length iterators for that too. I guess that exists as .match_indices(""), but without all the pattern-searching overhead.
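The asymmetry and the `match_indices("")` observation can be checked with a short std-only sketch. The function names are invented for this example.

```rust
/// Byte offsets of all char boundaries, obtained via the empty-pattern trick.
fn fenceposts_via_match_indices(s: &str) -> Vec<usize> {
    s.match_indices("").map(|(i, _)| i).collect()
}

/// Byte index of the first char seen from the front and from the back.
/// Returns `None` for an empty string.
fn first_and_last_char_index(s: &str) -> Option<(usize, usize)> {
    let front = s.char_indices().next()?.0;
    let back = s.char_indices().next_back()?.0;
    Some((front, back))
}
```

For "零1二" (3 chars, 7 bytes), the empty-pattern search yields the four fenceposts 0, 3, 4, 7. `char_indices` gives 0 from the front but 4 from the back: the start of the last char rather than the end of the string.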

(Not this issue's problem, though.)

tisonkun added the api-change-proposal (A proposal to add or alter unstable APIs in the standard libraries) and T-libs-api labels · Nov 12, 2024
