ACP: implement char_slice for &str #481

Open
tisonkun opened this issue Nov 12, 2024 · 7 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@tisonkun

Proposal

Problem statement

Although &str has chars and char_indices, obtaining a substring by char positions is still verbose.

Motivating examples or use cases

fn string_slice(arg: &str, start: usize, len: usize) -> &str {
    let char_len = arg.chars().count();
    let end = start.saturating_add(len);
    let start = start.clamp(0, char_len);
    let end = end.clamp(0, char_len);

    // since len >= 0, start <= end
    if start < end {
        let str_len = arg.len();
        let mut indices = arg.char_indices().map(|(i, _)| i);
        let lo = indices.nth(start).unwrap_or(str_len);
        let hi = indices.nth(end - start - 1).unwrap_or(str_len);
        &arg[lo..hi]
    } else {
        ""
    }
}

Solution sketch

Implement a char_slice method for &str:

impl str {
  pub fn char_slice(&self, begin: usize, end: usize) -> &str { ... }
  // or accept `RangeBounds`, wrap a new type, etc.
}

See also Alternatives below for other possible APIs. I think we can discuss the implementation details on a PR once we agree on the overall direction.
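A minimal sketch of the `RangeBounds` variant mentioned in the comment above, written as a free function so it compiles on stable. The clamping of out-of-range positions is an assumption made here for illustration; the proposal does not say whether such positions should clamp or panic.

```rust
use std::ops::{Bound, RangeBounds};

/// Hypothetical free-function form of the proposed `str::char_slice`.
/// Positions past the end are clamped to the end of the string (an assumption).
fn char_slice<R: RangeBounds<usize>>(s: &str, range: R) -> &str {
    // Translate the range bounds into a half-open [start, end) pair of char positions.
    let start = match range.start_bound() {
        Bound::Included(&n) => n,
        Bound::Excluded(&n) => n.saturating_add(1),
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&n) => n.saturating_add(1),
        Bound::Excluded(&n) => n,
        Bound::Unbounded => usize::MAX,
    };
    if start >= end {
        return "";
    }
    // Map a char position to a byte offset; positions past the end map to `s.len()`.
    let byte_at = |pos: usize| s.char_indices().nth(pos).map_or(s.len(), |(i, _)| i);
    let lo = byte_at(start);
    let hi = byte_at(end);
    &s[lo..hi]
}
```

With this sketch, `char_slice("零1二3四5六7八9十", 2..7)` yields `"二3四5六"`. It walks the string twice for simplicity; a real implementation would likely do a single pass.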

Alternatives

It may be more intuitive to use the slice syntax "my lovely string"[lo..hi], but that is already taken by byte-level slicing.

It would also be possible to add a wrapper type like struct CharStr<'a>(&'a str) and implement the slice syntax on the new type, but I'm not sure whether that is idiomatic Rust.
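For illustration, the wrapper alternative could look roughly like the sketch below. Only `Range<usize>` is implemented, and clamping out-of-range positions is an assumption made here, not something the proposal specifies.

```rust
use std::ops::{Index, Range};

/// Hypothetical wrapper whose `[lo..hi]` indexes by char position instead of bytes.
struct CharStr<'a>(&'a str);

impl<'a> Index<Range<usize>> for CharStr<'a> {
    type Output = str;

    fn index(&self, range: Range<usize>) -> &str {
        let s = self.0;
        // Byte offset of the char at `pos`, or `s.len()` if `pos` is at or past the end.
        let byte_at = |pos: usize| s.char_indices().nth(pos).map_or(s.len(), |(i, _)| i);
        let lo = byte_at(range.start);
        // An inverted range yields an empty slice instead of panicking (an assumption).
        let hi = byte_at(range.end).max(lo);
        &s[lo..hi]
    }
}
```

Given `let w = CharStr("零1二3四5六7八9十");`, the expression `&w[2..7]` evaluates to `"二3四5六"`. One limitation of this design is that `Index::index` ties the returned slice to the borrow of the wrapper rather than to `'a`, so the result cannot outlive `w` even though the underlying string does.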

There are also third-party crates like stringslice, but IMO it's a bit over-generic and less maintainable. Given that this is an essential part of string manipulation, perhaps we can add it to std.

Links and related work

@tisonkun tisonkun added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Nov 12, 2024
@programmerjake
Member

programmerjake commented Nov 12, 2024

I would argue that if you're trying to slice by indexes of chars, you're designing your code wrong, so this method is intentionally left out. You should strongly consider using byte indexes instead, or something like splitting into grapheme clusters, which are what you should actually be using if you're trying to split text into user-visible characters. Even then there are caveats around special characters like "ﬃ" (U+FB03, which is only 1 char) and font rendering oddities. Unicode text is really complicated!
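To make the concern concrete, here is a small std-only sketch. The helper name `take_chars` is made up for this example. It shows that cutting at a char position can separate a base letter from its combining mark, and that a single char can render as several letters.

```rust
/// Hypothetical helper: the prefix of `s` made of its first `n` chars
/// (Unicode scalar values), or all of `s` if it has fewer than `n`.
fn take_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // `byte_idx` is where the char at position `n` starts, so cut right before it.
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}
```

For `"e\u{301}tude"` ("étude" written with a combining acute accent), `take_chars(s, 1)` returns a bare `"e"` and leaves the accent behind, even though a reader sees "é" as one character. Conversely, the ligature `"\u{FB03}"` is a single char that displays as three letters.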

@tisonkun
Author

tisonkun commented Nov 12, 2024

@programmerjake I'm writing a database to provide string functions. Database users typically slice strings by indices of chars (e.g., SUBSTR).

I know that a Unicode character is not always "one intuitive human-readable character," and that this may be why such a function lives in third-party crates rather than in std (the current state). So I opened this issue to hear feedback from other users.

splitting into grapheme clusters

Are there some references or implementations to refer to?

@kennytm
Member

kennytm commented Nov 12, 2024

#![feature(iter_advance_by)]

fn char_slice(a: &str, begin: usize, end: usize) -> &str {
    let mut chars = a.chars();
    chars.advance_by(begin).expect("begin index in range");
    let slice_1 = chars.as_str();
    chars.advance_by(end - begin).expect("end index in range");
    let slice_2 = chars.as_str();
    &slice_1[..slice_1.len() - slice_2.len()]
}

fn main() {
    assert_eq!(char_slice("零1二3四5六7八9十", 2, 7), "二3四5六");
}

@programmerjake
Member

@programmerjake I'm writing a database to provide string functions. Database users typically slice strings by indices of chars (e.g., SUBSTR).

ok, so SQL was mis-designed (though in their defense they probably designed it back when everyone thought unicode characters were the one true character, like Win32's wchar_t).

splitting into grapheme clusters

Are there some references or implementations to refer to?

reference: https://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
the most commonly used implementation:
https://crates.io/crates/unicode-segmentation

@scottmcm
Member

#![feature(iter_advance_by)]

A stable version that almost works:

fn char_slice(a: &str, begin: usize, end: usize) -> &str {
    let (begin, _) = a.char_indices().nth(begin).unwrap();
    let (end, _) = a.char_indices().nth(end).unwrap();
    &a[begin..end]
}

fn main() {
    assert_eq!(char_slice("零1二3四5六7八9十", 2, 7), "二3四5六");
}

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3206517e3e563eec98d8ad153ba4d402

Though that really makes me want a .char_fenceposts() -> impl Iterator<Item = usize> method, so it wouldn't have the end-of-string gotcha.
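A rough sketch of what such a method might look like, written as a free function. The name `char_fenceposts` comes from the comment above, but this body is only one guess at its behavior: it yields the byte offset of every char boundary, including the one at the end of the string.

```rust
/// Hypothetical helper: byte offsets of every char boundary in `s`,
/// including `s.len()`, so a string of n chars yields n + 1 offsets.
fn char_fenceposts(s: &str) -> impl Iterator<Item = usize> + '_ {
    s.char_indices()
        .map(|(i, _)| i)
        .chain(std::iter::once(s.len()))
}

/// Same shape as the stable version above, but `end` may equal the number of chars.
fn char_slice(s: &str, begin: usize, end: usize) -> &str {
    let lo = char_fenceposts(s).nth(begin).expect("begin index in range");
    let hi = char_fenceposts(s).nth(end).expect("end index in range");
    &s[lo..hi]
}
```

With the extra fencepost, `char_slice("零1二3四5六7八9十", 9, 11)` returns `"9十"` instead of panicking on the `unwrap` at the end of the string.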

@programmerjake
Member

a stable version that should work:

fn remove_leading_chars(s: &str, count: usize) -> &str {
    count.checked_sub(1).map_or(s, |n| {
        let mut chars = s.chars();
        chars.nth(n).expect("index out of range");
        chars.as_str()
    })
}

fn char_slice(s: &str, begin: usize, end: usize) -> &str {
    let slice_1 = remove_leading_chars(s, begin);
    let slice_2 = remove_leading_chars(slice_1, end - begin);
    &slice_1[..slice_1.len() - slice_2.len()]
}

Though that really makes me want a .char_fenceposts() -> impl Iterator<Item = usize> method, so it wouldn't have the end-of-string gotcha.

imo we need to stabilize Iterator::advance_by, that way you're not tempted to add new iterator types merely to work around all the nice Iterator methods being unstable.

@scottmcm
Member

imo we need to stabilize Iterator::advance_by, that way you're not tempted to add new iterator types merely to work around all the nice Iterator methods being unstable.

While I would like to see that stabilized, I don't think that's the reason I'm wanting it. The problem is the existing indices ones are asymmetric -- .char_indices().nth(0).0 does something very different from .char_indices().rev().nth(0).0.

It's like we allowed my_slice.split_at(0) but not my_slice.split_at(my_slice.len()). When there are n USVs, there are n+1 fenceposts, and we should have n+1-length iterators for that too. I guess that exists as .match_indices(""), but without all the pattern-searching overhead.

(Not this issue's problem, though.)
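The asymmetry and the `.match_indices("")` observation can be checked with a short std-only sketch. Both function names are invented for this example.

```rust
/// Byte offsets of all char boundaries, obtained by searching for the empty pattern.
/// A string of n chars yields n + 1 offsets, the last one being `s.len()`.
fn fenceposts_via_match(s: &str) -> Vec<usize> {
    s.match_indices("").map(|(i, _)| i).collect()
}

/// The index reported first when iterating `char_indices` from the front,
/// paired with the index reported first when iterating from the back.
/// Returns `None` for an empty string.
fn front_and_back_index(s: &str) -> Option<(usize, usize)> {
    let front = s.char_indices().nth(0)?.0;
    let back = s.char_indices().rev().nth(0)?.0;
    Some((front, back))
}
```

For `"a\u{e9}b"` (4 bytes, 3 chars), `fenceposts_via_match` gives `[0, 1, 3, 4]`, while `front_and_back_index` gives `Some((0, 3))`: the front iterator starts at the first fencepost, but the back iterator starts at the beginning of the last char, not at `s.len()`.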


4 participants