Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement unicode safe api for String #1471

Merged
merged 2 commits into from
Jan 20, 2025
Merged

Conversation

Yu-zh
Copy link
Collaborator

@Yu-zh Yu-zh commented Jan 13, 2025

Part of #1452

This PR adds the following APIs for String in builtin for general purpose usage:

charcode_at(String, Int) -> Int
charcode_length(String) -> Int
codepoint_at(String, Int) -> Char
codepoint_length(String) -> Int
unsafe_charcode_at(String, Int) -> Int // charcode_at without bound checking

And adds the following APIs for implementing pattern matching on string:

length_eq(String, Int) -> Bool
length_ge(String, Int) -> Bool
rev_get(String, Int) -> Char

The next step is to change the behavior of op_get and length to codepoint_at and codepoint_length, and fix the various misuses.

One subtlety is that the codepoint_at here has different behavior that code_point_at in JavaScript. Here the index is the count of code points, whereas for JavaScript, the index is the count of utf16 code units. @bobzhang

Copy link

peter-jerry-ye-code-review bot commented Jan 13, 2025

‼️ This code review is generated by a bot. Please verify the content before trusting it.

Based on the git diff provided, here are three potential issues I've identified:

  1. Inconsistent String Length Behavior: There are two different length-related functions (charcode_length and codepoint_length) which might cause confusion. The commented-out code suggests there's a plan to change the default length() method to use codepoint_length, but this could break existing code that expects UTF-16 code unit counts. This inconsistency could lead to bugs in code that assumes a specific length behavior.

  2. Potential Performance Issue: In the length_eq and length_ge functions, there's an O(n) complexity for what could potentially be an O(1) operation if using raw UTF-16 lengths. While this provides more accurate Unicode character counts, it might cause unexpected performance issues in code that frequently checks string lengths.

  3. Error Handling Inconsistency: The code uses abort() for error handling in some places (like in codepoint_at and rev_get), but the error messages are inconsistent. For example:

    abort("index out of bounds")  // in some places
    abort("Index out of bounds: cannot access index \{index}")  // in others

    This inconsistency in error messages could make debugging more difficult.

These issues could impact code maintainability and reliability if not addressed.

@coveralls
Copy link
Collaborator

coveralls commented Jan 13, 2025

Pull Request Test Coverage Report for Build 4853

Details

  • 71 of 101 (70.3%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 82.944%

Changes Missing Coverage Covered Lines Changed/Added Lines %
string/view.mbt 20 23 86.96%
string/string.mbt 36 45 80.0%
builtin/string.mbt 12 30 40.0%
Totals Coverage Status
Change from base Build 4852: -0.3%
Covered Lines: 5048
Relevant Lines: 6086

💛 - Coveralls

@Yu-zh Yu-zh force-pushed the string-api branch 2 times, most recently from be66555 to 638c24c Compare January 14, 2025 02:14
@Yu-zh Yu-zh requested a review from bobzhang January 14, 2025 02:26
@Yu-zh Yu-zh force-pushed the string-api branch 4 times, most recently from f5e8f43 to 8ed8016 Compare January 20, 2025 10:57
@bobzhang bobzhang enabled auto-merge (rebase) January 20, 2025 13:42
@bobzhang bobzhang merged commit fbba555 into moonbitlang:main Jan 20, 2025
14 checks passed
@Yu-zh Yu-zh deleted the string-api branch January 20, 2025 14:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants