unicode characters result in incorrect start/end values #7508

Open
danielroe opened this issue Nov 27, 2024 · 3 comments
danielroe commented Nov 27, 2024

There's an issue with oxc-parser where it generates incorrect start/end values when the source contains non-ASCII characters:

import Parser from 'oxc-parser'

const code = `
const a = 'œœ'
callOnce(a)
`

const ast = Parser.parseSync(code, { sourceType: 'module', sourceFilename: 'test/nuxt/composables.test.ts' })
const secondStatement = ast.program.body[1]
if (secondStatement && secondStatement.type === 'ExpressionStatement') {
  console.log(secondStatement.expression)
  console.log(code.slice(secondStatement.expression.start, secondStatement.expression.end))
  // logs `llOnce(a)` instead of the expected `callOnce(a)`
}

Possibly related: #7484

danielroe added the C-bug (Category - Bug) label Nov 27, 2024

Boshen (Member) commented Nov 27, 2024

The root cause is that Rust strings are UTF-8, so the spans are byte offsets, while JavaScript strings are indexed in UTF-16 code units.

It seems these spans are being used for magic-string manipulations. Let me investigate whether we can do this directly on the Rust side.
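
Until that happens, one possible JavaScript-side workaround is to translate byte offsets into string indices before slicing. This is only a sketch: `createByteToUtf16Mapper` is a hypothetical helper, not part of oxc-parser, and the byte offsets 18 and 29 are worked out by hand from the snippet in the report rather than taken from parser output.

```javascript
// Hypothetical helper (not an oxc-parser API): maps UTF-8 byte offsets
// to UTF-16 indices that String.prototype.slice understands.
function createByteToUtf16Mapper(source) {
  const map = [] // map[byteOffset] = UTF-16 index of the character containing that byte
  let utf16Index = 0
  for (const char of source) { // for...of iterates by code point
    const codePoint = char.codePointAt(0)
    // Number of bytes this code point occupies in UTF-8.
    const byteLength = codePoint < 0x80 ? 1 : codePoint < 0x800 ? 2 : codePoint < 0x10000 ? 3 : 4
    for (let i = 0; i < byteLength; i++) map.push(utf16Index)
    utf16Index += char.length // 1, or 2 for a surrogate pair
  }
  map.push(utf16Index) // entry for the offset one past the end
  return byteOffset => map[byteOffset]
}

const source = "\nconst a = 'œœ'\ncallOnce(a)\n"
const toUtf16 = createByteToUtf16Mapper(source)

// Assumed byte span of `callOnce(a)`: start 18, end 29.
console.log(source.slice(18, 29).trim())                 // llOnce(a)
console.log(source.slice(toUtf16(18), toUtf16(29)))      // callOnce(a)
```

The table costs one entry per source byte, so for large files a typed array or a sparse list of the non-ASCII positions would likely be a better fit.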

Boshen self-assigned this Nov 27, 2024

Boshen (Member) commented Nov 27, 2024

We also need a getter, exposed from the Rust side to Node.js, for accessing the source text by these spans.
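
For illustration, such a getter could be approximated in plain JavaScript by slicing the UTF-8 bytes instead of the string. This is a minimal sketch, assuming spans are UTF-8 byte offsets; `createSourceTextGetter` is a hypothetical name, not an existing oxc-parser export, and the offsets used are hand-computed from the reproduction above.

```javascript
// Hypothetical helper (not an oxc-parser API): returns a function that
// reads source text by UTF-8 byte span.
function createSourceTextGetter(source) {
  const bytes = new TextEncoder().encode(source) // encode to UTF-8 once
  const decoder = new TextDecoder() // defaults to UTF-8
  // subarray() creates a view, so no bytes are copied per lookup.
  return (start, end) => decoder.decode(bytes.subarray(start, end))
}

const getSourceText = createSourceTextGetter("\nconst a = 'œœ'\ncallOnce(a)\n")

// Assumed byte span of `callOnce(a)`: start 18, end 29.
console.log(getSourceText(18, 29)) // callOnce(a)
```

This only helps with reading text. Anything that edits the string by index, such as magic-string, would still need offsets converted to UTF-16 indices.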

pumano (Contributor) commented Nov 29, 2024

I ran into this problem while implementing the eslint/id-length rule (currently in development).

@Boshen maybe my experience helps here:
when the characters are Unicode graphemes, they can be counted properly with the unicode-segmentation library: https://docs.rs/unicode-segmentation/latest/unicode_segmentation/struct.Graphemes.html

I wrote a helper function for that case:

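use unicode_segmentation::UnicodeSegmentation;

/// Counts the extended grapheme clusters (user-perceived characters) in `s`.
fn count_graphemes(s: &str) -> usize {
    // Fast path: in ASCII text each byte is one character
    // (only a CRLF pair would be segmented differently).
    if s.is_ascii() {
        return s.len();
    }
    // `count()` keeps repeated graphemes; collecting into a set would
    // count only the distinct ones.
    s.graphemes(true).count()
}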

That properly counts characters that are Unicode graphemes, and may help you set the proper span.
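
As a rough cross-check, the three ways of measuring a string can be compared in JavaScript (used here instead of Rust to match the reproduction). The sketch assumes a runtime that provides `Intl.Segmenter`; `measure` is a hypothetical helper name.

```javascript
// Hypothetical helper: reports the length of `text` in three different units.
function measure(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' })
  return {
    utf8Bytes: new TextEncoder().encode(text).length, // unit of a byte-offset span
    utf16Units: text.length,                          // unit of String.prototype.slice
    graphemes: [...segmenter.segment(text)].length,   // user-perceived characters
  }
}

console.log(measure('œœ'))      // { utf8Bytes: 4, utf16Units: 2, graphemes: 2 }
console.log(measure('e\u0301')) // e + combining acute: { utf8Bytes: 3, utf16Units: 2, graphemes: 1 }
```

The second example shows that grapheme counts and UTF-16 indices can disagree, so a grapheme count suits a length check like id-length, while offsets passed to `slice` need UTF-16 code units.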
