unicode characters result in incorrect start/end values #7508

danielroe · 2024-11-27T11:48:20Z

There's an issue with oxc-parser, where it generates incorrect start/end values when the source contains non-ASCII characters:

import Parser from 'oxc-parser'

const code = `
const a = 'œœ'
callOnce(a)
`

const ast = Parser.parseSync(code, { sourceType: 'module', sourceFilename: 'test/nuxt/composables.test.ts' })
const secondStatement = ast.program.body[1]
if (secondStatement && secondStatement.type === 'ExpressionStatement') {
  console.log(secondStatement.expression)
  console.log(code.slice(secondStatement.expression.start, secondStatement.expression.end))
  // expected "callOnce(a)", but this prints "llOnce(a)"
}

maybe related: #7484
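
The size of the shift can be reproduced without the parser. This is only a sketch, assuming the reported offsets are UTF-8 byte offsets (which is what the truncated output suggests); `utf8ByteLength` is a hypothetical helper, not part of oxc-parser:

```typescript
// Hypothetical helper: number of UTF-8 bytes needed to encode a JavaScript string.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return bytes;
}

const code = "\nconst a = 'œœ'\ncallOnce(a)\n";

// Index of the call expression as JavaScript sees it (UTF-16 code units).
const utf16Start = code.indexOf("callOnce");

// The same position measured in UTF-8 bytes: each "œ" takes 2 bytes.
const byteStart = utf8ByteLength(code.slice(0, utf16Start));

// Two "œ" characters, one extra byte each, so the offsets differ by 2.
const shift = byteStart - utf16Start;
```

Passing `byteStart` to `code.slice` skips the first two characters of the call, which matches the `llOnce(a)` output above.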


Boshen · 2024-11-27T11:54:32Z

The root cause is that Rust strings are UTF-8, so the spans are byte offsets, while JavaScript strings are indexed by UTF-16 code units.

It seems these spans are being used for magic-string manipulations. Let me investigate whether we can do this directly on the Rust side.
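
In the meantime, one possible consumer-side workaround is to translate each byte offset into a UTF-16 index before slicing. A rough sketch, assuming the spans are UTF-8 byte offsets into the same source text; `byteOffsetToUtf16Index` is a hypothetical name, not an oxc-parser API:

```typescript
// Hypothetical helper: map a UTF-8 byte offset to the matching UTF-16 index.
// An offset that falls inside a multi-byte character rounds up to the next
// character boundary; an offset past the end returns source.length.
function byteOffsetToUtf16Index(source: string, byteOffset: number): number {
  let bytes = 0;
  let index = 0;
  for (const ch of source) {
    if (bytes >= byteOffset) break;
    const cp = ch.codePointAt(0)!;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    index += ch.length; // 1 UTF-16 unit, or 2 for code points above U+FFFF
  }
  return index;
}

const source = "\nconst a = 'œœ'\ncallOnce(a)\n";

// Byte span of `callOnce(a)` in this source, written out by hand here.
const span = { start: 18, end: 29 };

const text = source.slice(
  byteOffsetToUtf16Index(source, span.start),
  byteOffsetToUtf16Index(source, span.end),
);
```

Each call walks the source from the beginning, so this is only reasonable for a handful of lookups.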

Boshen · 2024-11-27T11:58:40Z

We also need a getter, exposed from the Rust side to Node.js, for accessing the source text by these spans.
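
To make the idea concrete, here is how such a getter might behave if it were emulated in JavaScript. This is a sketch under the assumption that spans are UTF-8 byte offsets; the `Utf8SpanSource` class and its `slice` method are hypothetical and do not reflect any actual oxc-parser API:

```typescript
// Hypothetical wrapper: resolves UTF-8 byte spans against a JavaScript string
// using a lookup table built once per source.
class Utf8SpanSource {
  private readonly text: string;
  // byteToIndex[b] is the UTF-16 index of the character that contains byte b.
  private readonly byteToIndex: number[] = [];

  constructor(text: string) {
    this.text = text;
    let index = 0;
    for (const ch of text) {
      const cp = ch.codePointAt(0)!;
      const width = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
      // Every byte of a multi-byte character maps to that character's start.
      for (let i = 0; i < width; i++) this.byteToIndex.push(index);
      index += ch.length;
    }
    this.byteToIndex.push(index); // sentinel for the end of the text
  }

  // Return the source text covered by the byte span [start, end).
  // Offsets are not validated in this sketch.
  slice(start: number, end: number): string {
    return this.text.slice(this.byteToIndex[start], this.byteToIndex[end]);
  }
}

const sourceText = new Utf8SpanSource("\nconst a = 'œœ'\ncallOnce(a)\n");
const callText = sourceText.slice(18, 29);
```

Building the table costs one pass over the source, after which every span lookup is constant time, which suits tools that resolve many spans from the same file.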

pumano · 2024-11-29T18:32:44Z

I ran into this problem while implementing the eslint/id-length rule (currently in development).

@Boshen maybe my experience helps here:
When the characters are Unicode graphemes, they can be counted properly with the unicode-segmentation library: https://docs.rs/unicode-segmentation/latest/unicode_segmentation/struct.Graphemes.html

I created a function for that case:

use unicode_segmentation::UnicodeSegmentation;

fn count_graphemes(s: &str) -> usize {
    // Fast path: for ASCII text, fall back to the byte length.
    if s.is_ascii() {
        return s.len();
    }
    // Otherwise count the extended grapheme clusters.
    s.graphemes(true).count()
}

That counts characters that are Unicode graphemes correctly, which helps when setting the proper span.
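
For comparison on the JavaScript side, grapheme counting is available through the built-in `Intl.Segmenter`. The sketch below is illustrative only (`countGraphemes` is a hypothetical helper) and shows that a grapheme count and a UTF-16 length can disagree, so a grapheme count suits length checks such as id-length but is not by itself an offset that `String.prototype.slice` understands:

```typescript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text: string): number {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// "é" written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
const sample = "e\u0301";

const graphemes = countGraphemes(sample); // one user-perceived character
const utf16Units = sample.length; // two UTF-16 code units
```

For the `'œœ'` string from the report, the grapheme count and the UTF-16 length are both 2, while the UTF-8 byte length is 4.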

danielroe added the C-bug Category - Bug label Nov 27, 2024

Boshen self-assigned this Nov 27, 2024

danielroe mentioned this issue Nov 27, 2024

feat(nuxt): use oxc-parser instead of esbuild + acorn nuxt/nuxt#30066

Draft

1 task

yuyinws mentioned this issue Dec 4, 2024

feat: oxc parser unplugin/unplugin-turbo-console#52

Open
