src: add a method to get string encoding info #56147

theweipeng · 2024-12-05T16:31:30Z

util: Add a method to get the encoding information of a string.

Currently, callers need to know how a string is represented before choosing between buffer.write("string", 'latin1') and buffer.write("string", 'utf16le') to write it to a buffer.
This PR adds a method to return the encoding information from V8.

Closes: #56090
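
A minimal sketch of the use case described above, not the PR's implementation. It assumes the method ends up exposed as v8.isStringOneByteRepresentation (the name used later in this thread). The fallback scan, the helper writeTagged, and the tag values 1 and 2 are hypothetical names introduced only for illustration; the fallback merely approximates what V8 reports about its internal storage.

```javascript
const v8 = require('node:v8');

// Use the proposed method when the runtime has it. Otherwise fall back to a
// hypothetical char-code scan (an approximation, not V8's actual answer).
const isOneByte = typeof v8.isStringOneByteRepresentation === 'function'
  ? v8.isStringOneByteRepresentation
  : (str) => {
      for (let i = 0; i < str.length; i++) {
        if (str.charCodeAt(i) > 0xff) return false;
      }
      return true;
    };

// Hypothetical helper: write a 1-byte tag, then the string payload.
// Returns the total number of bytes written.
function writeTagged(buf, str, offset = 0) {
  if (isOneByte(str)) {
    buf.writeUInt8(1, offset); // tag 1: payload is latin1
    return 1 + buf.write(str, offset + 1, 'latin1');
  }
  buf.writeUInt8(2, offset); // tag 2: payload is utf16le
  return 1 + buf.write(str, offset + 1, 'utf16le');
}

const buf = Buffer.alloc(64);
const asciiBytes = writeTagged(buf, 'hello');
const emojiBytes = writeTagged(buf, '😁'); // overwrites the first record
console.log(asciiBytes, emojiBytes);
```

A reader of such a record would have to check the tag before decoding, because the same string may be written either way depending on how the engine happens to store it.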

codecov · 2024-12-05T22:42:14Z

Codecov Report

Attention: Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.

Project coverage is 88.54%. Comparing base (3c2da4b) to head (bd086ce).
Report is 171 commits behind head on main.

Files with missing lines	Patch %	Lines
src/node_v8.cc	81.25%	0 Missing and 3 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #56147      +/-   ##
==========================================
+ Coverage   87.99%   88.54%   +0.55%     
==========================================
  Files         656      657       +1     
  Lines      188999   190794    +1795     
  Branches    35981    36611     +630     
==========================================
+ Hits       166301   168933    +2632     
+ Misses      15865    15044     -821     
+ Partials     6833     6817      -16

Files with missing lines	Coverage Δ
lib/v8.js	`99.34% <100.00%> (+0.70%)`	⬆️
src/node_external_reference.h	`100.00% <ø> (ø)`
src/node_v8.cc	`90.67% <81.25%> (-0.52%)`	⬇️

... and 138 files with indirect coverage changes

theweipeng · 2024-12-17T07:27:56Z

I've been waiting for feedback on this PR. Could someone take a look? I'd appreciate any feedback you can provide. Thanks! @nodejs-github-bot

nodejs-github-bot · 2024-12-17T10:21:08Z

CI: https://ci.nodejs.org/job/node-test-pull-request/64079/

addaleax

If you want to add an API like this, keep in mind that this

- does not return any information about the string itself, which does not have an encoding per se, only its underlying representation in the JS engine
- does not provide reliable output

You'll probably want to rename it and modify the documentation as needed.

It's hard to see an actual use case for this API, though. Getting this information can be useful when dealing with JS strings in C++, but there this information is already directly available.

addaleax · 2024-12-17T20:01:19Z

Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.

So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.
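
To illustrate the endianness remark, here is a small sketch under stated assumptions: Buffer's 'utf16le' encoding always emits little-endian code units, while 16-bit values held in memory follow the host's byte order. The typed-array probe below is only an illustration of host byte order, not how V8 is inspected.

```javascript
const os = require('node:os');

// 'utf16le' always produces little-endian code units: U+0041 -> 41 00.
const bytes = Buffer.from('A', 'utf16le');

// Host byte order as observed through a typed array view.
const probe = new Uint8Array(new Uint16Array([0x0041]).buffer);
const nativeIsLE = probe[0] === 0x41;

console.log(bytes, os.endianness(), nativeIsLE);
```

On a big-endian host the probe would report false while the Buffer bytes stay the same, which is why a name that mentions utf16le would not describe the in-memory representation everywhere.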

theweipeng · 2024-12-18T02:51:53Z

> Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.
>
> So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.

Understood, I'll correct the API placement. Thanks for pointing it out.

addaleax

Removing the 'changes requested' marker but this should have better documentation and, ideally, a use case that explains why we're adding this to the API

theweipeng · 2024-12-26T06:59:14Z

> Removing the 'changes requested' marker but this should have better documentation and, ideally, a use case that explains why we're adding this to the API

Could you please review the document changes and trigger the code CI?

nodejs-github-bot · 2024-12-26T16:53:02Z

CI: https://ci.nodejs.org/job/node-test-pull-request/64224/

theweipeng · 2024-12-27T01:48:01Z

Looks like a flaky test. Needs a re-run I think. @nodejs-github-bot

nodejs-github-bot · 2024-12-27T03:21:18Z

CI: https://ci.nodejs.org/job/node-test-pull-request/64228/

addaleax

and your review has been addressed

@theweipeng Yeah, the code looks good here. I've left two minor notes on the documentation, but definitely nothing blocking.

However, I've suggested above twice to provide a use case for this method -- i.e. a reason or example that explains why this should be part of the Node.js API. I think it's okay to consider that a requirement for merging a PR.

theweipeng · 2024-12-28T11:30:48Z

@nodejs-github-bot Needs a re-run please, I have corrected the documentation.

nodejs-github-bot · 2024-12-29T16:19:30Z

CI: https://ci.nodejs.org/job/node-test-pull-request/64253/

jasnell

I share the concerns that @addaleax and @joyeecheung have raised: this is specific to V8's current internal representation, and it could come back to bite us later if V8 decides to change that. Overall, though, the code and docs here LGTM.

I do worry that for smaller strings, the overhead of performing the check might outweigh the benefit. This almost borders on being too niche of a use case but overall I've got no reason to block. Approving based on the code changes looking good.

theweipeng · 2024-12-31T06:29:40Z

@addaleax I've updated the PR based on your feedback. Could you please take a moment to check if it's ready to merge or if there's anything else that needs attention?

targos

There is still no clear use case for this. The OP mentions ucs2Write and latin1WriteStatic but these are internal methods so cannot be used as arguments to add this method publicly.

theweipeng · 2024-12-31T07:59:43Z

> There is still no clear use case for this. The OP mentions ucs2Write and latin1WriteStatic but these are internal methods so cannot be used as arguments to add this method publicly.

Sorry, the OP was poorly worded. I did not intend to rely on these private APIs; the documentation uses buffer.write. I have corrected the OP.

targos · 2024-12-31T08:12:03Z

Can you please answer @addaleax's question? #56147 (review)

theweipeng · 2024-12-31T08:30:23Z

> Can you please answer @addaleax's question? #56147 (review)

Following her suggestion in the comment #56147 (comment), I have written an example in the documentation to explain why we need this API. https://github.com/nodejs/node/pull/56147/files#diff-fc79b2d1ad702cfaf107d5880b73e8360b36273edda73f128c00641637435c3cR1368

addaleax · 2025-01-03T00:45:44Z

@theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.

Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).

Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.

theweipeng · 2025-01-03T02:41:36Z

> @theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.
>
> Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).
>
> Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.

Thank you for your patient guidance; I've learned a lot from it. I will simplify the documentation. Regarding your test, may I take a look at your test code? The improvement is quite noticeable on my machine.
Here is my test code:

const { isStringOneByteRepresentation } = require("v8");

// Scratch buffer: large enough for every test string plus a 4-byte length prefix.
const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    // Baseline: always encode as UTF-8, prefixed with the byte length.
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        // writeUint32LE takes (value, offset), so the length goes first.
        bf.writeUint32LE(Buffer.byteLength(input, 'utf8'), 0);
        bf.write(input, 4, 'utf8');
    }
    console.timeEnd("before " + topic);

    // Candidate: pick latin1 or utf16le based on the internal representation.
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(input.length, 0);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(input.length * 2, 0);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16");


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16");

And here are the results:

before short latin1: 52.94ms
after short latin1: 21.978ms


before short utf16: 123.66ms
after short utf16: 55.009ms


before long latin1: 35.731ms
after long latin1: 24.655ms


before long utf16: 353.766ms
after long utf16: 58.19ms

I think the improvement comes from two aspects: one is the calculation of the byte length of strings, and the other is copying.
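
The byte-length part of that claim can be sketched as follows; this is an illustration, not a measurement. For latin1 and utf16le the byte length follows directly from the string's length, whereas the UTF-8 byte length depends on the characters and has to be computed. The sample strings are arbitrary.

```javascript
// Arbitrary samples: ASCII, Latin-1 with an accent, and surrogate pairs.
const samples = ['qwertyuiopasdfghjklzxcvbnm', 'caf\u00e9', '😁😁'];

const rows = samples.map((s) => ({
  // latin1 writes one byte per UTF-16 code unit (lossy above U+00FF).
  latin1Matches: Buffer.byteLength(s, 'latin1') === s.length,
  // utf16le writes two bytes per UTF-16 code unit.
  utf16leMatches: Buffer.byteLength(s, 'utf16le') === s.length * 2,
  // UTF-8 size varies per character, so it cannot be derived from length.
  utf8Bytes: Buffer.byteLength(s, 'utf8'),
}));

console.log(rows);
```

The same numbers also show the space trade-off: the ASCII sample needs 26 bytes as latin1 or UTF-8 but 52 as utf16le.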

nodejs-github-bot · 2025-01-03T17:38:11Z

CI: https://ci.nodejs.org/job/node-test-pull-request/64311/

addaleax · 2025-01-03T20:42:32Z

@theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

theweipeng · 2025-01-04T02:33:54Z

> @theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

I compared the current version against always using UTF-16LE, and there is still a noticeable improvement when the string is Latin-1. I think this improvement comes from avoiding the conversion from Latin-1 to UTF-16.

My machine: Apple M2 Pro 16GB

Here is my code:

const { isStringOneByteRepresentation } = require("v8");

// Scratch buffer: large enough for every test string plus a 4-byte length prefix.
const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    // Baseline: always encode as UTF-16LE, prefixed with the byte length.
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        // writeUint32LE takes (value, offset), so the length goes first.
        bf.writeUint32LE(input.length * 2, 0);
        bf.write(input, 4, 'utf16le');
    }
    console.timeEnd("before " + topic);

    // Candidate: pick latin1 or utf16le based on the internal representation.
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(input.length, 0);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(input.length * 2, 0);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16");


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16");

Here are the results:

before short latin1: 73.69ms
after short latin1: 22.572ms


before short utf16: 60.322ms
after short utf16: 57.602ms


before long latin1: 62.184ms
after long latin1: 25.765ms


before long utf16: 61.797ms
after long utf16: 56.201ms

nodejs-github-bot added the c++, needs-ci, and util labels Dec 5, 2024

theweipeng force-pushed the issue_56090 branch from 8944a8c to a4a445a Compare December 7, 2024 11:22

theweipeng mentioned this pull request Dec 10, 2024

Proposal: Add a method to check if a string is a OneByteString #56090

Open

theweipeng changed the title ~~util: add a method to get string encoding info~~ src: add a method to get string encoding info Dec 16, 2024

jakecastelli added the request-ci and review wanted labels Dec 17, 2024

github-actions bot removed the request-ci label Dec 17, 2024

jazelly requested a review from joyeecheung December 17, 2024 14:40

addaleax previously requested changes Dec 17, 2024

jazelly removed the request for review from joyeecheung December 18, 2024 04:24

theweipeng force-pushed the issue_56090 branch 3 times, most recently from 8a6f12e to 6f30b0e Compare December 21, 2024 14:20

addaleax reviewed Dec 21, 2024

theweipeng force-pushed the issue_56090 branch from 6f30b0e to 6eaf64c Compare December 22, 2024 09:45

theweipeng requested a review from addaleax December 25, 2024 03:03

addaleax added the request-ci label Dec 26, 2024

github-actions bot removed the request-ci label Dec 26, 2024

addaleax reviewed Dec 27, 2024

src: detect whether the string is one byte representation or not

65e06d9

References: nodejs#56090

theweipeng force-pushed the issue_56090 branch from 6eaf64c to 65e06d9 Compare December 28, 2024 11:26

addaleax added the request-ci label Dec 29, 2024

github-actions bot removed the request-ci label Dec 29, 2024

jasnell reviewed Dec 29, 2024

jasnell reviewed Dec 29, 2024

jasnell reviewed Dec 29, 2024

jasnell approved these changes Dec 29, 2024

doc: optimize the document

56278fe

targos previously requested changes Dec 31, 2024

doc: simplify the document

34d144f

legendecas reviewed Jan 3, 2025

doc: remove unused flags

bd086ce

legendecas added the commit-queue-squash and request-ci labels Jan 3, 2025

github-actions bot removed the request-ci label Jan 3, 2025

theweipeng mentioned this pull request Jan 9, 2025

feat(javascript): removing HPS apache/fury#2001

Draft
