src: add a method to get string encoding info #56147

Open
wants to merge 4 commits into
base: main
from

Conversation

@theweipeng theweipeng commented Dec 5, 2024

util: Add a method to get the encoding information of a string.

Before writing a string to a buffer with buffer.write("string", 'latin1') or buffer.write("string", 'utf16le'), we need to know which of the two encodings suits the string.
This PR adds a method to return the encoding information from V8.
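
A minimal sketch of that use case, assuming a hypothetical wire format (one flag byte, a 4-byte little-endian length, then the payload). The helpers `writeString` and `readString` and the content-scan fallback are illustrative assumptions, not part of the PR; `v8.isStringOneByteRepresentation` is the name the method takes later in this thread.

```javascript
const v8 = require('node:v8');

// Use the V8 check when this Node.js build provides it; otherwise fall back
// to a content scan (an assumption for illustration only).
const isOneByte = typeof v8.isStringOneByteRepresentation === 'function'
  ? v8.isStringOneByteRepresentation
  : (s) => {
    for (let i = 0; i < s.length; i++) {
      if (s.charCodeAt(i) > 0xff) return false;
    }
    return true;
  };

// Hypothetical format: [1-byte flag][4-byte LE payload length][payload].
function writeString(buf, str, offset = 0) {
  const oneByte = isOneByte(str);
  const encoding = oneByte ? 'latin1' : 'utf16le';
  const byteLength = oneByte ? str.length : str.length * 2;
  buf.writeUInt8(oneByte ? 1 : 0, offset);
  buf.writeUInt32LE(byteLength, offset + 1);
  buf.write(str, offset + 5, byteLength, encoding);
  return offset + 5 + byteLength; // offset just past the payload
}

function readString(buf, offset = 0) {
  const encoding = buf.readUInt8(offset) === 1 ? 'latin1' : 'utf16le';
  const byteLength = buf.readUInt32LE(offset + 1);
  const start = offset + 5;
  return buf.toString(encoding, start, start + byteLength);
}

const buf = Buffer.alloc(64);
const end = writeString(buf, 'hello');
console.log(readString(buf), end);
```

The flag byte is what lets a reader decode without guessing; without it, the writer's choice of encoding would be lost.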

Closes: #56090

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. needs-ci PRs that need a full CI run. util Issues and PRs related to the built-in util module. labels Dec 5, 2024

codecov bot commented Dec 5, 2024

Codecov Report

Attention: Patch coverage is 89.65517% with 3 lines in your changes missing coverage. Please review.

Project coverage is 88.54%. Comparing base (3c2da4b) to head (bd086ce).
Report is 171 commits behind head on main.

Files with missing lines | Patch % | Lines
src/node_v8.cc | 81.25% | 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #56147      +/-   ##
==========================================
+ Coverage   87.99%   88.54%   +0.55%     
==========================================
  Files         656      657       +1     
  Lines      188999   190794    +1795     
  Branches    35981    36611     +630     
==========================================
+ Hits       166301   168933    +2632     
+ Misses      15865    15044     -821     
+ Partials     6833     6817      -16     
Files with missing lines | Coverage Δ
lib/v8.js | 99.34% <100.00%> (+0.70%) ⬆️
src/node_external_reference.h | 100.00% <ø> (ø)
src/node_v8.cc | 90.67% <81.25%> (-0.52%) ⬇️

... and 138 files with indirect coverage changes

@theweipeng theweipeng changed the title util: add a method to get string encoding info src: add a method to get string encoding info Dec 16, 2024
@theweipeng
Author

I've been waiting for feedback on this PR. Could someone take a look? I'd appreciate any feedback you can provide. Thanks! @nodejs-github-bot

@jakecastelli jakecastelli added request-ci Add this label to start a Jenkins CI on a PR. review wanted PRs that need reviews. labels Dec 17, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Dec 17, 2024
@jazelly jazelly requested a review from joyeecheung December 17, 2024 14:40
Member

@addaleax addaleax left a comment

If you want to add an API like this, keep in mind that this

  • does not return any information about the string itself, which does not have an encoding per se, only its underlying representation in the JS engine
  • does not provide reliable output

You'll probably want to rename it and modify the documentation as needed.

It's hard to see an actual use case for this API, though. Getting this information can be useful when dealing with JS strings in C++, but there this information is already directly available.
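
A rough sketch of the reliability concern: two strings with identical content are not guaranteed to share an internal representation. The helper `fitsLatin1` is a hypothetical content-based check added for contrast, and the method name `isStringOneByteRepresentation` is the one used later in this thread. The last two values are deliberately not annotated, since the engine decides them.

```javascript
const v8 = require('node:v8');

// Content-based check: true when every UTF-16 code unit fits in one byte.
function fitsLatin1(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

const literal = 'hello';
// Same characters, but derived from a string that contained an emoji.
const derived = ('hello' + String.fromCodePoint(0x1f601)).slice(0, 5);

console.log(literal === derived);                      // true: equal content
console.log(fitsLatin1(literal), fitsLatin1(derived)); // true true

if (typeof v8.isStringOneByteRepresentation === 'function') {
  // These two results are allowed to differ; they describe how the engine
  // happens to store each string, not what the string contains.
  console.log(v8.isStringOneByteRepresentation(literal),
              v8.isStringOneByteRepresentation(derived));
}
```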

@addaleax
Member

addaleax commented Dec 17, 2024

Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.

So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.
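
To illustrate the endianness point, here is a small sketch. The 'utf16le' encoding always serializes little-endian, while data held in memory follows the host byte order. The typed array is only an analogy for native in-memory layout; it does not inspect how V8 stores strings.

```javascript
const os = require('node:os');

// 'utf16le' always serializes little-endian, whatever the host byte order is.
const encoded = Buffer.from('A', 'utf16le');
console.log(encoded); // <Buffer 41 00>

// A typed array, by contrast, exposes the platform's native byte order.
const native = new Uint8Array(new Uint16Array([0x0041]).buffer);
const hostIsLittleEndian = native[0] === 0x41;
console.log(os.endianness(), hostIsLittleEndian);
```

On a big-endian host the two views would disagree, which is why a name tied to "utf16le" would misdescribe the in-memory representation.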

@theweipeng
Author

> Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the v8 API makes a bit more sense, because it really is engine-specific, and she suggests a better name for it that reflects the subtleties involved here (isOneByteRepresentation), plus points out that utf16le isn't accurate because we'd need to take platform endianness into account.
>
> So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.

Understood, I'll correct the API placement. Thanks for pointing it out.

@jazelly jazelly removed the request for review from joyeecheung December 18, 2024 04:24
@theweipeng theweipeng force-pushed the issue_56090 branch 3 times, most recently from 8a6f12e to 6f30b0e Compare December 21, 2024 14:20
Member

@addaleax addaleax left a comment

Removing the 'changes requested' marker but this should have better documentation and, ideally, a use case that explains why we're adding this to the API

doc/api/v8.md (outdated, resolved)
@theweipeng
Author

> Removing the 'changes requested' marker but this should have better documentation and, ideally, a use case that explains why we're adding this to the API

Could you please review the document changes and trigger the code CI?

@addaleax addaleax added the request-ci Add this label to start a Jenkins CI on a PR. label Dec 26, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Dec 26, 2024
@theweipeng
Author

Looks like a flaky test. Needs a re-run I think. @nodejs-github-bot

Member

@addaleax addaleax left a comment

and your review has been addressed

@theweipeng Yeah, the code looks good here. I've left two minor notes on the documentation, but definitely nothing blocking.

However, I've suggested above twice to provide a use case for this method -- i.e. a reason or example that explains why this should be part of the Node.js API. I think it's okay to consider that a requirement for merging a PR.

doc/api/v8.md (outdated, resolved)
doc/api/v8.md (outdated, resolved)
@theweipeng
Author

@nodejs-github-bot Needs a re-run please, I have corrected the documentation.

@addaleax addaleax added the request-ci Add this label to start a Jenkins CI on a PR. label Dec 29, 2024
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Dec 29, 2024
doc/api/v8.md (outdated, resolved)
doc/api/v8.md (outdated, resolved)
doc/api/v8.md (outdated, resolved)
Member

@jasnell jasnell left a comment

I have the same kinds of concerns here as @addaleax @joyeecheung have raised about how specific to v8's current internal representation this is and that it could potentially come back to bite us later if v8 decided to change that, but overall the code and docs here LGTM.

I do worry that for smaller strings, the overhead of performing the check might actually be too high to worry about. This almost borders on being too niche of a use case, but overall I've got no reason to block. Approving based on the code changes looking good.

@theweipeng
Author

@addaleax I've updated the PR based on your feedback. Could you please take a moment to check if it's ready to merge or if there's anything else that needs attention?

targos previously requested changes Dec 31, 2024
Member

@targos targos left a comment

There is still no clear use case for this. The OP mentions ucs2Write and latin1WriteStatic but these are internal methods so cannot be used as arguments to add this method publicly.

@theweipeng
Author

> There is still no clear use case for this. The OP mentions ucs2Write and latin1WriteStatic but these are internal methods so cannot be used as arguments to add this method publicly.

Sorry, the OP was poorly worded. I never intended to use these private APIs; the documentation example uses buffer.write. I have corrected the OP.

@targos
Member

targos commented Dec 31, 2024

Can you please answer @addaleax's question? #56147 (review)

@theweipeng
Author

> Can you please answer @addaleax's question? #56147 (review)

Following her suggestion in the comment #56147 (comment), I have written an example in the documentation to explain why we need this API. https://github.com/nodejs/node/pull/56147/files#diff-fc79b2d1ad702cfaf107d5880b73e8360b36273edda73f128c00641637435c3cR1368

@addaleax
Member

addaleax commented Jan 3, 2025

@theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.

Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).
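
The space side of this trade-off can be sketched as follows. The helpers `fitsLatin1` and `payloadSizes` are hypothetical, and the "adaptive" strategy uses a content scan as a stand-in for the engine-level check so that the numbers are deterministic.

```javascript
// Content-based stand-in for the representation check (an assumption).
function fitsLatin1(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

// Payload size in bytes under three strategies.
function payloadSizes(str) {
  return {
    utf8: Buffer.byteLength(str, 'utf8'),
    utf16le: str.length * 2,
    adaptive: fitsLatin1(str) ? str.length : str.length * 2,
  };
}

const ascii = 'qwertyuiopasdfghjklzxcvbnm';
const accented = 'na\u00efve caf\u00e9';
const emoji = '😁'.repeat(17);

console.log(payloadSizes(ascii));    // { utf8: 26, utf16le: 52, adaptive: 26 }
console.log(payloadSizes(accented)); // { utf8: 12, utf16le: 20, adaptive: 10 }
console.log(payloadSizes(emoji));    // { utf8: 68, utf16le: 68, adaptive: 68 }
```

For one-byte content, the adaptive strategy halves the payload relative to always writing UTF-16LE, which is a separate benefit from any copying speedup.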

Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.

@theweipeng
Author

> @theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard.
>
> Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits).
>
> Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.

Thank you for your patient guidance; I've learned a lot from it. I will simplify the documentation. Regarding your test, may I take a look at your test code? The improvement is quite noticeable on my machine.
Here is my test code:

const { isStringOneByteRepresentation } = require("v8");

const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    // Baseline: always encode as UTF-8, with a 4-byte length prefix at offset 0.
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        // writeUint32LE takes (value, offset): the length goes first.
        bf.writeUint32LE(Buffer.byteLength(input, 'utf8'), 0);
        bf.write(input, 4, 'utf8');
    }
    console.timeEnd("before " + topic);

    // Candidate: pick latin1 or utf16le based on V8's internal representation.
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(input.length, 0);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(input.length * 2, 0);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16")


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16")

And here are the results:

before short latin1: 52.94ms
after short latin1: 21.978ms


before short utf16: 123.66ms
after short utf16: 55.009ms


before long latin1: 35.731ms
after long latin1: 24.655ms


before long utf16: 353.766ms
after long utf16: 58.19ms

I think the improvement comes from two aspects: one is the calculation of the byte length of strings, and the other is copying.
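
The length-calculation half of that claim can be sketched like this. The helper `lengths` is hypothetical. The point is that the Latin-1 and UTF-16LE byte lengths follow directly from `str.length`, whereas the UTF-8 byte length depends on the string's content.

```javascript
// Byte length of a string's payload under each encoding.
function lengths(str) {
  return {
    utf8: Buffer.byteLength(str, 'utf8'), // depends on the characters
    latin1: str.length,                   // one byte per code unit
    utf16le: str.length * 2,              // two bytes per code unit
  };
}

const accented = '\u00e9t\u00e9'; // "été"
console.log(lengths(accented)); // { utf8: 5, latin1: 3, utf16le: 6 }

// The shortcuts agree with what Buffer itself reports.
console.log(Buffer.byteLength(accented, 'latin1') === accented.length);      // true
console.log(Buffer.byteLength(accented, 'utf16le') === accented.length * 2); // true
```

The `latin1` figure is only meaningful when every code unit fits in one byte, which is the condition the proposed check is meant to establish.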

@targos targos dismissed their stale review January 3, 2025 07:33

answered

@legendecas legendecas added commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. request-ci Add this label to start a Jenkins CI on a PR. labels Jan 3, 2025
@github-actions github-actions bot removed the request-ci Add this label to start a Jenkins CI on a PR. label Jan 3, 2025
@addaleax
Member

addaleax commented Jan 3, 2025

@theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

@theweipeng
Author

> @theweipeng I was comparing the current version against always using UTF16-LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).

I compared the current version against always using UTF-16LE, and there is still a noticeable improvement when the string is in Latin1. I think this improvement comes from avoiding the Latin1-to-UTF-16 encoding step.

My machine: Apple M2 Pro 16GB

Here is my code:

const { isStringOneByteRepresentation } = require("v8");

const bf = Buffer.alloc(1000);

function benchmark(input, topic) {
    // Baseline: always encode as UTF-16LE, with a 4-byte length prefix at offset 0.
    console.time("before " + topic);
    for (let index = 0; index < 999999; index++) {
        // writeUint32LE takes (value, offset): the length goes first.
        bf.writeUint32LE(input.length * 2, 0);
        bf.write(input, 4, 'utf16le');
    }
    console.timeEnd("before " + topic);

    // Candidate: pick latin1 or utf16le based on V8's internal representation.
    console.time("after " + topic);
    for (let index = 0; index < 999999; index++) {
        if (isStringOneByteRepresentation(input)) {
            bf.writeUint32LE(input.length, 0);
            bf.write(input, 4, 'latin1');
        } else {
            bf.writeUint32LE(input.length * 2, 0);
            bf.write(input, 4, 'utf16le');
        }
    }
    console.timeEnd("after " + topic);
    console.log("\n");
}

benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16")


benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16")

Here are the results:

before short latin1: 73.69ms
after short latin1: 22.572ms


before short utf16: 60.322ms
after short utf16: 57.602ms


before long latin1: 62.184ms
after long latin1: 25.765ms


before long utf16: 61.797ms
after long utf16: 56.201ms

Labels
c++ Issues and PRs that require attention from people who are familiar with C++. commit-queue-squash Add this label to instruct the Commit Queue to squash all the PR commits into the first one. needs-ci PRs that need a full CI run. review wanted PRs that need reviews. util Issues and PRs related to the built-in util module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Add a method to check if a string is a OneByteString
7 participants