src: add a method to get string encoding info #56147
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #56147 +/- ##
==========================================
+ Coverage 87.99% 88.54% +0.55%
==========================================
Files 656 657 +1
Lines 188999 190794 +1795
Branches 35981 36611 +630
==========================================
+ Hits 166301 168933 +2632
+ Misses 15865 15044 -821
+ Partials 6833 6817 -16
8944a8c to a4a445a
I've been waiting for some feedback on this PR. Could someone take a look? I'd appreciate any feedback you can provide. Thanks! @nodejs-github-bot
If you want to add an API like this, keep in mind that this
- does not return any information about the string itself, which does not have an encoding per se, only its underlying representation in the JS engine
- does not provide reliable output
You'll probably want to rename it and modify the documentation as needed.
It's hard to see an actual use case for this API, though. Getting this information can be useful when dealing with JS strings in C++, but there this information is already directly available.
Actually, I see that @joyeecheung already left a number of helpful comments on the original ticket. Moving this to the So – Joyee made a lot of good suggestions here already, and you'll probably just want to incorporate them.
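To illustrate the distinction drawn above between a string's content and its representation in the engine, here is a minimal sketch. It assumes the method is exposed on the `v8` module as `isStringOneByteRepresentation`, the name used in the benchmarks later in this thread, and it only calls the method when the running Node.js version provides it. The `fitsInLatin1` helper is hypothetical and inspects content only.

```javascript
const v8 = require('v8');

// Hypothetical helper: true when every UTF-16 code unit is <= 0xFF.
// This describes the string's content, not how V8 happens to store it.
function fitsInLatin1(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}

// Assumption: the method may be missing on older runtimes, so feature-detect it.
const hasApi = typeof v8.isStringOneByteRepresentation === 'function';

for (const sample of ['hello', 'caf\u00e9', '\u{1F601}']) {
  const content = fitsInLatin1(sample);
  const representation = hasApi ? v8.isStringOneByteRepresentation(sample) : 'n/a';
  console.log(JSON.stringify(sample), { content, representation });
}
```

A one-byte representation implies that the content fits in Latin1, but the reverse does not have to hold: the engine may store a Latin1-only string in two-byte form, so the two columns are not guaranteed to agree.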
Understood, I'll correct the API placement. Thanks for pointing it out.
8a6f12e to 6f30b0e
Removing the 'changes requested' marker but this should have better documentation and, ideally, a use case that explains why we're adding this to the API
6f30b0e to 6eaf64c
Could you please review the document changes and trigger the code CI?
Looks like a flaky test. Needs a re-run I think. @nodejs-github-bot
and your review has been addressed
@theweipeng Yeah, the code looks good here. I've left two minor notes on the documentation, but definitely nothing blocking.
However, I've suggested above twice to provide a use case for this method -- i.e. a reason or example that explains why this should be part of the Node.js API. I think it's okay to consider that a requirement for merging a PR.
6eaf64c to 65e06d9
@nodejs-github-bot Needs a re-run please, I have corrected the documentation.
I have the same kinds of concerns here as @addaleax @joyeecheung have raised about how specific to v8's current internal representation this is and that it could potentially come back to bite us later if v8 decided to change that, but overall the code and docs here LGTM.
I do worry that for smaller strings, the overhead of performing the check might actually be too high to worry about. This almost borders on being too niche of a use case but overall I've got no reason to block. Approving based on the code changes looking good.
@addaleax I've updated the PR based on your feedback. Could you please take a moment to check if it's ready to merge or if there's anything else that needs attention?
There is still no clear use case for this. The OP mentions ucs2Write and latin1WriteStatic, but these are internal methods, so they cannot be used as arguments for adding this method publicly.
Sorry, I had a problem with the expression in the OP. I didn't think to use these private APIs; I used
Can you please answer @addaleax's question? #56147 (review)
Following her suggestion in the comment #56147 (comment), I have written an example in the documentation to explain why we need this API. https://github.com/nodejs/node/pull/56147/files#diff-fc79b2d1ad702cfaf107d5880b73e8360b36273edda73f128c00641637435c3cR1368
@theweipeng Sure, but to be clear, I wasn't asking for a complex example in the documentation (because complex documentation can easily distract from the important bit – what the method actually does). I'd say that earlier versions of the documentation in this PR were better in that regard. Based on the documentation example here, I can measure about a 0.7–1.4% runtime performance increase with this method. I guess that's a reason to add this method – it's not a particularly significant difference, and the benefits of using less space will likely significantly outweigh the benefits of faster data copying (so if this is the reason to add this, and if we keep this full-featured example in the documentation, we should at least be honest about it and not refer to significant performance benefits). Either way, I'm not blocking anything here. If this is good with @jasnell it's good with me.
Thank you for your patient guidance, I've learned a lot from it. I will simplify that document. Regarding your test, may I take a look at your test code? I ask because the improvement is quite noticeable on my machine.
const { isStringOneByteRepresentation } = require("v8");
const bf = Buffer.alloc(1000);
// Time a length-prefixed write with and without the representation check.
function benchmark(input, topic) {
  console.time("before " + topic);
  for (let index = 0; index < 999999; index++) {
    // writeUint32LE takes (value, offset): the length prefix goes at offset 0.
    bf.writeUint32LE(Buffer.byteLength(input, 'utf8'), 0);
    bf.write(input, 4, 'utf8');
  }
  console.timeEnd("before " + topic);
  console.time("after " + topic);
  for (let index = 0; index < 999999; index++) {
    if (isStringOneByteRepresentation(input)) {
      // One-byte representation: one byte per character.
      bf.writeUint32LE(input.length, 0);
      bf.write(input, 4, 'latin1');
    } else {
      // Two-byte representation: two bytes per UTF-16 code unit.
      bf.writeUint32LE(input.length * 2, 0);
      bf.write(input, 4, 'utf16le');
    }
  }
  console.timeEnd("after " + topic);
  console.log("\n");
}
benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16");
benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16");
And here are the results:
I think the improvement comes from two aspects: one is the calculation of the byte length of strings, and the other is copying.
@theweipeng I was comparing the current version against always using UTF-16LE, to be clear, not against UTF-8. Otherwise I don't think you end up with a fair comparison (UTF-8 mainly has space saving advantages, but this seems to be about runtime performance instead).
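The space side of this trade-off can be made concrete with a short sketch that prints how many bytes each encoding needs for a few sample strings. The sample strings and the `encodedSizes` helper are illustrative assumptions, not part of the PR. Note that the Latin1 and UTF-16LE sizes follow directly from `str.length`, while the UTF-8 size depends on the characters.

```javascript
// Illustrative samples: plain ASCII, one accented character, one emoji.
const samples = {
  ascii: 'qwertyuiopasdfghjklzxcvbnm',
  accented: 'caf\u00e9',
  emoji: '\u{1F601}',
};

// Hypothetical helper: byte size of a string under each encoding.
function encodedSizes(str) {
  return {
    utf8: Buffer.byteLength(str, 'utf8'),
    utf16le: Buffer.byteLength(str, 'utf16le'),
    // Only meaningful when every code unit is <= 0xFF; otherwise data is lost.
    latin1: Buffer.byteLength(str, 'latin1'),
  };
}

for (const [name, str] of Object.entries(samples)) {
  console.log(name, encodedSizes(str));
}
```

For the ASCII sample, UTF-8 and Latin1 need 26 bytes each and UTF-16LE needs 52, which is why a comparison against UTF-8 mixes a size difference into the timing.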
I compared the current version against always using UTF-16LE, and there is still a noticeable improvement when the string is in Latin1. I think this improvement comes from skipping the Latin1-to-UTF-16 encoding step.
My machine: Apple M2 Pro 16GB
Here is my code:
const { isStringOneByteRepresentation } = require("v8");
const bf = Buffer.alloc(1000);
// Compare always writing UTF-16LE against choosing the encoding per string.
function benchmark(input, topic) {
  console.time("before " + topic);
  for (let index = 0; index < 999999; index++) {
    // writeUint32LE takes (value, offset): the length prefix goes at offset 0.
    bf.writeUint32LE(input.length * 2, 0);
    bf.write(input, 4, 'utf16le');
  }
  console.timeEnd("before " + topic);
  console.time("after " + topic);
  for (let index = 0; index < 999999; index++) {
    if (isStringOneByteRepresentation(input)) {
      // One-byte representation: one byte per character.
      bf.writeUint32LE(input.length, 0);
      bf.write(input, 4, 'latin1');
    } else {
      // Two-byte representation: two bytes per UTF-16 code unit.
      bf.writeUint32LE(input.length * 2, 0);
      bf.write(input, 4, 'utf16le');
    }
  }
  console.timeEnd("after " + topic);
  console.log("\n");
}
benchmark(new Array(1).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "short latin1");
benchmark(new Array(1).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "short utf16");
benchmark(new Array(5).fill(0).map(x => "qwertyuiopasdfghjklzxcvbnm").join(''), "long latin1");
benchmark(new Array(5).fill(0).map(x => "😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁😁").join(''), "long utf16");
Here are the results:
util: Add a method to get the encoding information of a string.
Currently, we should check the encoding of the string before we use buffer.write("string", 'latin1') and buffer.write("string", 'utf16le') to write to the buffer.
This PR adds a method to return the encoding information from V8.
Closes: #56090
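The use case described here can be sketched as a small length-prefixed writer and reader. This is an illustration under stated assumptions, not the code from the PR: the frame layout, the `writeString` and `readString` helpers, and the tag constants are all hypothetical, and the method name `isStringOneByteRepresentation` on the `v8` module is taken from the benchmarks in this thread. When the runtime does not provide the method, the sketch falls back to scanning the content.

```javascript
const v8 = require('v8');

// Assumption: use the V8 check when available, otherwise scan the content.
const isOneByte = typeof v8.isStringOneByteRepresentation === 'function'
  ? (str) => v8.isStringOneByteRepresentation(str)
  : (str) => {
    for (let i = 0; i < str.length; i++) {
      if (str.charCodeAt(i) > 0xff) return false;
    }
    return true;
  };

// Hypothetical encoding tags stored in the first byte of each frame.
const LATIN1 = 0;
const UTF16LE = 1;

// Hypothetical frame layout: [1-byte tag][4-byte LE byte length][payload].
// Returns the offset just past the written frame.
function writeString(buf, str, offset = 0) {
  const oneByte = isOneByte(str);
  const byteLength = oneByte ? str.length : str.length * 2;
  buf.writeUInt8(oneByte ? LATIN1 : UTF16LE, offset);
  buf.writeUInt32LE(byteLength, offset + 1); // value first, then offset
  buf.write(str, offset + 5, byteLength, oneByte ? 'latin1' : 'utf16le');
  return offset + 5 + byteLength;
}

// Reads back a frame written by writeString.
function readString(buf, offset = 0) {
  const encoding = buf.readUInt8(offset) === LATIN1 ? 'latin1' : 'utf16le';
  const byteLength = buf.readUInt32LE(offset + 1);
  return buf.toString(encoding, offset + 5, offset + 5 + byteLength);
}

const buf = Buffer.alloc(64);
for (const sample of ['hello', 'caf\u00e9', '\u{1F601}']) {
  writeString(buf, sample);
  console.log(readString(buf) === sample);
}
```

The tag byte is needed because the reader cannot otherwise tell which encoding was chosen. The round trip stays correct even if the engine reports a two-byte representation for a Latin1-only string, since UTF-16LE can encode any JavaScript string.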