-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
more efficient stream parsing #212
Comments
You could try CSV/TSV formats for selects and do your own row parsing after calling I am also considering adding |
Thanks for your advice. I wrote a simple datatype-aware parser to split TSV and convert "numbers" to numbers, and speed improved by x10 at least, amazing. |
@movy, I think RowBinary implementation could yield even more significant gains. See https://github.com/luizperes/simdjson_nodejs#benchmarks - simdjson is 1.5-2x faster than JSON.parse. It is probably worth adding as an optional switch, I will eventually look at what's possible here. I am considering, however, adding RowBinary alongside the third "get data" method to the Row object: something like |
Yes, |
@movy, I am curious to run some internal tests on this (JSON vs. Text + "manual" numbers parsing vs. RowBinary + row as an array). Can you please contact me in the community Slack and share the code snippets and, maybe, a sample dataset (or its definitions) for your scenarios? |
@slvrtrn @movy running into a similar scenario where JSON.parse isn't efficient enough for the result set I'm working with. Played around with the options described in this thread and have measured some improvement (particularly using simdjson instead of JSON.parse) but was curious to see how your testing is going/if there's an optimized strategy for these cases. |
I wanted to post an update here, as this issue looks stale while it is not. I've been studying RowBinary format and writing an experimental implementation of it with select streaming support out of the box for quite some time already; current progress is here: 1.0.0...row_binary At the moment of writing this, that branch supports decoding a minimal set of useful data types:
Currently in progress, as I consider the following types should be supported straight away (there are drafts decoders/placeholders for some already): Decimal, Enum, UUID, FixedString, DateTime(64), Array, Map, Tuple. NB: under the hood, Date is just a UInt16 (number of days since Unix Epoch), and Date32 is just an Int32 (also days, but before/after). JS Date is chosen only as the default option for these types; there will be a mapping option to convert the "days" numeric value to any other object. For example, consider the Date type. The mapper is just: <T>(daysSinceEpoch: number) => T And, by default, if no custom mapper is provided, it is something like this: const defaultDateMapper = (daysSinceEpoch: number) => new Date(daysSinceEpoch * DayMillis) Similarly, Date32, DateTime, DateTime64 (especially DateTime(9) as it requires proper nanoseconds support), and Decimal will have optional mappers support from their decoded "raw" representation. @movy @chrisjh (and @cjk, I saw your reaction, too!), if you are still interested, I'd be curious to learn a few things:
|
Let's continue tracking this issue here: #216; please don't hesitate to comment. Cheers. |
Hello,
I continuously stream massive amounts of market data from ClickHouse and after profiling my app I discovered that most ticks are spent in stream parsing, i..e:
I briefly checked the module source code, and looks like quite expensive
JSON.parse()
is used internally even when we calltext()
.My question would be **is there a faster way to parse incoming stream maybe by using some raw output format? ** Or am I using reading streams wrong?
The text was updated successfully, but these errors were encountered: