Resynchronization #18

davebenson · 2022-08-05T05:51:14Z

muon is awesome, but the one thing it is lacking versus line-by-line json is the ability to seek around randomly in the stream.

the jsonl separator \n is hard to beat, but support for raw data makes that impossible.

I have found the AVRO strategy of picking a random 16-byte delimiter is nice.

I think some byte (maybe 0xf0) should indicate a delimiter follows. the first time will define the delimiter (which is chosen randomly); subsequent times should just be validated.
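A minimal sketch of that scheme, assuming a marker byte of 0xF0 (only a "maybe" in the suggestion above) and a 16-byte delimiter as in Avro. The records here are opaque byte strings standing in for encoded muon objects, and the function names `frame`, `learn_delimiter` and `resync` are made up for illustration:

```python
MARKER = 0xF0      # hypothetical marker byte, not an allocated muon tag
DELIM_LEN = 16     # delimiter length borrowed from the Avro approach


def frame(records, delimiter):
    """Concatenate records, writing marker + delimiter in front of each one."""
    if len(delimiter) != DELIM_LEN:
        raise ValueError("delimiter must be 16 bytes")
    out = bytearray()
    for rec in records:
        out.append(MARKER)
        out += delimiter
        out += rec
    return bytes(out)


def learn_delimiter(stream):
    """The first occurrence in the stream defines the delimiter."""
    if len(stream) < 1 + DELIM_LEN or stream[0] != MARKER:
        raise ValueError("stream does not start with a delimiter")
    return stream[1:1 + DELIM_LEN]


def resync(stream, offset, delimiter):
    """Return the start of the first record whose delimiter begins at or
    after `offset`, or -1 if there is none."""
    needle = bytes([MARKER]) + delimiter
    pos = stream.find(needle, offset)
    return -1 if pos < 0 else pos + len(needle)
```

A writer would pick the delimiter with something like `os.urandom(16)`; a reader that lands at an arbitrary offset calls `resync` and should still check that a valid object follows, since the pattern could in principle also occur inside raw data.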

vshymanskyy · 2022-08-05T10:33:01Z

@davebenson how about using a list as a root object (similar to muon chaining), and applying a 0x8B (size) tag to every element? This will make it impossible to run into clashes with a randomly selected delimiter.

vshymanskyy · 2022-08-17T11:39:37Z

Having a random delimiter could be a good idea; however, I believe a 16-byte delimiter is overkill.
Maybe something like: 1 tag byte + 6 payload bytes + 1 checksum byte?
And of course, the parser will have to check that a valid muon object follows.

vshymanskyy · 2022-08-17T11:48:40Z

It might be a good idea to also allocate 1 byte for the counter (increases with each delimiter). In this case:
tag (1B) + random (5B) + counter (1B) + checksum (1B) = 8 bytes total

5 random bytes give us 2^40 = 1,099,511,627,776 possible values.
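The 8-byte layout could be sketched roughly as follows. The tag value 0xF0 and the checksum (sum of the first seven bytes modulo 256) are assumptions made for the example; neither is defined anywhere in this thread.

```python
TAG = 0xF0  # hypothetical tag value


def checksum(data):
    """Assumed checksum: sum of the bytes modulo 256."""
    return sum(data) & 0xFF


def make_delimiter(rand5, counter):
    """Build tag (1B) + random (5B) + counter (1B) + checksum (1B)."""
    if len(rand5) != 5:
        raise ValueError("expected exactly 5 random bytes")
    body = bytes([TAG]) + rand5 + bytes([counter & 0xFF])  # counter wraps at 256
    return body + bytes([checksum(body)])


def parse_delimiter(buf, rand5):
    """Return the counter if `buf` is a valid delimiter for this stream, else None."""
    if len(buf) != 8 or buf[0] != TAG or buf[1:6] != rand5:
        return None
    if checksum(buf[:7]) != buf[7]:
        return None
    return buf[6]
```

Because the counter is a single byte it wraps around after 256 delimiters, so it can detect a skipped or duplicated delimiter nearby but not identify an absolute position in a long stream.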

vshymanskyy · 2022-08-17T13:40:26Z

@davebenson please provide some real-world use cases or elaborate on the motivation of this request.
Currently, muon objects can be concatenated as-is (i.e. JSONL is an extension of JSON, but Muon doesn't need any additional separators).
Also, you can reuse Muon magic or padding tags to have some kind of separator.

JobLeonard · 2022-08-18T12:55:38Z

Here is an idea for supporting random delimiters without actually having to change the current format. Not sure if that is actually something to worry about (it's not like there are a ton of muon parsers out there that would break, nor that muon is considered a stable format yet), but it might help to keep the core format itself smaller.

Let's assume that we're using the chaining approach suggested in the docs for concatenating muon objects.

The main idea would be to encode the delimiter in one or more valid muon objects. Then one can inject this (sequence of) object(s) as the delimiter between the muon objects we actually care about during concatenation. A parser that knows we are using this approach could avoid allocating the delimiter objects, improving performance. Otherwise they would have to be removed after parsing. Since we're creating a list, that would mean removing every other entry.

To signal that we're using delimiter objects we could do something similar to JavaScript's "use strict"; approach. If the first item of a chained list is a magic string - e.g. "seekable" or "resynchronizable" (eight or sixteen bytes + zero terminator, respectively) - it implies we are using delimiters. If we want the delimiters to stay user-definable we could say the second object then determines what the delimiter is (but that would be incompatible with @vshymanskyy's idea of adding a counter to it).
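To make the idea concrete, here is a rough sketch at the level of already-decoded values (plain Python lists, not muon bytes). It assumes the magic string "seekable" as the first item and one delimiter object in front of every real object; the helper names are invented for the example.

```python
MAGIC = "seekable"  # one of the magic strings floated above


def interleave(objects, delimiter):
    """Build the chained list: magic string, then delimiter/object pairs."""
    chain = [MAGIC]
    for obj in objects:
        chain.append(delimiter)
        chain.append(obj)
    return chain


def strip_delimiters(chain):
    """What a delimiter-oblivious parser would do after parsing:
    drop the magic string and every other entry."""
    if not chain or chain[0] != MAGIC:
        return list(chain)  # not using the delimiter convention
    body = chain[1:]
    delimiters = body[0::2]
    if any(d != delimiters[0] for d in delimiters):
        raise ValueError("inconsistent delimiter objects")
    return body[1::2]
```

A parser that knows about the convention would skip the delimiter entries while decoding instead of building the full list first.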

Next question: which object would be most suitable for this?

Hidden: boring analysis of typed arrays, base-64 strings and LEB128 encoded values, before I realized all of them are utterly inferior to using one or two U64 values

First, we could try a typed array. Doing so would add a storage overhead of three bytes - 84, B4, XX where the last byte is the length of the array (eight, sixteen, whatever we settle on). A significant downside would be that Typed Array allocation is both really, really slow in JavaScript, and has a relatively high memory overhead (in the order of 200 bytes), so using this object type would likely have a negative impact on performance on the parsing end when not discarding delimiters. Based on that I would advise against using them for this purpose.

One could store the separator as a base64-encoded string. That would increase the size by 1/3 since it encodes only six bits per byte, plus a final zero-terminator byte. Rounding up, encoding 16 bytes would take up 23 bytes this way, and 8 bytes would take 12. A bit bigger, but on the other hand string creation and memory overhead are both a lot more optimized in JS engines.

Another option is LEB128 encoding (or whatever varint encoding Muon will settle on - see #8). That would imply an overhead of 0xBB for the start, followed by seven bits per byte - encoding 16 bytes this way would take up 20 bytes, 8 bytes would take up 11 - almost as good as typed arrays! In JavaScript this would imply creating BigInts, for which I do not know where they stand in terms of memory overhead/performance. Given that they're supposed to be used for calculations (meaning they change a lot) one would hope that at least some effort has been put into optimizing them though.

The most efficient option is to use 0xB7 followed by eight bytes. It's the cheapest in terms of added storage overhead (from 8 bytes to 9 bytes, or from 16 bytes to 18 bytes when using two U64 values), and BigInts in JavaScript should be relatively cheap to create, so that shouldn't add too much overhead when dealing with parsers oblivious to this protocol. The only slightly tricky thing is that in that case, when filtering out the delimiters afterwards, the 16-byte option would require skipping two delimiter objects for each "real" object in our list.
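The size arithmetic for the four options can be checked with a few lines of Python. This is only a tally under the assumptions stated in the analysis (a one-byte array length, unpadded base64, worst-case LEB128 length, one tag byte per U64); the function names are invented for the example.

```python
import math


def typed_array_size(n):
    """0x84, 0xB4 and a one-byte length, then n raw bytes (assumes n < 128)."""
    return 3 + n


def base64_size(n):
    """Unpadded base64 (six bits per character) plus a zero terminator."""
    return math.ceil(n * 8 / 6) + 1


def leb128_size(n):
    """0xBB tag plus seven payload bits per byte, worst case."""
    return 1 + math.ceil(n * 8 / 7)


def u64_size(n):
    """One 0xB7 tag per eight payload bytes (n must be a multiple of 8)."""
    if n % 8:
        raise ValueError("payload must be a multiple of 8 bytes")
    return (n // 8) * 9


def encode_u64_delimiter(delim):
    """Split the delimiter into 8-byte chunks, each prefixed with 0xB7."""
    if len(delim) % 8:
        raise ValueError("delimiter must be a multiple of 8 bytes")
    out = bytearray()
    for i in range(0, len(delim), 8):
        out += b"\xB7" + delim[i:i + 8]
    return bytes(out)


for n in (8, 16):
    print(n, typed_array_size(n), base64_size(n), leb128_size(n), u64_size(n))
```

By this count a single U64 delimiter costs one byte of overhead and a 16-byte delimiter split over two U64 values costs two.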

Again, just some thoughts on how to support this with relatively low complexity without actually having to change the Muon format.

vshymanskyy · 2022-08-18T20:48:46Z

i'm not opposed to adding a specialized tag for this. we have more than enough unallocated markers. just need to understand the rationale

vshymanskyy · 2022-08-19T12:15:47Z

@davebenson waiting for your inputs here

davebenson · 2022-08-20T02:46:48Z

I think it's probably fine to not implement this stuff in core muon. Probably searching, indexing, and resynchronization are beyond the core standard. (JSONL isn't really an extension of JSON; it's a way of framing (aka packetizing) JSON that provides certain conveniences that raw concatenated JSON doesn't have. Assuming that your stream is objects and arrays, JSON doesn't need end-of-record delimiters either.)

But JSONL is binary-searchable -- assuming the lines are sorted. And files can be broken apart without reading the whole file - which is important if you want to get parallelism in processing a single file. And it is resynchronizable as well, meaning that you can seek midway into a huge file, and also that corrupt entries can be skipped.
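For reference, the JSONL behaviour described here (seek anywhere, realign on the next newline, skip corrupt entries) takes only a few lines. This is an illustrative sketch using the standard library; `next_record` is a made-up helper name.

```python
import io
import json


def next_record(f, offset):
    """Seek to `offset` in a binary JSONL file and return the first
    complete, parseable record that starts at or after it (None at EOF)."""
    if offset > 0:
        # If the previous byte is not a newline we are mid-record:
        # throw away the rest of that line.
        f.seek(offset - 1)
        if f.read(1) != b"\n":
            f.readline()
    else:
        f.seek(0)
    while True:
        line = f.readline()
        if not line:
            return None
        try:
            return json.loads(line)
        except ValueError:
            continue  # corrupt or empty line: skip it and keep going


data = b'{"id": 1}\n{"id": 2}\n{oops\n{"id": 4}\n'
print(next_record(io.BytesIO(data), 12))
```

The same function is the building block for splitting a file between workers or for binary search over sorted lines: each probe seeks to a byte offset and realigns on the next newline.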


vshymanskyy added question feature labels Aug 17, 2022

vshymanskyy assigned davebenson Aug 17, 2022

vshymanskyy changed the title ~~support for resynchronization~~ Resynchronization Aug 17, 2022

vshymanskyy closed this as not planned Sep 21, 2022
