Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resynchronization #18

Closed
davebenson opened this issue Aug 5, 2022 · 8 comments
Closed

Resynchronization #18

davebenson opened this issue Aug 5, 2022 · 8 comments
Assignees

Comments

@davebenson
Copy link

muon is awesome, but the one thing it is lacking versus line-by-line json is ability to seek around randomly in the stream.

the jsonl separator \n is hard to beat, but support for raw data makes that impossible.

I have found the AVRO strategy of picking a random 16-byte delimiter is nice.

I think some byte (maybe 0xf0) should indicate a delimiter follows. the first time will define the delimiter (which is chosen randomly); subsequent times should just be validated.

@vshymanskyy
Copy link
Owner

vshymanskyy commented Aug 5, 2022

@davebenson how about using a list as a root object (similar to muon chaining), and applying a 0x8B (size) tag to every element? This will make it impossible to run into clashes with a randomly selected delimiter.

@vshymanskyy
Copy link
Owner

vshymanskyy commented Aug 17, 2022

Having a random delimiter could be a good idea, however, I believe a 16-byte delimiter is an overkill.
Maybe something like: 1 tag byte + 6x payload bytes + 1 byte checksum?
And of course, the parser will have to check that a valid muon object follows.

@vshymanskyy
Copy link
Owner

vshymanskyy commented Aug 17, 2022

It might be a good idea to also allocate 1 byte for the counter (increases with each delimiter). In this case:
tag (1B) + random (5B) + counter (1B) + checksum (1B) = 8 bytes total

5 random bytes gives us 1,099,511,627,775 permutations.

@vshymanskyy vshymanskyy changed the title support for resynchronization Resynchronization Aug 17, 2022
@vshymanskyy
Copy link
Owner

vshymanskyy commented Aug 17, 2022

@davebenson please provide some real-world use cases or elaborate on the motivation of this request.
Currently, muon objects can be concatenated as-is (i.e. JSONL is an extension of JSON, but Muon doesn't need any additional separators).
Also, you can reuse Muon magic or padding tags to have some kind of separator.

@JobLeonard
Copy link

Here is an idea for supporting random delimiters without actually having to change the current format. Not sure if that is actually something to worry about (it's not like there are a ton of muon parsers out there that would break, nor that muon is considered a stable format yet), but it might help to keep the core format itself smaller.

Let's assume that we're using the chaining approach suggested in the docs for concatenating muon objects.

The main idea would be to encode the delimiter in one or more valid muon objects. Then one can inject this (sequence of) object(s) as the delimiter between the muon objects we actually care about during concatenation. A parser that knows we are using this approach could avoid allocating the delimiter objects, improving performance. Otherwise they would have to be removed after parsing. Since we're creating a list, that would mean removing every other entry.

To signal that we're using delimiter objects we could do something similar to JavaScript's "use strict"; approach. If the first item of a chained list is a magic string - e.g. "seekable"or "resynchronizable" (eight or sixteen bytes + zero terminator, respectively), it implies we are using delimiters. If we want the delimiters to stay user-definable we could say the second object then determines what the object is (but that would be incompatible with @vshymansky's idea of adding a counter to it).

Next question: which object would be most suitable for this?

Hidden: boring analysis of typed arrays, base-64 strings and LEB128 encoded values, before I realized all of which them are utterly inferior to using one or two U64 values

First, we could try a typed array. Doing so would add a storage overhead of three bytes - 84, B4, XX where the last byte is the length of the array (eight, sixteen, whatever we settle on). A significant downside would be that Typed Array allocation is both really, really slow in JavaScript, and has a relatively high memory overhead (in the order of 200 bytes), so using this object type would likely have a negative impact on performance on the parsing end when not discarding delimiters. Based on that I would advise against using them for this purpose.

One could store the separator as a base64-encoded string. That would increase the size by 1/3 since it encodes only six bits per byte, plus a final zero-terminator byte. Rounding up encoding 16 bytes would take up 23 bytes this way, and 8 bytes would take 12. A bit bigger, but on the other hand string creation and memory overhead are both a lot more optimized in JS engines.

Another option is LEB128 encoding (or whatever varint encoding Muon will settle on - see #8). That would imply an overhead of 0xBB for the start, followed by seven bits per byte - encoding 16 bytes this way would take up 20 bytes, 8 bytes would take up 11 - almost as good as typed arrays! In JavaScript this would imply creating BigInts, for which I do not know where they stand in terms of memory overhead/performance. Given that they're supposed to be used for calculations (meaning they change a lot) one would hope that at least some effort has been put into optimizing them though.

The most efficient option is to use 0xB7 followed by eight bytes. It's the cheapest in terms of added storage overhead (from 8 bytes to 10 bytes, from 16 bytes to 18 bytes), and BigInts in JavaScript should be relatively cheap to create, so that shouldn't add too much overhead when dealing with parsers oblivious to this protocol. The only slightly tricky thing is that in that case, when filtering out the delimiters afterwards, the 16-bytes option would require skipping two delimiter objects for each "real" object in our list.

Again, just some thoughts on how to support this with relatively low complexity without actually having to change the Muon format.

@vshymanskyy
Copy link
Owner

i'm not opposed to adding a specialized tag for this. we have more than enough unallocated markers. just need to understand the rationale

@vshymanskyy
Copy link
Owner

@davebenson waiting for your inputs here

@davebenson
Copy link
Author

davebenson commented Aug 20, 2022 via email

@vshymanskyy vshymanskyy closed this as not planned Won't fix, can't repro, duplicate, stale Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants