Input stream unusable after reading a JSON document #47

Closed
daggaz opened this issue Dec 5, 2022 · 8 comments · Fixed by #50

Comments

daggaz (Collaborator) commented Dec 5, 2022

Hey,

So I was playing with the following code:

from io import StringIO

from json_stream.tokenizer import tokenize

import json_stream


def json_document_iterator(f):
    # Keep handing the same stream back to load(); it raises
    # StopIteration once the stream is exhausted.
    try:
        while True:
            # explicitly use the pure-Python tokenizer
            yield json_stream.load(f, tokenizer=tokenize)
    except StopIteration:
        pass


data = """
{"bob": 1, "bobby": 4}
{"bobo": 3}
"""


f = StringIO(data)
for document in json_document_iterator(f):
    for k, v in document.items():
        print(f"{k} = {v}")
    print("end of document")

Output:

bob = 1
bobby = 4
end of document
bobo = 3
end of document

If I use the default (rust) tokenizer instead, I only get the first document.

It appears the whole stream is consumed by the Rust tokenizer before returning?

This prints nothing:

f = StringIO(data)
list(json_stream.load(f))  # default (Rust) tokenizer this time
print(f.read())  # prints nothing: the readahead consumed the whole stream
smheidrich (Owner) commented Dec 5, 2022

Good point, I hadn't thought about that at all.

It seems like this is caused by the additional buffering that happens in Rust. A trivial way to prevent it is to read characters one at a time from the passed-in Python stream, but doing that makes performance tank pretty hard: the speedup drops to a mere ~2-3x on my machine (and even lower in CI, it seems). That makes sense, because reading from the Python stream in Rust currently involves calling its read() method just as if this were done in Python, so doing that once per character is pretty bad.

I guess another idea would be to let the buffering overshoot while parsing and rewind the Python stream back to the end of the document using seek() when it's done. But of course, that doesn't work for all streams because not all of them are seekable.

So maybe both solutions should be combined:

  • The tokenizer checks whether the stream is seekable. If so, it uses the default (quite large) buffer size for maximum performance and seek()s back to the end of the document when done.
  • If the stream is not seekable, it reads only one character at a time...
  • ... unless the user asks via some parameter (buffer_unseekable=True or something) to override this behavior, which would only be useful if someone has a stream that is not seekable, is definitely not interested in reading it beyond the end of a JSON document, and wants maximum performance.

Does that sound reasonable?

This seems like a problem that others must have had before so maybe I'll do a bit of searching before deciding.
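
A rough sketch of that three-way dispatch (illustrative Python only; choose_read_size and buffer_unseekable are placeholder names, not an actual API):

from io import DEFAULT_BUFFER_SIZE


def choose_read_size(stream, buffer_unseekable=False):
    # Seekable stream: buffer freely for speed; we can seek() back to
    # the end of the document once parsing is done.
    if stream.seekable():
        return DEFAULT_BUFFER_SIZE
    # Unseekable stream: read one character at a time so the cursor
    # never overshoots the end of the document...
    if not buffer_unseekable:
        return 1
    # ...unless the caller explicitly trades cursor correctness for speed.
    return DEFAULT_BUFFER_SIZE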

smheidrich (Owner) commented Dec 6, 2022

Ugh... I was wondering whether Python's seek() on text streams deals in bytes or characters and found this instead (emphasis mine):

In text files (those opened without a b in the mode string), only seeks relative to the beginning of the file are allowed (the exception being seeking to the very file end with seek(0, 2)) and the only valid offset values are those returned from the f.tell(), or zero. Any other offset value produces undefined behaviour.

AFAICT that makes the above idea quite complicated to implement, at least if we want to be correct and not rely on implementation details that make the undefined behavior well-defined in practice...
If we read() chunks of several characters at once, then the only positions tell() can report are the boundaries of those chunks, and the end of the document will not in general fall on such a boundary. So I guess the only option is to call tell() around every read() and remember the start position of the last chunk. Once the end of the document is reached, seek() back to the beginning of that chunk, then read() exactly as many characters as were consumed from it up to the end of the document (a character count we also have to keep track of), so that the cursor lands right after the document.

And for binary streams we'd need separate logic that skips all this overcomplicated nonsense and just applies the idea from the comment above 1:1.
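
For text streams, the bookkeeping described above could look roughly like this (illustrative only; the names are made up):

class CheckpointingTextReader:
    # Illustrative only: read a text stream in chunks while remembering
    # enough tell() positions to put the cursor right after the last
    # character actually consumed.

    def __init__(self, stream, chunk_size=8192):
        self.stream = stream
        self.chunk_size = chunk_size
        self.chunk_start = stream.tell()  # a position tell() gave us
        self.chunk = ""
        self.consumed = 0                 # characters consumed from chunk

    def read_char(self):
        if self.consumed >= len(self.chunk):
            self.chunk_start = self.stream.tell()
            self.chunk = self.stream.read(self.chunk_size)
            self.consumed = 0
            if not self.chunk:
                return ""  # EOF
        char = self.chunk[self.consumed]
        self.consumed += 1
        return char

    def rewind_to_consumed(self):
        # Seek back to the start of the last chunk (safe, because that
        # offset came from tell()), then re-read exactly the characters
        # we consumed so the cursor lands just after them.
        self.stream.seek(self.chunk_start)
        self.stream.read(self.consumed)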

Will have to think about this again and search some more.

daggaz (Collaborator, Author) commented Dec 6, 2022

This does indeed seem to be a pain. The Python implementation always uses read(1), hence why the behaviour is so different.

I wonder if there's a more complex API we can implement for the tokenizer interface that will support returning a remainder.

If we moved the multi-document (JSON-RPC style) idea into json_stream.load() it could handle concatenating the remainder with the rest of the read stream.

Need to think of an API for load() that allows for multiple documents, or perhaps a new API json_stream.load_many() or something.
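
Just to illustrate the remainder idea (load_many and RemainderChain are hypothetical here, nothing like this exists yet): the tokenizer would hand back whatever it over-read, and the loader would stitch it onto the front of the stream before parsing the next document, e.g. with a small wrapper like this:

from io import StringIO


class RemainderChain:
    # Hypothetical: present a leftover string (the tokenizer's unread
    # buffer) followed by the rest of the original stream as a single
    # text stream, so the next load() sees contiguous input.
    def __init__(self, remainder, stream):
        self._head = StringIO(remainder)
        self._tail = stream

    def read(self, size=-1):
        data = self._head.read(size)
        if size is None or size < 0:
            return data + self._tail.read()
        if len(data) < size:
            data += self._tail.read(size - len(data))
        return data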

smheidrich (Owner) commented Dec 6, 2022

I wouldn't put it in load, that seems even messier. The tokenizer's behavior of leaving the cursor in a place completely unrelated to the end of what it just parsed seems "wrong" to me regardless of whether it has to be avoided to support multiple documents, so IMO that should be fixed here no matter what.

I had some ideas for abstractions that will hopefully make this look reasonably clean in the code so I'll get started today.

smheidrich (Owner) commented Dec 14, 2022

After a week of fighting with Rust I'm finally making progress in #50. But when I wrote the comments above I clearly wasn't thinking straight because there is of course no way to solve the issue in the tokenizer only, because the tokenizer doesn't even know when the document ends... So this will need support from the "outside" (json-stream) after all.

My currently favored approach would be this: RustTokenizer will get a method park_cursor() (open to better names, or maybe that should just be the __exit__() of a context manager?) that undoes the readahead buffering if possible by seek()ing the underlying Python stream back to the position after the last processed (not read!) character. json-stream calls this whenever it reaches the end of a JSON document, so the cursor of the underlying stream is brought back there. If seeking is not possible, this will be a no-op because RustTokenizer will have fallen back to reading one character at a time anyway, so json-stream can call it unconditionally.

What's still missing from #50 is support for Python byte streams (currently only supports text streams) and the 1-character-at-time fallback for unseekable streams as described above, but I think neither of these will be difficult.
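
On the json-stream side, the calling pattern might then look something like this (sketch only; it assumes load() simply calls the given tokenizer factory on the stream, as with the pure-Python tokenize in the snippet at the top):

import json_stream
from json_stream_rs_tokenizer import RustTokenizer


def iter_documents(stream):
    # Sketch: parse one document after another from the same stream,
    # parking the cursor after each so the stream stays reusable.
    while True:
        tokenizer = RustTokenizer(stream)
        try:
            document = json_stream.load(stream, tokenizer=lambda _: tokenizer)
        except StopIteration:
            return  # stream exhausted
        yield document              # caller consumes the document here
        tokenizer.park_cursor()     # undo readahead; no-op if unseekable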

smheidrich (Owner) commented Dec 14, 2022

@daggaz By the way, you have the exact same issue with the pure-Python tokenizer if someone gives it a byte stream, because the Python tokenizer then wraps it in a TextIOWrapper which also buffers ahead by 8 KiB, even if you do read(1). You can replace StringIO with BytesIO and the string with a byte string in your original snippet to demonstrate this.
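
i.e. the byte-stream variant of the snippet at the top, where only the first document comes through:

from io import BytesIO

from json_stream.tokenizer import tokenize

import json_stream

data = b"""
{"bob": 1, "bobby": 4}
{"bobo": 3}
"""

f = BytesIO(data)
try:
    while True:
        # tokenize() wraps the byte stream in a TextIOWrapper whose
        # readahead swallows the rest of the stream, so only the first
        # document can be read
        for k, v in json_stream.load(f, tokenizer=tokenize).items():
            print(f"{k} = {v}")
        print("end of document")
except StopIteration:
    pass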

I was considering just not implementing handling for byte streams in Rust and instead wrapping them in a TextIOWrapper as well before they make it into the RustTokenizer, but given the above, that's probably not a good idea, unless we find a way to fix TextIOWrapper somehow or find a non-buffering alternative. (That should be possible for valid UTF-8, because the first byte of a character tells you how many more bytes you have to read to get the whole character; I'm not sure whether there are encodings that require reading ahead to know if you've got the whole character, but I guess they wouldn't matter.) Although I guess an unbuffered TextIOWrapper would be just as slow as doing read(1) from Rust again, so that would help the pure-Python tokenizer but not the Rust one...

TextIOWrapper has a hidden attribute to set the buffer chunk size, but if I set it there are still other remaining issues ("I/O operation on closed file" exception because it closes the underlying buffer when it's cleaned up... annoying).

smheidrich (Owner) commented

@daggaz Merry Christmas 🎄 & sorry it took so long but I'm finally pretty much done with #50, all that's needed now are decisions on the interface by which json-stream can use it.

I went with the park_cursor approach described above for seekable streams. For unseekable streams, the default is the (really slow) approach that reads 1 character or byte at a time; users can choose buffered reads instead (if they're fine with the cursor ending up wherever) by instantiating the tokenizer as RustTokenizer(stream, correct_cursor=False). I also added a remainder property, similar to your suggestion above, that works with any approach, in case you want to go with that instead or an end user needs it because they went with correct_cursor=False.

Because RustTokenizer has gained 2 new methods/properties and 1 new constructor parameter with this, I took the opportunity to make its docstrings work properly, so if you have the version from this PR's branch installed, you can do help("json_stream_rs_tokenizer.RustTokenizer") to view docs for all of these things.

I also opened a PR showing an example of how park_cursor could be used in json-stream: daggaz/json-stream#31
But you'll probably have a better idea that doesn't involve putting levels=something everywhere...
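
For reference, a rough usage sketch of the new surface (all names as described above; how json-stream actually combines park_cursor, remainder, and correct_cursor is exactly the open interface question):

from io import StringIO

from json_stream_rs_tokenizer import RustTokenizer

stream = StringIO('{"bob": 1, "bobby": 4}')

# Default: correct cursor handling (buffered reads + park_cursor() for
# seekable streams, 1-character/1-byte reads for unseekable ones).
tokenizer = RustTokenizer(stream)

# Opt-out for unseekable streams where performance matters more than
# where the cursor ends up; .remainder can still recover whatever was
# read past the last processed character.
fast_tokenizer = RustTokenizer(stream, correct_cursor=False)

# Docstrings for the new parameter, park_cursor() and remainder:
help("json_stream_rs_tokenizer.RustTokenizer")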

daggaz (Collaborator, Author) commented Jan 6, 2023

Hey,

Merry Christmas and a Happy New Year to you.

With work and holidays and family, I've not had a moment to look at this.

I promise to look at this properly soon!

Sorry for the holding message...
