Input stream unusable after reading a JSON document #47
Good point, I hadn't thought about that at all. It seems like this is caused by the additional buffering that happens in Rust, and a trivial way to prevent it is to read characters one at a time from the passed-in Python stream. Unfortunately, doing that makes performance tank pretty hard: the speedup drops to a mere ~2-3x on my machine (and even lower in CI, it seems). That makes sense, because reading from the Python stream in Rust currently involves calling its `read` method, so per-character reads mean paying the Python call overhead on every single character.

I guess another idea would be to let the buffering overshoot while parsing and rewind the Python stream back to the end of the document using `seek()`.

So maybe both solutions should be combined: buffer and rewind via `seek()` when the stream is seekable, and fall back to one-character-at-a-time reads when it isn't.
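For a seekable byte stream, the overshoot-and-rewind idea could look roughly like this (a minimal sketch, not the actual implementation; `find_document_end` is a toy stand-in for the parser knowing where the document stopped):

```python
import io

def find_document_end(buf: bytes):
    """Toy stand-in for the real parser: locate the end of the first
    top-level {...} object by brace counting (ignores braces in strings)."""
    depth = 0
    for i, byte in enumerate(buf):
        if byte == ord("{"):
            depth += 1
        elif byte == ord("}"):
            depth -= 1
            if depth == 0:
                return i + 1
    return None

def read_one_document(stream, chunk_size=8192):
    """Read one JSON document, over-reading in big chunks for speed, then
    seek() the stream back so it sits exactly at the document's end."""
    start = stream.tell()
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return buf  # EOF before a complete document
        buf += chunk
        end = find_document_end(buf)
        if end is not None:
            stream.seek(start + end)  # rewind past the over-read bytes
            return buf[:end]

stream = io.BytesIO(b'{"a": 1}{"b": 2}')
print(read_one_document(stream))  # b'{"a": 1}'
print(stream.read())              # b'{"b": 2}' -- the stream is still usable
```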
Does that sound reasonable? This seems like a problem that others must have had before, so maybe I'll do a bit of searching before deciding.
Ugh... I was wondering whether Python's text streams even support seeking to arbitrary positions, and as far as I can tell they don't: for text streams, `tell()` returns an opaque cookie rather than a character offset, and `seek()` is only defined for values previously returned by `tell()` (or zero).
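A quick demonstration of the cookie behavior (standard `io` semantics, nothing specific to json-stream):

```python
import io

raw = io.BytesIO("héllo wörld".encode("utf-8"))
text = io.TextIOWrapper(raw, encoding="utf-8")

text.read(3)           # read 3 *characters* ("hél")
cookie = text.tell()   # an opaque cookie, not a character count
print(cookie)          # 4 on CPython (a byte-based value), but unspecified
text.seek(cookie)      # fine: round-tripping a tell() cookie is defined
# text.seek(3)         # formally undefined: 3 was never returned by tell()
```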
AFAICT that makes the above idea quite complicated to implement, at least if we want to be correct and not rely on implementation details that make the undefined behavior well-defined in practice... And for binary streams we'd need separate logic that skips all this overcomplicated nonsense and just applies the idea from the comment above 1:1. Will have to think about this again and search some more.
This does indeed seem to be a pain. The Python implementation always reads one character at a time, so it never consumes more than the document it's parsing. I wonder if there's a more complex API we could implement for the tokenizer interface that would support returning a remainder. If we moved the multi-document (JSON-RPC style) idea into the tokenizer, the extra buffering could stay internal to it across documents. Need to think of an API for that.
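One hypothetical shape such an interface could take (all names here are invented for illustration; this is not an existing json-stream API):

```python
import io
from typing import Iterator, Tuple

class RemainderTokenizer:
    """Sketch of a tokenizer that buffers freely while parsing one document
    and exposes whatever it over-read so the caller can reuse it."""

    def __init__(self, stream: io.BufferedIOBase, chunk_size: int = 8192):
        self._stream = stream
        self._chunk_size = chunk_size
        self._overread = b""

    def tokens(self) -> Iterator[Tuple[int, object]]:
        """Yield (token_type, value) pairs for exactly one document and
        record any bytes read past its end in self._overread."""
        raise NotImplementedError  # parsing elided in this sketch

    @property
    def remainder(self) -> bytes:
        """Bytes consumed from the stream but not belonging to the document;
        the caller prepends these before tokenizing the next document."""
        return self._overread
```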
I wouldn't put it in json-stream itself. I had some ideas for abstractions that will hopefully make this look reasonably clean in the code, so I'll get started today.
After a week of fighting with Rust I'm finally making progress in #50. But when I wrote the comments above I clearly wasn't thinking straight, because there is of course no way to solve the issue in the tokenizer alone: the tokenizer doesn't even know when the document ends. So this will need support from the "outside" (json-stream) after all. My currently favored approach is to have json-stream tell the tokenizer once it's done with a document, so the tokenizer can put the underlying stream back to the position right after that document. What's still missing from #50 is support for Python byte streams (currently it only supports text streams) and the 1-character-at-a-time fallback for unseekable streams described above, but I think neither of these will be difficult.
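A minimal sketch of what that handoff could look like (illustrative only; the method name and all details are assumptions, not necessarily what #50 ended up with):

```python
import io

class ParkingTokenizer:
    """Tokenizer that over-reads in chunks but can restore the stream
    position once the consumer says the current document is finished."""

    def __init__(self, stream: io.BufferedIOBase, chunk_size: int = 8192):
        self._stream = stream
        self._chunk_size = chunk_size
        self._buf = b""
        self._buf_start = stream.tell()  # stream offset where the buffer begins
        self._consumed = 0               # bytes of the buffer the parser used

    def _fill(self) -> bool:
        """Deliberately over-read for speed; called by the tokenizing loop."""
        chunk = self._stream.read(self._chunk_size)
        self._buf += chunk
        return bool(chunk)

    def park_cursor(self) -> None:
        """Called from the outside (json-stream) after a document is done:
        seek the underlying stream back to the end of the parsed data."""
        self._stream.seek(self._buf_start + self._consumed)
        self._buf_start += self._consumed
        self._buf = b""
        self._consumed = 0
```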
@daggaz By the way, you have the exact same issue with the pure-Python tokenizer if someone gives it a byte stream, because the Python tokenizer then wraps it in a buffering text-mode reader, which reads ahead on the underlying byte stream even though the tokenizer itself only pulls one character at a time. I was considering just not implementing handling for byte streams in Rust and instead wrapping them the same way on the Python side, but of course that would inherit the same problem.
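The read-ahead is easy to see with `io.TextIOWrapper`, which buffers in large chunks no matter how little you ask for:

```python
import io

raw = io.BytesIO(b'{"a": 1}{"b": 2}')
text = io.TextIOWrapper(raw, encoding="utf-8")

text.read(8)       # pull exactly the first document's 8 characters
print(raw.tell())  # 16: the wrapper already drained the whole byte stream
```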
@daggaz Merry Christmas 🎄 & sorry it took so long, but I'm finally pretty much done with #50; all that's needed now are decisions on the interface between json-stream and the tokenizer. I went with the approach described above. I also opened a PR showing an example of how json-stream could make use of it.
Hey, Merry Christmas and a Happy New Year to you. With work and holidays and family, I've not had a moment to look at this. I promise to look properly soon! Sorry for the holding message...
Hey,
So I was playing with the following code:
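Something along these lines (a sketch, assuming json-stream's `tokenizer` argument and the pure-Python tokenizer at `json_stream.tokenizer.tokenize`):

```python
import io
import json_stream
from json_stream.tokenizer import tokenize  # pure-Python tokenizer

f = io.StringIO('{"a": 1}{"b": 2}')

# Two concatenated documents, one load() per document from the same stream
for _ in range(2):
    data = json_stream.load(f, tokenizer=tokenize)
    print(json_stream.to_standard_types(data))
```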
Output:
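With the sketch above, the pure-Python tokenizer yields both documents:

```
{'a': 1}
{'b': 2}
```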
If I use the default (Rust) tokenizer instead, I only get the first document. It appears the whole stream is consumed by the Rust tokenizer before returning?
This prints nothing:
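Again a sketch of the kind of check meant here; the point is that after a single load with the Rust tokenizer, the stream is already exhausted:

```python
import io
import json_stream

f = io.StringIO('{"a": 1}{"b": 2}')
data = json_stream.load(f)           # default (Rust) tokenizer
json_stream.to_standard_types(data)  # consume the first document
print(f.read(), end="")              # prints nothing: everything was buffered
```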