
added buffered reading to tokenizer #46

Open · daggaz wants to merge 6 commits into master

Conversation

daggaz (Owner) commented Jun 14, 2023

In response to #45

smheidrich (Contributor) commented Jul 2, 2023

As you already mentioned in #45 (comment), this partly intersects with the changes that would be needed to fix #30. To be more precise, not doing any buffering is the simplest (but also least performant, at least in Rust) way of fixing #30, as I mentioned here, and my suggested fix for #30 (smheidrich/py-json-stream-rs-tokenizer#50) uses it when cursor positions in sync with the tokenization progress are requested but the underlying Python stream isn't seekable.

So for the Rust tokenizer I'm currently thinking about merging smheidrich/py-json-stream-rs-tokenizer#50 first (except with the new constructor parameter correct_cursor introduced there as keyword-only), because it would facilitate both the introduction of user-defined buffering like the one here and the fix for #30. But it would also make the two tokenizers' code structures diverge a bit more, as buffering in smheidrich/py-json-stream-rs-tokenizer#50 is handled in a separate struct instead of directly as part of the tokenizer. I guess some divergence has to be expected anyway, though, so I'm not sure how bad this would be.
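
For illustration, here is a rough Python sketch of the design being discussed: buffering kept in a separate reader object rather than inside the tokenizer itself, with a cursor that tracks how much the tokenizer has actually consumed rather than how far the buffer has read ahead. The names are hypothetical and this is not the code from either PR.

```python
import io

# Rough sketch only: buffering handled outside the tokenizer in a separate
# object, with a cursor in sync with tokenization progress rather than the
# read-ahead position. Names are invented, not json-stream internals.
class BufferedCharSource:
    def __init__(self, stream, buffering=4096):
        self.stream = stream        # underlying file-like object (text mode)
        self.buffering = buffering  # chunk size to read ahead
        self.buffer = ""
        self.index = 0              # next unread position within the buffer
        self.consumed = 0           # characters handed to the tokenizer so far

    def next_char(self):
        """Return the next character, or '' at end of stream."""
        if self.index >= len(self.buffer):
            self.buffer = self.stream.read(self.buffering)
            self.index = 0
            if not self.buffer:
                return ""
        char = self.buffer[self.index]
        self.index += 1
        self.consumed += 1
        return char

    def cursor(self):
        """Position in the underlying stream matching tokenization progress,
        meaningful even when the stream itself is not seekable."""
        return self.consumed


# Tiny usage example:
source = BufferedCharSource(io.StringIO('{"a": 1}'))
while source.next_char():
    pass
assert source.cursor() == len('{"a": 1}')
```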

daggaz (Owner, Author) commented Jul 3, 2023

@smheidrich
So, I changed the GitHub build process to run the whole test suite both with and without the Rust tokenizer, and in doing so I discovered a hidden bug in the Python tokenizer: it wasn't completing the state machine when the buffer was empty and the last state didn't advance the stream.

This was fixed in this commit.

If you've already ported this code, you will also have ported this bug!
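
For context, a minimal sketch of the failure mode described above (the `step()` transition function and state names are invented, not json-stream's real state machine): if the read loop only steps the state machine while the buffer is non-empty, a final transition that does not consume a character never runs at end of input, leaving the state machine incomplete.

```python
# Hypothetical sketch of the bug pattern. The buggy version exited as soon
# as read() returned nothing, so a final transition that did not consume a
# character never ran; the version below keeps stepping with a sentinel.
END_OF_STREAM = None

def tokenize(stream, step, buffering=1024):
    state = "start"
    buffer = ""
    index = 0
    while state != "complete":
        if index >= len(buffer):
            buffer = stream.read(buffering)
            index = 0
        # Pass a sentinel once the stream is exhausted so that states which
        # don't need another character can still finish.
        char = buffer[index] if buffer else END_OF_STREAM
        state, advanced = step(state, char)
        if advanced:
            index += 1
```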

daggaz (Owner, Author) commented Jul 3, 2023

I have also, in response to your comment in the Rust repo, committed a proposed new interface for rust_tokenizer_or_raise().

smheidrich (Contributor) commented Jul 12, 2023

All right, so I've finally gotten around to writing the parallel PR to this one for the Rust tokenizer: smheidrich/py-json-stream-rs-tokenizer#87

I tested it locally with the test case you modified here and there are no errors, so I guess it basically works? There are a lot of different cases, though, depending on e.g. whether the underlying Python stream returns strings or bytes, whether it's seekable, etc., so I might write another test on my end for those.

UPDATE: Tests on my side are done now as well.
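
For a rough idea of that case matrix (not the actual tests from either repo), a sketch along these lines would cover text vs. bytes and seekable vs. non-seekable streams. It assumes json_stream.load() accepts any file-like object exposing read(); the NonSeekable wrapper and the DOC document are invented for the example.

```python
import io

import pytest
import json_stream  # assumes the json-stream package is installed

DOC = '{"a": [1, 2, 3], "b": "text"}'

class NonSeekable:
    """Minimal wrapper that hides seeking on an underlying bytes stream."""
    def __init__(self, data):
        self._raw = io.BytesIO(data)
    def read(self, size=-1):
        return self._raw.read(size)
    def seekable(self):
        return False

@pytest.mark.parametrize("make_stream", [
    lambda: io.StringIO(DOC),                 # text, seekable
    lambda: io.BytesIO(DOC.encode()),         # bytes, seekable
    lambda: NonSeekable(DOC.encode()),        # bytes, non-seekable
])
def test_tokenizes_all_stream_kinds(make_stream):
    data = json_stream.load(make_stream())
    assert list(data["a"]) == [1, 2, 3]
```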
