
added buffered reading to tokenizer #46

Open · daggaz wants to merge 6 commits into master

Conversation

daggaz (Owner) commented Jun 14, 2023

In response to #45

smheidrich (Contributor) commented Jul 2, 2023

As you already mentioned in #45 (comment), this partly intersects with the changes that would be needed to fix #30. To be more precise, not doing any buffering is the simplest (but also least performant, at least in Rust) way of fixing #30, as I mentioned here, and my suggested fix for #30 (smheidrich/py-json-stream-rs-tokenizer#50) uses it when cursor positions in sync with the tokenization progress are requested but the underlying Python stream isn't seekable.

So for the Rust tokenizer I'm currently thinking about merging smheidrich/py-json-stream-rs-tokenizer#50 first (except with the new constructor parameter correct_cursor introduced there as keyword-only), because it would facilitate both the introduction of user-defined buffering like the one here and the fix for #30. But it would also make the two tokenizers' code structures diverge a bit more, as buffering in smheidrich/py-json-stream-rs-tokenizer#50 is handled in a separate struct instead of directly as part of the tokenizer. I guess some divergence has to be expected anyway, though, so I'm not sure how bad this would be.
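
For illustration, here is a rough Python sketch of the design being discussed: buffering kept in a separate reader object rather than inside the tokenizer itself, with a cursor that tracks how much the tokenizer has actually consumed rather than how far the buffer has read ahead. The names are hypothetical and this is not the code from either PR.

```python
import io

# Rough sketch only: buffering handled outside the tokenizer in a separate
# object, with a cursor in sync with tokenization progress rather than the
# read-ahead position. Names are invented, not json-stream internals.
class BufferedCharSource:
    def __init__(self, stream, buffering=4096):
        self.stream = stream        # underlying file-like object (text mode)
        self.buffering = buffering  # chunk size to read ahead
        self.buffer = ""
        self.index = 0              # next unread position within the buffer
        self.consumed = 0           # characters handed to the tokenizer so far

    def next_char(self):
        """Return the next character, or '' at end of stream."""
        if self.index >= len(self.buffer):
            self.buffer = self.stream.read(self.buffering)
            self.index = 0
            if not self.buffer:
                return ""
        char = self.buffer[self.index]
        self.index += 1
        self.consumed += 1
        return char

    def cursor(self):
        """Position in the underlying stream matching tokenization progress,
        meaningful even when the stream itself is not seekable."""
        return self.consumed


# Tiny usage example:
source = BufferedCharSource(io.StringIO('{"a": 1}'))
while source.next_char():
    pass
assert source.cursor() == len('{"a": 1}')
```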

daggaz (Owner, Author) commented Jul 3, 2023

@smheidrich
So, I changed the GitHub build process to run the whole test suite both with and without the Rust tokenizer, and in doing so I discovered a hidden bug in the Python tokenizer: it wasn't completing the state machine when the buffer was empty and the last state didn't advance the stream.

This was fixed in this commit.

If you've already ported this code, you will also have ported this bug!
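
For context, a minimal sketch of the failure mode described above (the `step()` transition function and state names are invented, not json-stream's real state machine): if the read loop only steps the state machine while the buffer is non-empty, a final transition that does not consume a character never runs at end of input, leaving the state machine incomplete.

```python
# Hypothetical sketch of the bug pattern. The buggy version exited as soon
# as read() returned nothing, so a final transition that did not consume a
# character never ran; the version below keeps stepping with a sentinel.
END_OF_STREAM = None

def tokenize(stream, step, buffering=1024):
    state = "start"
    buffer = ""
    index = 0
    while state != "complete":
        if index >= len(buffer):
            buffer = stream.read(buffering)
            index = 0
        # Pass a sentinel once the stream is exhausted so that states which
        # don't need another character can still finish.
        char = buffer[index] if buffer else END_OF_STREAM
        state, advanced = step(state, char)
        if advanced:
            index += 1
```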

daggaz (Owner, Author) commented Jul 3, 2023

I have also, in response to your comment in the Rust repo, committed a proposed new interface for rust_tokenizer_or_raise().

smheidrich (Contributor) commented Jul 12, 2023

All right, so I've finally gotten around to writing the parallel PR to this one for the Rust tokenizer: smheidrich/py-json-stream-rs-tokenizer#87

I tested it locally with the test case you modified here and there are no errors, so I guess it basically works? There are a lot of different cases, though, depending on e.g. whether the underlying Python stream returns strings or bytes, whether it's seekable, etc., so I might write another test on my end for those.

UPDATE: Tests on my side are done now as well.
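
For a rough idea of that case matrix (not the actual tests from either repo), a sketch along these lines would cover text vs. bytes and seekable vs. non-seekable streams. It assumes json_stream.load() accepts any file-like object exposing read(); the NonSeekable wrapper and the DOC document are invented for the example.

```python
import io

import pytest
import json_stream  # assumes the json-stream package is installed

DOC = '{"a": [1, 2, 3], "b": "text"}'

class NonSeekable:
    """Minimal wrapper that hides seeking on an underlying bytes stream."""
    def __init__(self, data):
        self._raw = io.BytesIO(data)
    def read(self, size=-1):
        return self._raw.read(size)
    def seekable(self):
        return False

@pytest.mark.parametrize("make_stream", [
    lambda: io.StringIO(DOC),                 # text, seekable
    lambda: io.BytesIO(DOC.encode()),         # bytes, seekable
    lambda: NonSeekable(DOC.encode()),        # bytes, non-seekable
])
def test_tokenizes_all_stream_kinds(make_stream):
    data = json_stream.load(make_stream())
    assert list(data["a"]) == [1, 2, 3]
```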
