
Add option to receive strings as file-like streams #45

Open
RamiAwar opened this issue Jun 3, 2023 · 22 comments
@RamiAwar

RamiAwar commented Jun 3, 2023

I noticed this in the future work section: supporting long string values as streams themselves. Is this already implemented by any chance?

If not, do you have a plan on how you're going to implement it? Maybe you can point me in the right direction and I can give it a try.

My use-case is streaming chunks of a JSON string and reading the key values in a "persistent" way as described in your docs, but also reading partial values. For example, if the full document is {"name": "John Doe"}, I might first receive {"name": "J and later ohn Doe"},
and I want to be able to iterate over it in Python like:

for x in response["name"]:
    print(x)

# J
# ohn Doe

Or some similar API that achieves the same thing.

@daggaz
Owner

daggaz commented Jun 9, 2023

Hey, thanks for the issue.

If you're looking for something that streams chunks exactly as they were chunked into network packets, then this isn't going to be the library for you.

However, assuming what you really want is for partially transmitted string data (embedded in JSON) to be made available as soon as possible, then something could be done, as per my "future work" idea.

I think the best API would be to return a file-like object in place of the string. You can then read it like any file, and it will hit EOF when that particular JSON string is complete.

The API would be something like:

data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
while s := long_string.read(1024):
    print(s)

I will think about this...

@daggaz
Owner

daggaz commented Jun 9, 2023

@smheidrich I feel this might end up in significant changes to the python tokenizer, which implies work for you, so I'll only proceed if you think you can support this in the rust tokenizer too.

@smheidrich
Contributor

@daggaz Sure, that looks like it should be possible 👍

@daggaz
Copy link
Owner

daggaz commented Jun 14, 2023

In pursuance of this, I have just created two PRs.

The first, #46, changes the tokenizer to read the stream into a buffer instead of reading one character at a time.

The second builds on this buffering to allow strings to be returned as file-like objects, basically implementing the API above.

@smheidrich there are some pretty big changes here so we can only proceed if you agree

@smheidrich
Contributor

@daggaz I'll have a look as soon as possible and try to write an analogous MR, but I have to do some other stuff first (namely fix the build and align the Unicode surrogate pair code with #42), so it will take a while.

@RamiAwar
Author

RamiAwar commented Jun 16, 2023

Thanks for the quick response! A file-like will get the job done for me. Much cleaner than the quick implementation I went for.

Would it be possible to also wrap this file-like obj with a generator? Then it's even easier to consume. Do you see any downsides with that vs the file-like read?
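For reference, wrapping a file-like object in a generator needs only the stdlib. The sketch below is illustrative: iter_chunks and the StringIO stand-in are not part of json-stream, just a guess at what such a wrapper could look like.

```python
from functools import partial
from io import StringIO

def iter_chunks(fileobj, chunk_size=1024):
    """Yield successive chunks from any file-like object until EOF."""
    return iter(partial(fileobj.read, chunk_size), '')

# StringIO stands in for the file-like object the library would return:
long_string = StringIO("John Doe")
for chunk in iter_chunks(long_string, chunk_size=4):
    print(chunk)
# John
#  Doe
```

The main trade-off versus the raw read() loop is that a generator fixes the chunk size up front, while a file-like object lets the consumer choose how much to read at each call.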

@daggaz
Owner

daggaz commented Jun 16, 2023

@smheidrich it occurs to me that this option should only be available for transient streams (or...we make them seekable in that case?) and that this has implications for #30

@daggaz
Owner

daggaz commented Jun 16, 2023

@RamiAwar line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this. The API would then be:

data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
for line in long_string:
    print(line)

If your data is not line based, obviously this doesn't work, in which case I think having access to the full file API is best.

@RamiAwar
Author

> line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this.

I like that yeah! Thanks!

@daggaz daggaz changed the title Question: Is there support for streaming string values? Add option to receive strings as file-like streams Jun 26, 2023
@daggaz
Owner

daggaz commented Jun 26, 2023

I've added the readline() support (and hence the iterator support).

@daggaz
Owner

daggaz commented Jun 26, 2023

@smheidrich the tests will not pass on this branch until the rust tokenizer is implemented.

@daggaz
Owner

daggaz commented Jun 26, 2023

@smheidrich I'm considering the .completed part of the API optional. If you don't want to implement it, I can strip it out of the tests

@daggaz
Owner

daggaz commented Jun 26, 2023

Also, totally open to ideas/suggestions/improvements!

@daggaz
Owner

daggaz commented Jun 30, 2023

@smheidrich I've just updated the two PRs with some changes.

I've added a buffering argument with similar semantics to the open() method. This allows the end user to control the buffer size and in particular, pass 0 to disable buffering entirely (this led to the same issue as described in #49).

Of course, this causes everything to crash with TypeError: RustTokenizer.__new__() got an unexpected keyword argument 'buffering'.

It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.

Perhaps it can be made to ignore unknown args?

@smheidrich
Contributor

smheidrich commented Jun 30, 2023

@daggaz Sorry, I was pretty busy this week and have only started to wrap my head around the changes.

It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.

Perhaps it can be made to ignore unknown args?

I'm not sure that would be a good idea; it would silently swallow potential errors.

If the goal is to be able to release this feature and others like it without waiting for the Rust tokenizer to have it too, wouldn't it make more sense to implement a switch in json-stream's tokenizer selection logic that looks at whether buffering or strings_as_files were requested? Right now, default_tokenizer is just set to one of the two tokenizers, but an alternative idea would be to make default_tokenizer itself a function that decides based on the supplied arguments which tokenizer to return. Then once the features have been implemented in the Rust tokenizer, the selection logic can change accordingly.
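The selection logic described above could look roughly like this. Everything here is an illustrative stand-in (the tokenizer functions, the feature set, the return values), not the actual json-stream internals.

```python
# Sketch of argument-based tokenizer selection: fall back to the pure-Python
# tokenizer whenever a requested feature is missing from the Rust tokenizer.

def python_tokenizer(stream, **kwargs):
    return ("python", kwargs)  # stand-in for the pure-Python tokenizer

def rust_tokenizer(stream, **kwargs):
    return ("rust", kwargs)  # stand-in for json-stream-rs-tokenizer

# Features the Rust tokenizer does not (yet) implement:
RUST_UNSUPPORTED = {"buffering", "strings_as_files"}

def default_tokenizer(stream, **kwargs):
    if any(k in kwargs for k in RUST_UNSUPPORTED):
        return python_tokenizer(stream, **kwargs)
    return rust_tokenizer(stream, **kwargs)

print(default_tokenizer(None)[0])                         # rust
print(default_tokenizer(None, strings_as_files=True)[0])  # python
```

Once the Rust tokenizer gains a feature, the corresponding entry is simply removed from the unsupported set.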

@daggaz
Owner

daggaz commented Jun 30, 2023

Oh, absolutely! That's a great plan.

@daggaz
Owner

daggaz commented Jul 2, 2023

I was thinking about this a bit further, and thought that maybe it should be up to the version of the rust tokenizer to decide if it supports x or y.

Perhaps rust_tokenizer_or_raise() could take a list of feature flags, and only return a tokenizer if it supports all the flags?

That way the python code doesn't need to understand the state of the rust tokenizer's feature support at all, especially if for example a person doesn't have the latest version?

@smheidrich
Contributor

smheidrich commented Jul 2, 2023

@daggaz Sure, that's not a problem from my end, and I can see the advantage: it avoids json-stream releases that only bump the json-stream-rs-tokenizer dependency version and change the selection logic. But the disadvantage is that you'd then always have to wait for me to add such flags to rust_tokenizer_or_raise when you introduce a new feature that I'm lagging behind on, whereas if it's all decided on the json-stream end, you can do what you want.

But I guess it's not that important, since such cases will probably be rare in the future, so I'll just merge it for now unless you have other suggestions for what the API of that call should look like.

@daggaz
Owner

daggaz commented Jul 2, 2023

Ah, no, that's not quite what I meant, I mean that you take the list of requested feature flags (or just the kwargs that control them) and raise if there are any you don't support/understand.

This also means that you could support things ahead of the python tokenizer.
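The capability check being proposed could be sketched as follows: the Rust side inspects the requested feature kwargs and raises on anything it doesn't understand. All names here are illustrative, not the real json-stream-rs-tokenizer API.

```python
# Hypothetical feature set for some version of the Rust tokenizer:
SUPPORTED_FEATURES = {"buffering"}

class FeatureNotSupported(Exception):
    pass

def make_rust_tokenizer(stream, **kwargs):
    return ("rust", kwargs)  # stand-in for the real constructor

def rust_tokenizer_or_raise(stream, **requested):
    # Raise if the caller asks for any feature this version doesn't know.
    unsupported = set(requested) - SUPPORTED_FEATURES
    if unsupported:
        raise FeatureNotSupported(
            "Rust tokenizer does not support: " + ", ".join(sorted(unsupported))
        )
    return make_rust_tokenizer(stream, **requested)
```

With this scheme, the json-stream side just attempts the Rust tokenizer first and falls back to the Python tokenizer on FeatureNotSupported, without needing to know which features the installed Rust tokenizer version supports.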

@smheidrich
Contributor

Ohh I see, that makes more sense 👍 I've updated the PR

@daggaz
Owner

daggaz commented Jul 5, 2023

My changes to run all the tests with both tokenizers have highlighted a bug in this PR.

Working on a fix!

@nacho00112

@daggaz Sorry for the tag, but I need this feature. For now, my workaround is splitting my strings into a list.
