Add option to receive strings as file-like streams #45
Hey, thanks for the issue. If you're looking for something that streams chunks exactly as they arrived in network packets, then this isn't going to be the library for you. However, assuming what you really want is for partially transmitted string data (embedded in JSON) to be made available as soon as possible, then something might be doable, as per my "future work" idea. I think the best API would be to return a file-like object in place of the string. You can then read it like any file, and it will hit EOF when that particular JSON string is complete. The API would be something like:

```python
data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
while s := long_string.read(1024):
    print(s)
```

I will think about this...
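To make the read-loop pattern above concrete, here is a self-contained sketch that uses `io.StringIO` as a stand-in for the hypothetical file-like object that `strings_as_files=True` would return (this is an illustration, not json-stream's actual implementation):

```python
import io

# io.StringIO stands in for the file-like object the proposed
# strings_as_files=True API would return for a JSON string value.
long_string = io.StringIO("a very long string value embedded in the JSON")

chunks = []
while s := long_string.read(8):  # read fixed-size chunks until EOF
    chunks.append(s)

print("".join(chunks))
```

The consumer sees the data in bounded chunks and never needs the whole string in memory at once.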
@smheidrich I feel this might end up requiring significant changes to the python tokenizer, which implies work for you, so I'll only proceed if you think you can support this in the rust tokenizer too.
@daggaz Sure, that looks like it should be possible 👍
In pursuance of this, I have just created two PRs. The first, #46, changes the tokenizer to read the stream into a buffer instead of reading one character at a time. The second builds on this buffering to allow strings to be returned as file-like objects, basically implementing the API above. @smheidrich there are some pretty big changes here, so we can only proceed if you agree.
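For context, the buffering idea can be sketched in a few lines (illustrative only, not the actual code in #46): read a block from the underlying stream and serve characters out of that buffer, refilling it only when exhausted.

```python
import io

class BufferedChars:
    """Toy sketch of buffered character reading over a text stream."""

    def __init__(self, stream, bufsize=8192):
        self._stream = stream
        self._bufsize = bufsize
        self._buf = ""
        self._pos = 0

    def next_char(self):
        # Refill the buffer from the stream only when it is exhausted.
        if self._pos >= len(self._buf):
            self._buf = self._stream.read(self._bufsize)
            self._pos = 0
            if not self._buf:
                return ""  # EOF
        c = self._buf[self._pos]
        self._pos += 1
        return c

reader = BufferedChars(io.StringIO("abc"), bufsize=2)
while c := reader.next_char():
    print(c)
```

The tokenizer still consumes one character at a time, but the underlying stream is only hit once per buffer-full.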
@daggaz I'll have a look as soon as possible and try to write an analogous MR, but I have to do some other stuff first (namely fix the build and align the Unicode surrogate pair code with #42), so it will take a while.
Thanks for the quick response! A file-like will get the job done for me. Much cleaner than the quick implementation I went for. Would it be possible to also wrap this file-like obj with a generator? Then it's even easier to consume. Do you see any downsides with that vs the file-like read?
@smheidrich it occurs to me that this option should only be available for transient streams (or do we make them seekable in that case?) and that this has implications for #30
@RamiAwar file-like objects are already iterable line by line, but JsonStreamReader is missing the required readline() implementation. I will update to add this. The API would then be:

```python
data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
for line in long_string:
    print(line)
```

If your data is not line-based, obviously this doesn't work, in which case I think having access to the full file API is best.
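To illustrate how readline() support enables iteration, here is a toy reader (not the real JsonStreamReader) that builds readline() and the iterator protocol on top of a read() primitive:

```python
class ChunkedStringReader:
    """Toy sketch: readline() and iteration layered over read()."""

    def __init__(self, text: str):
        self._buf = text
        self._pos = 0

    def read(self, size: int = -1) -> str:
        if size < 0:
            size = len(self._buf) - self._pos
        chunk = self._buf[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

    def readline(self) -> str:
        # Accumulate one character at a time until newline or EOF.
        line = []
        while c := self.read(1):
            line.append(c)
            if c == "\n":
                break
        return "".join(line)

    def __iter__(self):
        return self

    def __next__(self) -> str:
        line = self.readline()
        if not line:
            raise StopIteration
        return line


for line in ChunkedStringReader("first\nsecond\nthird"):
    print(line, end="")
```

Once readline() exists, `for line in obj` works with no further changes, which is the iterator support mentioned below.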
I like that yeah! Thanks!
I've added the readline() support (and hence the iterator support).
@smheidrich the tests will not pass on this branch until the rust tokenizer is implemented
@smheidrich I'm considering the
Also, totally open to ideas/suggestions/improvements!
@smheidrich I've just updated the two PRs with some changes. I've added a Of course, this causes everything to crash with It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer. Perhaps it can be made to ignore unknown args?
@daggaz Sorry, I was pretty busy this week and have only started to wrap my head around the changes.
I'm not sure that would be a good idea, swallowing potential errors silently etc... If the goal is to be able to release this feature and others like it without waiting for the Rust tokenizer to have it too, wouldn't it make more sense to implement a switch in json-stream's tokenizer selection logic that looks at whether
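A minimal sketch of that selection switch (the names and feature flags are hypothetical, not json-stream's actual API): prefer the Rust tokenizer unless the caller requests a feature only the Python tokenizer supports.

```python
# Hypothetical set of feature flags the pure-Python tokenizer supports
# but the installed Rust tokenizer does not yet.
PYTHON_ONLY_FEATURES = {"strings_as_files"}

def select_tokenizer(rust_available: bool, **feature_kwargs) -> str:
    """Pick a tokenizer based on availability and requested features."""
    requested = {k for k, v in feature_kwargs.items() if v}
    if rust_available and not (requested & PYTHON_ONLY_FEATURES):
        return "rust"
    return "python"

print(select_tokenizer(True))                         # no special features
print(select_tokenizer(True, strings_as_files=True))  # falls back to python
```

This keeps the fallback explicit in one place rather than silently swallowing errors inside the tokenizer itself.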
Oh, absolutely! That's a great plan.
I was thinking about this a bit further, and thought that maybe it should be up to the version of the rust tokenizer to decide if it supports x or y. Perhaps That way the python code doesn't need to understand the state of the rust tokenizer's feature support at all, especially if, for example, a person doesn't have the latest version?
@daggaz Sure, that's not a problem from my end, and I can see the advantage that it won't require updates to But I guess it's not that important, since such cases will probably be rare in the future, so I'll just merge it for now unless you have other suggestions for what the API of that call should look like?
Ah, no, that's not quite what I meant. I mean that you take the list of requested feature flags (or just the kwargs that control them) and raise if there are any you don't support/understand. This also means that you could support things ahead of the python tokenizer.
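That negotiation could look roughly like this (a hedged sketch; the function and feature names are illustrative, not the real API): the tokenizer validates the requested feature kwargs itself and raises on any it does not understand, letting the caller fall back.

```python
# Feature flags this (hypothetical) tokenizer build understands.
SUPPORTED_FEATURES = {"buffering", "strings_as_files"}

def make_rust_tokenizer(stream, **features):
    unknown = set(features) - SUPPORTED_FEATURES
    if unknown:
        # Raising lets the caller fall back to the Python tokenizer, and
        # also lets this tokenizer support features the caller hasn't
        # heard of yet without a hard-coded version table anywhere.
        raise TypeError(f"unsupported tokenizer features: {sorted(unknown)}")
    return ("rust-tokenizer", stream, features)  # placeholder construction

def tokenize(stream, **features):
    try:
        return make_rust_tokenizer(stream, **features)
    except TypeError:
        return ("python-tokenizer", stream, features)  # placeholder fallback
```

The key point is that feature knowledge lives in each tokenizer, not in the dispatch code.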
Ohh I see, that makes more sense 👍 I've updated the PR
My changes to run all the tests with both tokenizers have highlighted a bug with this PR. Working on a fix!
@daggaz sorry for the tag, but I need this feature, for now my workaround is
I noticed this in the future work section: supporting long string values as streams themselves. Is this already implemented by any chance?
If not, do you have a plan on how you're going to implement it? Maybe you can point me in the right direction and I can give it a try.
My use-case is streaming chunks of a JSON string and reading the key values in a "persistent" way as described in your docs, but also partial values. For example, if I were to receive

```json
{"name": "John Doe"}
```

then I first receive `{"name": "J`, then `ohn Doe"}`, and I want to be able to iterate over it in Python like:
Or some similar API that achieves the same thing.
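For example, a small generator wrapper over the file-like API proposed elsewhere in the thread could look like this (`iter_chunks` is a hypothetical helper, and `io.StringIO` stands in for the streamed string value):

```python
import io

def iter_chunks(fileobj, size=1024):
    """Yield successive chunks from a file-like string until EOF."""
    while chunk := fileobj.read(size):
        yield chunk

# io.StringIO stands in for the file-like object a strings_as_files-style
# API would hand back for the "name" value.
for chunk in iter_chunks(io.StringIO("John Doe"), size=4):
    print(chunk)
```

Each chunk becomes available as soon as that much of the string has been parsed, without waiting for the closing quote.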