
Add option to receive strings as file-like streams #45

Open
RamiAwar opened this issue Jun 3, 2023 · 22 comments
@RamiAwar

RamiAwar commented Jun 3, 2023

I noticed this in the future work section: supporting long string values as streams themselves. Is this already implemented by any chance?

If not, do you have a plan on how you're going to implement it? Maybe you can point me in the right direction and I can give it a try.

My use-case is streaming chunks of a JSON string and reading the key values in a "persistent" way as described in your docs, but also reading partial values. For example, if the full document is {"name": "John Doe"}, I might first receive {"name": "J and later ohn Doe"},
and I want to be able to iterate over it in Python like:

for x in response["name"]:
    print(x)

# J
# ohn Doe

Or some similar API that achieves the same thing.

@daggaz
Owner

daggaz commented Jun 9, 2023

Hey, thanks for the issue.

If you're looking for something that streams chunks exactly as they were chunked into network packets, then this isn't going to be the library for you.

However, assuming what you really want is for partially transmitted string data (embedded in JSON) to be made available as soon as possible, then something could be done, as per my "future work" idea.

I think the best API would be to return a file-like object in place of the string. You can then read it like any file, and it will hit EOF when that particular JSON string is complete.

The API would be something like:

data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
while s := long_string.read(1024):
    print(s)

I will think about this...

@daggaz
Owner

daggaz commented Jun 9, 2023

@smheidrich I feel this might end up in significant changes to the python tokenizer, which implies work for you, so I'll only proceed if you think you can support this in the rust tokenizer too.

@smheidrich
Contributor

@daggaz Sure, that looks like it should be possible 👍

@daggaz
Copy link
Owner

daggaz commented Jun 14, 2023

In pursuance of this, I have just created two PRs.

The first, #46, changes the tokenizer to read the stream into a buffer instead of reading one character at a time.

The second builds on this buffering to allow strings to be returned as file-like objects, basically implementing the API above.

@smheidrich there are some pretty big changes here so we can only proceed if you agree

@smheidrich
Contributor

@daggaz I'll have a look as soon as possible and try to write an analogous MR, but I have to do some other stuff first (namely fix the build and align the Unicode surrogate pair code with #42), so it will take a while.

@RamiAwar
Author

RamiAwar commented Jun 16, 2023

Thanks for the quick response! A file-like will get the job done for me. Much cleaner than the quick implementation I went for.

Would it be possible to also wrap this file-like obj with a generator? Then it's even easier to consume. Do you see any downsides with that vs the file-like read?
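For reference, wrapping a file-like object in a generator needs only the stdlib. The sketch below is illustrative: iter_chunks and the StringIO stand-in are not part of json-stream, just a guess at what such a wrapper could look like.

```python
from functools import partial
from io import StringIO

def iter_chunks(fileobj, chunk_size=1024):
    """Yield successive chunks from any file-like object until EOF."""
    return iter(partial(fileobj.read, chunk_size), '')

# StringIO stands in for the file-like object the library would return:
long_string = StringIO("John Doe")
for chunk in iter_chunks(long_string, chunk_size=4):
    print(chunk)
# John
#  Doe
```

The main trade-off versus the raw read() loop is that a generator fixes the chunk size up front, while a file-like object lets the consumer choose how much to read at each call.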

@daggaz
Owner

daggaz commented Jun 16, 2023

@smheidrich it occurs to me that this option should only be available for transient streams (or...we make them seekable in that case?) and that this has implications for #30

@daggaz
Owner

daggaz commented Jun 16, 2023

@RamiAwar line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this. The API would then be:

data = json_stream.load(f, strings_as_files=True)
long_string = data['name']
for line in long_string:
    print(line)

If your data is not line based, obviously this doesn't work, in which case I think having access to the full file API is best.

@RamiAwar
Author

> line-based file-like objects are already iterable, but JsonStreamReader is missing the required readline() implementation. I will update to add this.

I like that yeah! Thanks!

@daggaz daggaz changed the title Question: Is there support for streaming string values? Add option to receive strings as file-like streams Jun 26, 2023
@daggaz
Owner

daggaz commented Jun 26, 2023

I've added the readline() support (and hence the iterator support).

@daggaz
Owner

daggaz commented Jun 26, 2023

@smheidrich the tests will not pass on this branch until the rust tokenizer is implemented.

@daggaz
Owner

daggaz commented Jun 26, 2023

@smheidrich I'm considering the .completed part of the API optional. If you don't want to implement it, I can strip it out of the tests

@daggaz
Owner

daggaz commented Jun 26, 2023

Also, totally open to ideas/suggestions/improvements!

@daggaz
Owner

daggaz commented Jun 30, 2023

@smheidrich I've just updated the two PRs with some changes.

I've added a buffering argument with similar semantics to the open() method. This allows the end user to control the buffer size and in particular, pass 0 to disable buffering entirely (this led to the same issue as described in #49).

Of course, this causes everything to crash with TypeError: RustTokenizer.__new__() got an unexpected keyword argument 'buffering'.

It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.

Perhaps it can be made to ignore unknown args?

@smheidrich
Contributor

smheidrich commented Jun 30, 2023

@daggaz Sorry, I was pretty busy this week and have only started to wrap my head around the changes.

It's a bit of a shame that the python tokenizer can't move forward (adding new args to control new behaviour) without upsetting the rust tokenizer.

Perhaps it can be made to ignore unknown args?

I'm not sure that would be a good idea; it would silently swallow potential errors.

If the goal is to be able to release this feature and others like it without waiting for the Rust tokenizer to have it too, wouldn't it make more sense to implement a switch in json-stream's tokenizer selection logic that looks at whether buffering or strings_as_files were requested? Right now, default_tokenizer is just set to one of the two tokenizers, but an alternative idea would be to make default_tokenizer itself a function that decides based on the supplied arguments which tokenizer to return. Then once the features have been implemented in the Rust tokenizer, the selection logic can change accordingly.
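The selection logic described above could look roughly like this. Everything here is an illustrative stand-in (the tokenizer functions, the feature set, the return values), not the actual json-stream internals.

```python
# Sketch of argument-based tokenizer selection: fall back to the pure-Python
# tokenizer whenever a requested feature is missing from the Rust tokenizer.

def python_tokenizer(stream, **kwargs):
    return ("python", kwargs)  # stand-in for the pure-Python tokenizer

def rust_tokenizer(stream, **kwargs):
    return ("rust", kwargs)  # stand-in for json-stream-rs-tokenizer

# Features the Rust tokenizer does not (yet) implement:
RUST_UNSUPPORTED = {"buffering", "strings_as_files"}

def default_tokenizer(stream, **kwargs):
    if any(k in kwargs for k in RUST_UNSUPPORTED):
        return python_tokenizer(stream, **kwargs)
    return rust_tokenizer(stream, **kwargs)

print(default_tokenizer(None)[0])                         # rust
print(default_tokenizer(None, strings_as_files=True)[0])  # python
```

Once the Rust tokenizer gains a feature, the corresponding entry is simply removed from the unsupported set.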

@daggaz
Owner

daggaz commented Jun 30, 2023

Oh, absolutely! That's a great plan.

@daggaz
Owner

daggaz commented Jul 2, 2023

I was thinking about this a bit further, and thought that maybe it should be up to the version of the rust tokenizer to decide if it supports x or y.

Perhaps rust_tokenizer_or_raise() could take a list of feature flags, and only return a tokenizer if it supports all the flags?

That way the python code doesn't need to understand the state of the rust tokenizer's feature support at all, especially if for example a person doesn't have the latest version?

@smheidrich
Contributor

smheidrich commented Jul 2, 2023

@daggaz Sure, that's not a problem from my end, and I can see the advantage: it avoids json-stream releases that only bump the json-stream-rs-tokenizer dependency version and change the selection logic. But the disadvantage is that you'd then always have to wait for me to add such flags to rust_tokenizer_or_raise when you introduce a new feature that I'm lagging behind on, whereas if it's all decided on the json-stream end, you can do what you want.

But I guess it's not that important, since such cases will probably be rare in the future, so I'll just merge it for now unless you have other suggestions for what the API of that call should look like.

@daggaz
Owner

daggaz commented Jul 2, 2023

Ah, no, that's not quite what I meant, I mean that you take the list of requested feature flags (or just the kwargs that control them) and raise if there are any you don't support/understand.

This also means that you could support things ahead of the python tokenizer.
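The capability check being proposed could be sketched as follows: the Rust side inspects the requested feature kwargs and raises on anything it doesn't understand. All names here are illustrative, not the real json-stream-rs-tokenizer API.

```python
# Hypothetical feature set for some version of the Rust tokenizer:
SUPPORTED_FEATURES = {"buffering"}

class FeatureNotSupported(Exception):
    pass

def make_rust_tokenizer(stream, **kwargs):
    return ("rust", kwargs)  # stand-in for the real constructor

def rust_tokenizer_or_raise(stream, **requested):
    # Raise if the caller asks for any feature this version doesn't know.
    unsupported = set(requested) - SUPPORTED_FEATURES
    if unsupported:
        raise FeatureNotSupported(
            "Rust tokenizer does not support: " + ", ".join(sorted(unsupported))
        )
    return make_rust_tokenizer(stream, **requested)
```

With this scheme, the json-stream side just attempts the Rust tokenizer first and falls back to the Python tokenizer on FeatureNotSupported, without needing to know which features the installed Rust tokenizer version supports.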

@smheidrich
Contributor

Ohh I see, that makes more sense 👍 I've updated the PR

@daggaz
Owner

daggaz commented Jul 5, 2023

My changes to run all the tests with both tokenizers have highlighted a bug in this PR.

Working on a fix!

@nacho00112

@daggaz Sorry for the tag, but I need this feature. For now, my workaround is splitting my strings into a list.
