Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Access to strings by reference #206

Closed
petm-dec opened this issue Mar 16, 2021 · 12 comments
Closed

Access to strings by reference #206

petm-dec opened this issue Mar 16, 2021 · 12 comments

Comments

@petm-dec
Copy link

petm-dec commented Mar 16, 2021

Hello, I created #205 as a proposal to allow direct access to the underlying buffer for (large) strings.
We use this to avoid an extra copy when processing a stream of cbor encocded images + metadata.

We use only determinate length strings for these, I have not implemented iteration over the
parts of intedeterminate length strings, as I was not sure whether to use enter/leave for those, or how this
would best fit into the existing API.

@thiagomacieira
Copy link
Member

I'm not going to accept your PR. Instead, I'll merge my own implementation, which supports chunked strings too.

@petm-dec
Copy link
Author

Ah that looks even better with the chunked string support, thanks!

@thiagomacieira
Copy link
Member

Please take a look at the API I developed. I still think it's clumsy, so I'd love feedback.

CBOR_INLINE_API CborError cbor_value_get_string_chunk_size(const CborValue *value, size_t *len)
CBOR_INLINE_API bool cbor_value_string_iteration_at_end(const CborValue *value)
CBOR_INLINE_API CborError cbor_value_begin_string_iteration(CborValue *value)
CBOR_INLINE_API CborError cbor_value_finish_string_iteration(CborValue *value)
CBOR_INLINE_API CborError cbor_value_get_text_string_chunk(const CborValue *value, const char **bufferptr,
                                                           size_t *len, CborValue *next)
CBOR_INLINE_API CborError cbor_value_get_byte_string_chunk(const CborValue *value, const uint8_t **bufferptr,
                                                           size_t *len, CborValue *next)

API docs: https://github.com/thiagomacieira/tinycbor/blob/dev/src/cborparser.c#L1068-L1147 & https://github.com/thiagomacieira/tinycbor/blob/dev/src/cborparser.c#L1231-L1295

If you have time, the ugliest part of the patch set is the cbor_parser_init_reader part and feedback there is appreciated too.

@petm-dec
Copy link
Author

I have already ported my code to this api, it is exactly as expected.
It seems in line with the rest of the api style and don't see what is clumsy about it.

I used the cbor_value_string_iteration_at_end() to verify that, in our case, we are at the end after reading exactly 1 chunk (required in our case at the moment).

The implementation seems straightforward to me, but the internals are not too familiar for me.

cbor_parser_init_reader: I see the need if you receive your messages in pieces and have little buffer space.
We receive complete messages from zeromq and wanted to avoid an additional copy so it is no issue for us.

But it looks ok to me if you would want to provide token-by-token input instead of the complete message.
No better ideas come to mind to solve that.

@kalcutter
Copy link
Contributor

Wouldn't names like cbor_value_enter_string and CBOR_value_leave_string be more consistent with the rest of the API?

@kalcutter
Copy link
Contributor

Also the functions cbor_value_get_string_chunk_size and cbor_value_string_iteration_at_end aren't needed right? Wouldn't it be cleaner if cbor_value_at_end and cbor_value_get_string_length worked on the chunk level after cbor_value_enter_string is called?

@kalcutter
Copy link
Contributor

I guess cbor_value_get_string_chunk_size would also be useful if you want to have the new init function cbor_parser_init_reader... In this case, cbor_value_get_string_length would return the length of the definite string piece on the CBOR level, while cbor_value_get_string_chunk_size would return the size of the next chunk which could be shorter than that if the reader returned something shorter.

@thiagomacieira
Copy link
Member

Yes, enter and leave string might make sense.

The reason cbor_value_at_end doesn't work is because the element counter remains the same throughout the string iteration, as a chunked string is one single element in a map or array. This is because I didn't make the API like enter or leave -- I don't remember why. I can investigate it again to see if it makes sense to do that, in which case we can reuse the same function. Then we need to figure out a way to make sure you can't accidentally decode invalid stream: unlike a map or array, a chunked string cannot contain elements of any type nor can it contain indeterminate length nested strings, so the normal iteration API cannot work. That is, somehow the API needs to find a way to prevent you from accidentally doing:

    int v;
    CborValue str;
    cbor_value_enter_string(&it, &str);
    cbor_value_get_integer(&str, &v);

One way I had considered (but didn't implement) was to have a separate type, a CborStringValue, in which case you can't accidentally pass this iterator into the regular decoders. But then you can't pass it to cbor_value_at_end() & cbor_value_get_string_length either nor to _cbor_value_copy_string / dup_string. This would have also made the implementation in QCborStreamReader slightly more complex, for the same reason: it operates on a stack of CborValue.

But there may be another way: CborValue stores the current element type, so that you can't call get_integer on a string and the string chunk iterating functions could also enforce that next element type is a string of the same type and with explicit length. Or if they see a break byte, they set remaining = 0 such that at_end() returns true.

That sounds like a good plan. Now to have the time to explore it...

@thiagomacieira
Copy link
Member

cbor_value_get_string_length is implemented and documented as only returning the length of fixed-length strings, the same way that cbor_value_get_array_length and cbor_value_get_array_length are. If you want the length of a chunked string, you have to (currently) use cbor_value_calculate_string_length or, with the new API, iterate yourself over the chunks.

@kalcutter
Copy link
Contributor

I think using CborValue would be nicer than making a new type only for this case (e.g. CborStringValue). As you hinted at, I think errors/API misuse can be handled in the library with a flag (to indicate a string has been entered). This should work since strings cannot be entered recursively.

Since an indefinite string is made up of a series of fixed-length strings, I think it would be appropriate ifcbor_value_get_string_length returned the length of the current fixed length substring of the entered string. Note, however, that this is not necessarily the same as the current chunk length if cbor_parser_init_reader is used.

@kalcutter
Copy link
Contributor

Hi @thiagomacieira. Is there any reason not to already merge your pending changes at least up to commit 98cb87c ?

thiagomacieira@98cb87c

@thiagomacieira
Copy link
Member

Not really. I will probably merge everything and then design the new API modifying what is in there. The WIP commits are about the delegated streaming API, not the string API anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants