Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URI encoding in Content State #2292

Open
azaroth42 opened this issue Apr 24, 2024 · 5 comments
Open

URI encoding in Content State #2292

azaroth42 opened this issue Apr 24, 2024 · 5 comments
Labels
content-state Issue relates to the content state API discuss normative

Comments

@azaroth42
Copy link
Member

The current 1.0 spec is non-deterministic about embedded URI decoding in the plain URI case.

The encoding requirement is:

When the content state is a plain URI, rather than a JSON object, it MUST NOT be content-state-encoded.

However it doesn't give an algorithm as how to interpret the URIs embedded in the query parameter in terms of URI encoding.

The previous decision for the proposed spec was to encode them (cf discovery#90) and then was changed to URI encode them after TRC feedback, but that URI encoding requirement never seems to have made it into the spec.

Currently I believe it is non-deterministic as to how a client should interpret the payload: iiif-content=https://example.org/manifest/%2550 -- is it %50 (1 encoding) or P (2 encodings) or %2550 (0 encodings) on the end?

@zimeon
Copy link
Member

zimeon commented Apr 24, 2024

I don't think there is any uncertainty about number of encoding/decoding actions. Per https://datatracker.ietf.org/doc/html/rfc3986#section-2.4 encoding SHOULD be done once when the URI is assembled from its component parts, and decoding MUST be done once when the URI is disassembled into its component parts (see also https://datatracker.ietf.org/doc/html/rfc3986#section-7.3). Attempts to do encoding/decoding separately from the URI assembly/disassembly steps will, in the general case, lead to errors.

Let's work though an example: If I build a URI from host1, path1 and query1 (leaving out fragment, assuming https scheme), I create a URI1 = https://ENC(host1)/ENC(path1)?ENC(query1) where ENC is my encoding function (omitting for now that ENC may be different for different components because they have different allowed characters).

Now say that I then build another URI that references URI1 in its query, with components host2, path2 and query iiif-content=URI1. I use the same procedure and get UR2 = https://ENC(host2)/ENC(path2)?iiif-content=ENC(https://ENC(host1)/ENC(path1)?ENC(query1)) -- there is double encoding of the embedded URI1 components and there must be double decoding when we reverse the two assembly processes.

Thus, to take Rob's example, if I am given a Content State URL:

https://example.com/viewer?iiif-content=https://example.org/manifest/%2550

this means (decode as we split) https://example.com/viewer (no change on decoding) should open https://example.org/manifest/%50 (%25 -> % on decode) which will then be interpreted as https//example.org/manifest/P (%50 -> P on decode) when split into its components.

(In this case we could just as well have used https://example.com/viewer?iiif-content=https://example.org/manifest/P or https://example.com/viewer?iiif-content=https://example.org/manifest/%50 because the P (like the / and . in the manifest URI) didn't need encoding)

@zimeon
Copy link
Member

zimeon commented Apr 24, 2024

To put the above another way: We don't get a choice of specifying when something should or should not be URI encoded, it is already part of the URI spec (albeit with great latitude allowing much more than the bare minimum to be encoded if one chooses). We can add helpful reminders though.

@azaroth42
Copy link
Member Author

azaroth42 commented Apr 25, 2024

To work a more normal case (I chose %2550 as an easy one to read): a manifest and viewer that themselves have query params.

So, the manifest's URI is: https://example.org/builder?id=1&fmt=iiif
The viewer's URI is https://example.com/viewer?fmt=iiif

We construct the parameter first:
https://ENC(example.org)/ENC(builder)?ENC(id=1&fmt=iiif)
--> https://example.org/builder?id=1%26fmt=iiif

(the & is part of the query part of the http URI syntax, so is encoded by the ? is part of the definition, and so is not, per https://datatracker.ietf.org/doc/html/rfc3986#section-3)

Then we add the query param iiif-content to the viewer URI to get
https://example.com/viewer?fmt=iiif&iiif-content=

We then add the encoded param:
https://example.com/viewer?fmt=iiif&iiif-content=https://example.org/builder?id=1%26fmt=iiif

And encode:

https://ENC(example.com)/ENC(viewer)?ENC(fmt=iiif&iiif-content=https://example.org/builder?id=1%26fmt=iiif)

with a result of

https://example.com/viewer?fmt=iiif%26iiif-content=https://example.org/builder%3Fid=1%2526fmt=iiif

Assuming that =, : and / are all safe, but ? and & are not.

If that's correct, and even if we're just following the rules, I think having that spelled out in no uncertain terms is necessary for folks to end up at the same conclusion.

@zimeon
Copy link
Member

zimeon commented Apr 26, 2024

@azaroth42 - I agree with the above except the encoding of & as %26 when building the manifest URI. I was sloppy when I suggested ENC(query) above, sorry!

This is not part of RFC3986 URI syntax requirements -- they generally would allow & to be unencoded and usually encoded would also be OK (I copied the relevant BNF syntax rules in IIIF/trc#125 (comment), & is in sub-delims).

However, the key1=value1&key2=value2 format for the query makes a distinction between & and = that might be in keys and values, and instances where they are part of the structure. I think this is defined by https://url.spec.whatwg.org/#application/x-www-form-urlencoded, that is what Python3's urllib.parse references). The decoding algorithm has: split on & then for each chunk split on first = and then decode/unquote the resulting key and value strings.

>>> import urllib
>>> urllib.parse.parse_qsl('id=1%26fmt=iiif')    # broken
[('id', '1&fmt=iiif')]
>>> urllib.parse.parse_qsl('id=1&fmt=iiif')      # good
[('id', '1'), ('fmt', 'iiif')]
>>> urllib.parse.parse_qsl('ke%3Dy1=val%3Due1&ke%26y2=val%26ue2&key3=second=unencoded') # complex!
[('ke=y1', 'val=ue1'), ('ke&y2', 'val&ue2'), ('key3', 'second=unencoded')]

To rerun your example:

  • Manifest URI encodes to: https://example.org/builder?id=1&fmt=iiif (no %)
  • Build viewer URI as https://example.com/viewer?fmt=iiif&iiif-content= + ENC(manifest-uri)
  • Where an almost minimal encoding would be https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif
>>> import urllib
>>> p = urllib.parse.urlsplit('https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif')
>>> p
SplitResult(scheme='https', netloc='example.com', path='/viewer', query='fmt=iiif&iiif-content=https://example.com/builder?id%261%3Dfmt%26iiif', fragment='')
>>> urllib.parse.parse_qsl(p.query)
[('fmt', 'iiif'), ('iiif-content', 'https://example.com/builder?id&1=fmt&iiif')]

(for fully minimal encoding, which I don't recommend, the additional = in the manifest URI don't need to be encoded because the split is on the first occurrence...

>>> urllib.parse.parse_qsl(urllib.parse.urlsplit('https://example.com/viewer?fmt=iiif&iiif-content=https://example.com/builder?id=1%3Dfmt=iiif').query)
[('fmt', 'iiif'), ('iiif-content', 'https://example.com/builder?id=1=fmt=iiif')]

)

@azaroth42
Copy link
Member Author

I'm convinced :) I think the above recipe should be published, preferably in the next version of the spec, and before then as a cookbook recipe.

@azaroth42 azaroth42 added content-state Issue relates to the content state API and removed discovery Issue relates to the Change Discovery API labels Jun 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content-state Issue relates to the content state API discuss normative
Projects
None yet
Development

No branches or pull requests

2 participants