Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lapis is using the wrong request uri from ngx #706

Open
karai17 opened this issue Oct 1, 2020 · 5 comments
Open

Lapis is using the wrong request uri from ngx #706

karai17 opened this issue Oct 1, 2020 · 5 comments

Comments

@karai17
Copy link

karai17 commented Oct 1, 2020

lapis/lapis/nginx.moon

Lines 77 to 89 in d3a16d1

parsed_url: (t) ->
uri = ngx.var.request_uri
pos = uri\find("?")
uri = pos and uri\sub(1, pos-1) or uri
host_header = ngx.var.http_host
{
scheme: ngx.var.scheme
path: uri
host: ngx.var.host
port: host_header and host_header\match ":(%d+)$"
query: ngx.var.args
}

Lapis is checking ngx.var.request_uri for the request's uri, but this variable only points to the main request and uses a raw uri string which has some potential downsides (see below SO article). Instead, Lapis should use ngx.var.uri which is the current request uri, including any sub-requests. It is also normalized so any funky issues with the uri, including any encoding, is removed.

As of right now, using ngx.location.capture() to make sub-requests is wholly broken and unusable as it just loops the main request until the server throws a 500 error.

https://stackoverflow.com/questions/48708361/nginx-request-uri-vs-uri

@leafo
Copy link
Owner

leafo commented Oct 1, 2020

I seem to remember request_uri being chosen intentionally here, I don't think I anticipated the issue with using capture on the same app though.

@karai17
Copy link
Author

karai17 commented Oct 2, 2020

From what that SO article notes, it would seem that the two major differences between request_uri and uri are that request_uri will contain query strings whereas uri does not (the code following the variable assignment immediately cuts that off anyway) and that uri is normalized, which is probably more useful than the raw request string for almost every case.

@karai17
Copy link
Author

karai17 commented Oct 2, 2020

Doing a bit more research, it appears that nginx as a whole prefers normalized URIs, citing security reasons among others. request_uri seems to be provided for cases such as redirecting rather than direct use, the opposite of Lapis' current behaviour.

While this would be a breaking change, I think switching to uri is the correct and expected behaviour here, regardless of the sub-request case. I think it is also fair to say that any form of user input, whether it be a web address or form data is subject to scrutiny and /asdf///////%20qwerty being rewritten as /asdf/ qwerty is the expected result for 99.99% of applications. For applications that want explicit/horrible routes, there are nginx configuration variables that allow you to disable the normaization of uri.

@karai17
Copy link
Author

karai17 commented Oct 6, 2020

Going beyond nginx and reading the specification for URIs (https://tools.ietf.org/html/rfc3986) in general, I think there has been some misunderstandings on what a URI actually is based on our conversations on Discord over the past few days.

What is a URI?

https://tools.ietf.org/html/rfc3986#section-1

   A Uniform Resource Identifier (URI) provides a simple and extensible
   means for identifying a resource.  This specification of URI syntax
   and semantics is derived from concepts introduced by the World Wide
   Web global information initiative, whose use of these identifiers
   dates from 1990 and is described in "Universal Resource Identifiers
   in WWW" [RFC1630].

The spec defines a URI as a "simple and extensible means of identifying a resource", also noting that it has specific syntax and semantics that must be followed. Because it has syntax and semantics, a URI is not just a [unique] string such as the key in a key/value pair, but an actual path to a resource, relative or absolute.

https://tools.ietf.org/html/rfc3986#section-3

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment

Most of what I'll be writing about and quoting will be about the path section of a fully qualified URI, as that seems to be the area of contention.

What is a URI's Path?

https://tools.ietf.org/html/rfc3986#section-3.3

   The path component contains data, usually organized in hierarchical
   form, that, along with data in the non-hierarchical query component
   (Section 3.4), serves to identify a resource within the scope of the
   URI's scheme and naming authority (if any).

This definition notes specifically that a URI's path is usually organized in a "hierarchical form". This is analoguous to (but not necessarily equivalent to) a structured filesystem. Each segment in a path is typically though of as a nested layer of directories where the left-most segment encapsulates all segments right of it. Similar to a filesystem, we'd expect something akin to:

example.com
   /path
      /to
         /my
            file.html

If we look at a typical blog-style path, we may see something like /articles/2020/10/06/this-is-my-article where each segment encapsulates all the segments to the right of itself. All articles are namespaced in the articles segment, and that can be broken down by years, and each of those years broken down by months, and each of those months by days, and each of those days will contain the articles.

   The path segments "." and "..", also known as dot-segments, are
   defined for relative reference within the path name hierarchy.  They
   are intended for use at the beginning of a relative-path reference
   (Section 4.2) to indicate relative position within the hierarchical
   tree of names.  This is similar to their role within some operating
   systems' file directory structures to indicate the current directory
   and parent directory, respectively.  However, unlike in a file
   system, these dot-segments are only interpreted within the URI path
   hierarchy and are removed as part of the resolution process (Section
   5.2).

Relative path segments such as . and .. are required to be correctly resolved before being processed. Unlike a filesystem, these are meant to be resolved within the confines of the relative URI and cannot break out of the base URI.

URI Equivalency

A significant part of our discussion was based around the notion that one URI should point to one resource, and aliases should not exist at all. This isn't necessarily wrong, but I believe the misunderstanding here is focused on what constitutes an alias.

https://tools.ietf.org/html/rfc3986#section-6.1

   In testing for equivalence, applications should not directly compare
   relative references; the references should be converted to their
   respective target URIs before comparison.

https://tools.ietf.org/html/rfc3986#section-6.2.1

   False negatives are caused by the production and use of URI aliases.
   Unnecessary aliases can be reduced, regardless of the comparison
   method, by consistently providing URI references in an already-
   normalized form (i.e., a form identical to what would be produced
   after normalization is applied, as described below).

https://tools.ietf.org/html/rfc3986#section-6.2.2

   Implementations may use logic based on the definitions provided by
   this specification to reduce the probability of false negatives.
   This processing is moderately higher in cost than character-for-
   character string comparison.  For example, an application using this
   approach could reasonably consider the following two URIs equivalent:

      example://a/b/c/%7Bfoo%7D
      eXAMPLE://a/./b/../b/%63/%7bfoo%7d

   Web user agents, such as browsers, typically apply this type of URI
   normalization when determining whether a cached response is
   available.  Syntax-based normalization includes such techniques as
   case normalization, percent-encoding normalization, and removal of
   dot-segments.

https://tools.ietf.org/html/rfc3986#section-6.2.2.2

   The percent-encoding mechanism (Section 2.1) is a frequent source of
   variance among otherwise identical URIs.  In addition to the case
   normalization issue noted above, some URI producers percent-encode
   octets that do not require percent-encoding, resulting in URIs that
   are equivalent to their non-encoded counterparts.  These URIs should
   be normalized by decoding any percent-encoded octet that corresponds
   to an unreserved character, as described in Section 2.3.

https://tools.ietf.org/html/rfc3986#section-6.2.2.3

   The complete path segments "." and ".." are intended only for use
   within relative references (Section 4.1) and are removed as part of
   the reference resolution process (Section 5.2).  However, some
   deployed implementations incorrectly assume that reference resolution
   is not necessary when the reference is already a URI and thus fail to
   remove dot-segments when they occur in non-relative paths.  URI
   normalizers should remove dot-segments by applying the
   remove_dot_segments algorithm to the path, as described in
   Section 5.2.4.

https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Generic_syntax

A path component, consisting of a sequence of path segments separated by a slash (/). A path is always defined for a URI, though the defined path may be empty (zero length). A segment may also be empty, resulting in two consecutive slashes (//) in the path component.

All of the rules above talk about URI references and normalization. URI references, as far as I can determine, are the raw URIs that come into a system and must be normalized to produce the URI target. This normalization process should be applied to every URI reference at its destination. We can think of this similar to encoding/decoding JSON: the requesting client encodes JSON as the final step before sending the request, and the responding server decodes it as the first step after it is received. Because of this, reference URIs are not aliases, they are just raw data waiting to be normalized. It is only after normalization that we can determine if they are aliases.

Though I cannot find it specifically in the spec document, Wikipedia states that a path segment can be empty and thus many slashes can be adjacent to each other. Because of the semantics of path segments, these segments provide no information, hierarchical or otherwise, and can safely be compressed down to a single slash.

For URI normalization, the correct steps are as follows:

  1. Compress empty segments
  2. Resolve any dot-segments that may exist
  3. Decode percent-encoded data

Note that these are the rules that are applied to the ngx.var.uri variable.

@bungle
Copy link

bungle commented Mar 16, 2021

ngx.var.uri normalization is stupid and wrong. It does not take in account the reserved characters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants