1 change: 1 addition & 0 deletions docs/how-to-parse.md
@@ -70,6 +70,7 @@ To parse a `purl` string in its components:
- Discard any empty segment from that split
- Percent-decode each segment
- UTF-8-decode each segment if needed in your programming language
- Report an error if any segment contains a slash `/`
Contributor

@sjn sjn Sep 22, 2025


Just for clarity: what is the purpose of this rule, and of reporting an error?

I'm thinking of a couple of things when I read this.

  • Is this within the spirit of Postel's Law?
  • Should it also require the parsing to stop/die/break/exit/produce an exception? ("Stop the parsing and report an error if any segment contains a /")
  • Are there other bytes that should be illegal? E.g. disallow %00 (the null byte), since it is also illegal in filenames?

Contributor Author


The main reason to report an error here is to protect consumers of PURLs from malformed or malicious input. By error I mean whatever the parser uses to signal a problem (exception, exit status, etc.).

Historically, many vulnerabilities in HTTP servers came from path traversal attacks where characters like . or / were smuggled in through alternative encodings (e.g. . as %2E, %C0%AE, %E0%80%AE, %F0%80%80%AE; / as %2F, %C0%AF, %E0%80%AF, %F0%80%80%AF). Allowing these in PURLs would create ambiguity and open the door to similar exploits.
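To make the overlong-encoding point concrete, here is a minimal Python sketch, assuming a parser that percent-decodes to bytes and then applies strict UTF-8 decoding. The helper names `decode_segment_strict` and `is_well_formed` are hypothetical and not part of the PURL spec or any PURL library. Strict decoding rejects the overlong forms listed above, while the plain `%2F` form decodes cleanly to `/` and therefore needs the separate slash check.

```python
from urllib.parse import unquote_to_bytes


def decode_segment_strict(segment: str) -> str:
    """Percent-decode one segment, then decode the bytes as strict UTF-8.

    Hypothetical helper for illustration only.
    """
    raw = unquote_to_bytes(segment)
    # errors="strict" raises UnicodeDecodeError on overlong or invalid sequences
    return raw.decode("utf-8", errors="strict")


def is_well_formed(segment: str) -> bool:
    """Return True if the segment decodes to valid UTF-8."""
    try:
        decode_segment_strict(segment)
    except UnicodeDecodeError:
        return False
    return True


# Overlong encodings of "/" taken from the comment above
for candidate in ("%C0%AF", "%E0%80%AF", "%F0%80%80%AF"):
    print(candidate, is_well_formed(candidate))
```

A lenient decoder (for example one that replaces invalid bytes instead of raising) would not signal these inputs, so the choice of strict decoding is itself an assumption of this sketch.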

In the PURL spec today:

  • Empty segments (//) are not meaningful, but are often an honest mistake in producers. The current parser recommendation is to normalize them away rather than fail.
  • Slash in a segment (/) is never valid. Since the parse process splits on / before decoding, any literal / inside a segment must have been hidden behind percent-encoding, which is a strong signal of an attempt to “escape” the namespace. In this case, failing fast and surfacing an error is IMHO the safer and clearer choice.

This distinction matters because some ecosystems map PURLs directly to URLs. A PURL like:

pkg:golang/github.com/foo/..%2Fbar/artifact

could trick a consumer into resolving bar/artifact instead of foo/artifact if the parser silently accepts it. By requiring an error, the spec prevents that entire class of misinterpretation.
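The misresolution described above can be sketched in a few lines of Python. This assumes a hypothetical lenient consumer (`lenient_path` is an illustrative name, not a real API) that decodes each segment, rejoins the path, and then normalizes it the way a URL or filesystem layer might.

```python
import posixpath
from urllib.parse import unquote


def lenient_path(purl_path: str) -> str:
    """Hypothetical lenient consumer: decode segments, rejoin, normalize."""
    # Split on "/" first, drop empty segments, then percent-decode each one
    segments = [unquote(s) for s in purl_path.split("/") if s]
    # After decoding, "..%2Fbar" has become "../bar" and joins into the path
    return posixpath.normpath("/".join(segments))


# Namespace and name portion of pkg:golang/github.com/foo/..%2Fbar/artifact
resolved = lenient_path("github.com/foo/..%2Fbar/artifact")
print(resolved)  # github.com/bar/artifact
```

The `foo` component disappears because the decoded `..` is only interpreted after the split has already happened, which is the case the proposed rule turns into an error.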

Contributor


Namespaces should not contain %2F because of the way PURL performs encoding and decoding, not because it is an illegal filename character on some operating systems. If you read the PURL pkg:generic/a%2Fb/c the a%2Fb cannot be represented and turns into a/b while parsing.
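A small Python sketch of that round trip, assuming the split-then-decode parsing steps and a builder that splits the namespace on `/` and percent-encodes each segment. Both function names are hypothetical and only illustrate the loss of information.

```python
from urllib.parse import quote, unquote


def parse_namespace_lenient(encoded: str) -> str:
    """Split on "/", drop empty segments, decode each, join with "/"."""
    return "/".join(unquote(s) for s in encoded.split("/") if s)


def build_namespace(namespace: str) -> str:
    """Inverse direction: split on "/", percent-encode each segment, join."""
    return "/".join(quote(s, safe="") for s in namespace.split("/") if s)


original = "a%2Fb"                            # one segment in the PURL string
decoded = parse_namespace_lenient(original)   # becomes "a/b"
rebuilt = build_namespace(decoded)            # now two segments: "a/b"
print(original, decoded, rebuilt)
```

Once decoded, the single segment `a%2Fb` is indistinguishable from the two segments `a` and `b`, so the original string cannot be rebuilt.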

However, I don't think it really matters for namespaces and it's probably better not to do this. I guess this avoids potentially confusing parses like pkg:generic/a%2Fb/c%2Fd/e%2Ff having a namespace a/b/c/d and name e/f, and the unnecessary edge case about empty segments created by a previous rule about namespaces.

I still believe that namespaces are a mistake that needs to be fixed by treating the part between the type and the version as an opaque path string, similar to how it works in URLs, with the meaning defined by the package type. This new rule may be a step in the wrong direction because it forbids certain character sequences from being in that segment in a convoluted way.

For example, pkg:golang has no namespace but the name often contains slashes, so if namespaces are eliminated then the path would still need to be something like github.com%2Fpackage-url%2Fexample for compatibility with namespace+name implementations, which would be allowed because there are no %2F characters followed by an unencoded / character. However, in some other package type (maybe pkg:swid), Acme A%2FB/Widgets would need to be forbidden because parsers implementing this proposed addition to the spec would see the %2F as an illegal namespace character, making it complicated to deal with company names ending in "A/B".

Maybe it's too broken already and fixing namespaces would need to wait for a pkg2 that follows URL parsing semantics, parsing only from the left and not trying to apply special meaning to the path strings used by different backends.

Contributor


> The main reason to report an error here is to protect consumers of PURLs from malformed or malicious input. By error I mean whatever the parser uses to signal a problem (exception, exit status, etc.).

Unlike subpaths, namespaces are not paths, and should not be blanket sanitized as if they are paths.

Member

@jkowalleck jkowalleck Sep 22, 2025


> Namespaces should not contain %2F because of the way PURL performs encoding and decoding, not because it is an illegal filename character on some operating systems. If you read the PURL pkg:generic/a%2Fb/c the a%2Fb cannot be represented and turns into a/b while parsing.

Per the current spec, namespace segments MUST NOT contain %2F anyway (see grammar in #578), and if they do, the whole thing is not a valid PURL. So far, there is no rule saying what a parser should do when forbidden characters occur.
This is what this PR tries to fix: it adds a rule requiring the parser to report the error and fail the parse outright. (In this case, Postel's Law must not be applied: fail and report, with no "try to fix it" approach.)

- Apply type-specific normalization to each segment if needed
- Join segments back with a '/'
- This is the `namespace`
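
The namespace steps in this diff, including the proposed slash check, could be sketched in Python roughly as follows. This is an illustration under assumptions, not a reference implementation: `PurlParseError` and `parse_namespace` are hypothetical names, the input is assumed to be the already-isolated namespace portion of the PURL, strict UTF-8 decoding is this sketch's choice, and type-specific normalization is left out.

```python
from urllib.parse import unquote_to_bytes


class PurlParseError(ValueError):
    """Hypothetical error type used to signal an invalid PURL."""


def parse_namespace(raw: str) -> str:
    """Parse the raw (still percent-encoded) namespace portion of a PURL."""
    segments = []
    for part in raw.split("/"):        # split on "/" before decoding
        if not part:
            continue                   # discard empty segments
        try:
            # percent-decode, then UTF-8-decode (strictly, by assumption)
            decoded = unquote_to_bytes(part).decode("utf-8")
        except UnicodeDecodeError as exc:
            raise PurlParseError(f"invalid UTF-8 in segment {part!r}") from exc
        if "/" in decoded:
            # proposed rule: a slash can only arrive here via percent-encoding
            raise PurlParseError(f"segment {part!r} contains a slash")
        segments.append(decoded)       # type-specific normalization omitted
    return "/".join(segments)          # this is the namespace
```

With this sketch, `parse_namespace("github.com//foo")` normalizes the empty segment away and returns `github.com/foo`, while `parse_namespace("github.com/foo/..%2Fbar")` raises `PurlParseError` instead of returning a namespace.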