Some URL-safe special characters cannot be used in Page path #12993

Open
trwnh opened this issue Oct 29, 2024 · 0 comments

What version of Hugo are you using (hugo version)?

$ hugo version
hugo v0.136.5+extended linux/amd64 BuildDate=unknown

Does this issue reproduce with the latest release?

yes

Steps to reproduce

  1. Add a page (e.g. using a content adapter) whose path includes an exclamation mark (e.g. yu-gi-oh!)

Expected behavior

The page is created with a URL that includes the exclamation mark (e.g. /tags/yu-gi-oh!)

Actual behavior

The exclamation mark appears to be stripped from the path (e.g. /tags/yu-gi-oh), which creates a path conflict with an already existing page.

Additional information

https://gohugo.io/methods/page/path/ describes the following:

To determine the logical path for pages backed by a file, Hugo starts with the file path, relative to the content directory, and then:

  • Strips the file extension
  • Strips the language identifier
  • Converts the result to lower case
  • Replaces spaces with hyphens

The value returned by the Path method on a Page object is independent of content format, language, and URL modifiers such as the slug and url front matter fields.
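To make the documented behavior concrete, here is a minimal Go sketch that applies only the four steps quoted above to a slash-separated path relative to the content directory. It is not Hugo's implementation: the function name `logicalPath` and the `langs` set of configured language identifiers are hypothetical, and special cases such as index files or page bundles are ignored.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// logicalPath applies only the four documented steps.
// rel is a slash-separated path relative to the content directory.
// langs is a hypothetical set of configured language identifiers.
func logicalPath(rel string, langs map[string]bool) string {
	p := rel

	// 1. Strip the file extension.
	p = strings.TrimSuffix(p, path.Ext(p))

	// 2. Strip the language identifier, if the remaining suffix is one.
	if ext := path.Ext(p); ext != "" && langs[strings.TrimPrefix(ext, ".")] {
		p = strings.TrimSuffix(p, ext)
	}

	// 3. Convert the result to lower case.
	p = strings.ToLower(p)

	// 4. Replace spaces with hyphens.
	p = strings.ReplaceAll(p, " ", "-")

	return "/" + p
}

func main() {
	langs := map[string]bool{"en": true, "de": true}
	for _, rel := range []string{"About Us.en.md", "foo.bar.md", "tags/Yu-Gi-Oh!.md"} {
		fmt.Println(rel, "=>", logicalPath(rel, langs))
	}
}
```

Under these four steps alone, no punctuation is removed, so `tags/Yu-Gi-Oh!.md` keeps its exclamation mark.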

Nowhere in these 4 steps does it say anything about removing URL-safe characters entirely. This issue seems to occur for some URL-safe characters, but not all. These are the ones that work:

  • content/foo-bar.md => /foo-bar/ (correct).
  • content/foo_bar.md => /foo_bar/ (correct).
  • content/foo.bar.md => /foo.bar/ (correct).
  • content/foo+bar.md => /foo+bar/ (correct).
  • content/foo@bar.md => /foo@bar/ (correct).
  • content/foo~bar.md => /foo~bar/ (correct).

And these are the ones that don't work:

  • content/foo$bar.md => /foobar/ (incorrect). /foo$bar/ leads to a Page Not Found.
  • content/foo!bar.md => /foobar/ (incorrect). /foo!bar/ leads to a Page Not Found.
  • content/foo*bar.md => /foobar/ (incorrect). /foo*bar/ leads to a Page Not Found.
  • content/foo'bar.md => /foobar/ (incorrect). /foo'bar/ leads to a Page Not Found.
  • content/foo(bar.md => /foobar/ (incorrect). /foo(bar/ leads to a Page Not Found.
  • content/foo)bar.md => /foobar/ (incorrect). /foo)bar/ leads to a Page Not Found.
  • content/foo;bar.md => /foobar/ (incorrect). /foo;bar/ leads to a Page Not Found.
  • content/foo=bar.md => /foobar/ (incorrect). /foo=bar/ leads to a Page Not Found.
  • content/foo:bar.md => /foobar/ (incorrect). /foo:bar/ leads to a Page Not Found.
  • content/foo[bar.md => /foobar/ (incorrect). /foo[bar/ leads to a Page Not Found.
  • content/foo]bar.md => /foobar/ (incorrect). /foo]bar/ leads to a Page Not Found.
  • content/foo&bar.md => /foobar/ (incorrect). /foo&bar/ leads to a Page Not Found.
  • content/foo,bar.md => /foobar/ (incorrect). /foo,bar/ leads to a Page Not Found.
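The two lists above can be summarized as a small model. The Go sketch below is only a model of the observed results, not Hugo's actual code: it keeps letters, digits, and the punctuation that was seen to survive (`- _ . + @ ~`) and drops everything else. The names `observedKept` and `observedSanitize` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// observedKept is the punctuation reported as preserved in this issue.
const observedKept = "-_.+@~"

// observedSanitize models the reported behavior: letters, digits, and
// the characters in observedKept survive; anything else is dropped.
func observedSanitize(name string) string {
	var b strings.Builder
	for _, r := range name {
		if unicode.IsLetter(r) || unicode.IsDigit(r) || strings.ContainsRune(observedKept, r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	// Group the test names by their sanitized form to expose collisions.
	names := []string{"foo+bar", "foo@bar", "foo!bar", "foo$bar", "foo*bar", "foobar"}
	byPath := map[string][]string{}
	for _, n := range names {
		key := observedSanitize(n)
		byPath[key] = append(byPath[key], n)
	}
	for key, group := range byPath {
		if len(group) > 1 {
			fmt.Printf("/%s/ is claimed by %v\n", key, group)
		}
	}
}
```

In this model, `foo!bar`, `foo$bar`, `foo*bar`, and `foobar` all map to the same path, which is the kind of conflict described under "Actual behavior".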

Again, I would expect that URL-safe characters are preserved, because there is nothing to suggest that they should be removed. Out of the "reserved" characters per the URI RFC (: / ? # [ ] @ ! $ & ' ( ) * + , ; =), we can probably eliminate characters that have special semantics in HTTP(S), like / for path segment separation, ? for query components, and # for fragment components. But most everything else should be fine to use in the path component. It feels weird to allow + or @, but not ! or $.
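For reference, RFC 3986 defines a path segment character (`pchar`) as an unreserved character, a percent-encoded octet, a sub-delimiter, `:` or `@`. The Go sketch below checks each reserved character against that grammar; the function name `allowedInPathSegment` is hypothetical, and percent-encoded octets are not handled. One caveat under this grammar: `[` and `]` are not part of `pchar`, so a strictly conforming URL would need them percent-encoded, and `:` is not permitted in the first segment of a relative-path reference.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	unreservedPunct = "-._~"        // unreserved, besides ALPHA and DIGIT
	subDelims       = "!$&'()*+,;=" // sub-delims per RFC 3986
)

// allowedInPathSegment reports whether c may appear literally in a
// path segment, following pchar = unreserved / sub-delims / ":" / "@".
func allowedInPathSegment(c byte) bool {
	switch {
	case c >= 'a' && c <= 'z', c >= 'A' && c <= 'Z', c >= '0' && c <= '9':
		return true
	case strings.IndexByte(unreservedPunct, c) >= 0:
		return true
	case strings.IndexByte(subDelims, c) >= 0:
		return true
	}
	return c == ':' || c == '@'
}

func main() {
	// The reserved set quoted in this issue.
	for _, c := range []byte(":/?#[]@!$&'()*+,;=") {
		fmt.Printf("%q literal in path segment: %v\n", c, allowedInPathSegment(c))
	}
}
```

By this check, every character in the "don't work" list except `[` and `]` is valid as a literal in a path segment.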

Example URLs with special characters in them:

Proposed resolution

Stop removing URL-safe characters from Page.Path, or at least make it clear (or better yet, configurable) which characters will be removed and which ones won't.

trwnh changed the title from "Some URL-safe characters cannot be used in Page path" to "Some URL-safe special characters cannot be used in Page path" on Oct 29, 2024