Non ASCII characters are not allowed in the path #40

Open
asok opened this issue May 13, 2022 · 4 comments

@asok

asok commented May 13, 2022

Hi,
I'm getting the following error:

irb(main):001:0> require 'uri'
=> false
irb(main):002:0> URI::HTTPS.build(host: 'example.com', path: '/łódź')
Traceback (most recent call last):
       10: from /Users/asokolnicki/.rubies/ruby-2.6.3/bin/irb:23:in `<main>'
        9: from /Users/asokolnicki/.rubies/ruby-2.6.3/bin/irb:23:in `load'
        8: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
        7: from (irb):2
        6: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/http.rb:62:in `build'
        5: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:137:in `build'
        4: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:137:in `new'
        3: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:193:in `initialize'
        2: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:807:in `path='
        1: from /Users/asokolnicki/.rubies/ruby-2.6.3/lib/ruby/2.6.0/uri/generic.rb:761:in `check_path'
URI::InvalidComponentError (bad component(expected absolute path component): /łódź)

I thought that the path component is allowed to contain any UTF-8 character.

@noraj

noraj commented Feb 19, 2023

cf. ruby/webrick#110, especially this comment ruby/webrick#110 (comment).

This is because URI doesn't support RFC 3987 (Internationalized Resource Identifier (IRI)).

@jeremyevans
Contributor

No, a URI path is not allowed to contain arbitrary UTF-8 characters. Non-ASCII UTF-8 characters must be percent-encoded, and even some ASCII characters must be percent-encoded. It's true that the URI library doesn't support IRIs. That's not a bug; there should probably be a separate library for IRIs.
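To illustrate the percent-encoding requirement described above, here is a minimal sketch of building the URI from the original report with an encoded path. The helper name `encode_path` is hypothetical and not part of the uri library; it assumes the input is valid UTF-8 and encodes every character outside the RFC 3986 "unreserved" set in each segment, which is stricter than the grammar requires.

```ruby
require 'uri'

# Hypothetical helper (not provided by the uri library).
# Percent-encodes each path segment, leaving the "/" separators intact.
# Assumes the input string is valid UTF-8.
def encode_path(path)
  path.split('/', -1).map { |segment|
    # Replace every character outside the "unreserved" set with the
    # percent-encoded form of its UTF-8 bytes.
    segment.gsub(/[^A-Za-z0-9\-._~]/) { |ch|
      ch.bytes.map { |b| format('%%%02X', b) }.join
    }
  }.join('/')
end

uri = URI::HTTPS.build(host: 'example.com', path: encode_path('/łódź'))
puts uri
```

With the path encoded this way, `URI::HTTPS.build` should no longer raise `URI::InvalidComponentError`, because the path contains only ASCII characters.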

@noraj

noraj commented Feb 20, 2023

IRIs were not integrated into URIs in order to preserve backward compatibility, but IRI extends URI.

RFC 3987, section 3:

IRIs are meant to replace URIs in identifying resources for
protocols, formats, and software components that use a UCS-based
character repertoire.

Ruby has extensive Unicode support (in strings, regexps, etc.), so the lack of Unicode support in the uri module is an exception.

If changing the behavior of the default parse method is undesirable, the uri module could add a :unicode / :iri (or similar) option to the parse method, or an alternative method, parse_iri, that accepts an IRI, maps it to a URI, and passes the resulting URI to the classic parse method, which handles only ASCII URIs. RFC 3987 explains how to map an IRI to a URI and a URI to an IRI.
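A rough sketch of what such a parse_iri method might look like, written here as a standalone function. This is only an assumption about one possible shape, not an existing uri API. It percent-encodes the UTF-8 bytes of every non-ASCII character and then delegates to `URI.parse`. It does not implement the full RFC 3987 mapping: a non-ASCII host would normally need IDNA conversion, which this sketch skips.

```ruby
require 'uri'

# Hypothetical parse_iri (not part of the uri library).
# Maps an IRI to a URI by percent-encoding the UTF-8 bytes of each
# non-ASCII character, then hands the result to the regular parser.
# Limitation: non-ASCII host names are percent-encoded too, not
# converted with IDNA.
def parse_iri(iri)
  ascii_only = iri.encode(Encoding::UTF_8).gsub(/[^[:ascii:]]/) { |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  }
  URI.parse(ascii_only)
end

uri = parse_iri('https://example.com/łódź')
puts uri.host
puts uri.path
```

Keeping the mapping in a separate entry point like this would leave the behavior of the existing parse method unchanged for callers that rely on strict URI validation.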

As IRI extends URI and is deeply linked to it, I would rather see IRI support integrated as new methods in the URI module than have a separate module only for IRI. But that's just my POV, and I may not be the best suited or most experienced person here.

That's not a bug

I agree, it's more of a feature request to support modern usage, where Unicode is widespread.

@mkasberg

Just ran into this today... noraj's comments above seem spot-on to me.
