normalize issue when face special character #455

Closed
calvinsugianto opened this issue Mar 9, 2022 · 7 comments

@calvinsugianto

calvinsugianto commented Mar 9, 2022

Hello, I ran into an issue when parsing a URL that contains a special character such as é
with this code:
Addressable::URI.parse(url).normalize
It changes é into %C3%A9, and this causes an error.

What I need is for it to become
e%CC%81
instead, that is, e followed by a percent-encoded combining acute accent.

Is this possible with this gem?

example url:
https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Boisé-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg

Instead of the current result:

https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois%C3%A9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg -> wrong

it should become:
https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Boise%CC%81-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg -> correct

@dentarg
Collaborator

dentarg commented Jul 19, 2023

No, it is not possible.

https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois%C3%A9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg is correct

As an example, that is what I get back from Google Chrome (when I copy the URL) if I enter the URL with é in the path: https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois%C3%A9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg

From ruby/uri#40 (comment)

No, a URI path is not allowed to contain arbitrary UTF-8 characters. Non-ASCII UTF-8 characters must be percent encoded, and even some ASCII characters must be percent encoded.

And if you try to parse that URL using Ruby's uri library, it blows up:

irb(main):017:0> URI("https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Boisé-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg")
/Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/rfc3986_parser.rb:20:in `split': URI must be ascii only "https://sothebysrealty.ca/insightblog/wp-content/uploads/2020/03/Featured-62-Ch-du-Bois\u00E9-Lac-Beauport-QC-sothebys-international-realty-canada-1440x570.jpg" (URI::InvalidURIError)
	from /Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/rfc3986_parser.rb:71:in `parse'
	from /Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/common.rb:193:in `parse'
	from /Users/dentarg/.arm64_rubies/3.2.2/lib/ruby/3.2.0/uri/common.rb:722:in `URI'
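
For reference, here is a minimal sketch of that behavior using only the standard library. The `example.com` URL and the `percent_encode_non_ascii` helper are made up for illustration and are not part of uri or Addressable. It shows that the raw string is rejected, while the same string parses once its non-ASCII bytes are percent-encoded.

```ruby
require "uri"

# Hypothetical helper (not part of uri or Addressable): percent-encode every
# non-ASCII byte of a UTF-8 string and leave ASCII bytes untouched.
def percent_encode_non_ascii(str)
  str.each_byte.map { |b| b < 0x80 ? b.chr : format("%%%02X", b) }.join
end

# Made-up URL with é written as the single code point U+00E9.
raw = "https://example.com/uploads/Bois\u00E9.jpg"

begin
  URI(raw)
  parsed_raw = true
rescue URI::InvalidURIError
  parsed_raw = false # the non-ASCII string is rejected by the parser
end

encoded = percent_encode_non_ascii(raw)
uri = URI(encoded) # parses, because the string is now ASCII only

puts parsed_raw # false
puts uri.path   # /uploads/Bois%C3%A9.jpg
```

The helper only touches bytes outside ASCII, so it does not handle ASCII characters that also need escaping, such as spaces.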

dentarg closed this as completed Jul 19, 2023
@maxime-carbonneau

I would like to re-open this issue. I think there is a misunderstanding.

There are (at least) two ways to represent the letter « é »:

  1. The character itself, « é », which is represented by code point 233 (U+00E9)
  2. The sequence « e » + a combining acute accent, which is represented by code points 101 + 769 (U+0065 + U+0301)

That is how « http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg » is correctly converted to « http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Co%CC%82te%CC%81-2.0-M.jpg »
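
The two representations can be checked in plain Ruby. This is a small sketch using `String#codepoints` and `String#unicode_normalize` from core Ruby; the `percent_encode_non_ascii` helper is hypothetical and only there to show the resulting bytes.

```ruby
# Hypothetical helper: percent-encode every non-ASCII byte of a UTF-8 string.
def percent_encode_non_ascii(str)
  str.each_byte.map { |b| b < 0x80 ? b.chr : format("%%%02X", b) }.join
end

composed   = "\u00E9"  # é as a single code point (NFC form)
decomposed = "e\u0301" # e followed by a combining acute accent (NFD form)

p composed.codepoints   # [233]
p decomposed.codepoints # [101, 769]

# The two strings render the same but are not equal byte for byte.
p composed == decomposed                         # false
p composed.unicode_normalize(:nfd) == decomposed # true

p percent_encode_non_ascii(composed)   # "%C3%A9"
p percent_encode_non_ascii(decomposed) # "e%CC%81"
```

So `%C3%A9` and `e%CC%81` are both valid UTF-8 percent-encodings. They differ only in which Unicode normalization form the text was in before it was encoded.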

@dentarg
Collaborator

dentarg commented Aug 23, 2024

Google Chrome is converting http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg to http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg for me

@maxime-carbonneau

The link is an image coming from http://ferrisson.com/pierre-paul-cote-csq/

According to my Chrome inspector, the link should be converted to http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Co%CC%82te%CC%81-2.0-M.jpg
[Screenshot, 2024-08-23 at 16:29:07]

I am also posting the Safari inspector, since the conversion is more obvious there.
[Screenshot, 2024-08-23 at 16:27:17]

@dentarg
Collaborator

dentarg commented Aug 23, 2024

That is how « http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg » is correctly converted ...

I copied from your message here on GitHub when I got http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg

I can also see the source code of http://ferrisson.com/pierre-paul-cote-csq/ referencing http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Côté-2.0-M.jpg. When I enter that into the address bar in Google Chrome, the image loads, and if I copy the URL from the address bar, it is http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-Co%CC%82te%CC%81-2.0-M.jpg
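
One way to see why the two copies differ is to decode both percent-encoded file names and check their Unicode normalization form. This is a sketch using `URI.decode_www_form_component` and `String#unicode_normalized?` from the standard library; the variable names are only for illustration, and the segments are shortened from the URLs above.

```ruby
require "uri"

from_github_comment = "C%C3%B4t%C3%A9"   # segment copied from the GitHub comment
from_address_bar    = "Co%CC%82te%CC%81" # segment copied from the address bar

a = URI.decode_www_form_component(from_github_comment)
b = URI.decode_www_form_component(from_address_bar)

p a.unicode_normalized?(:nfc) # true:  composed characters
p b.unicode_normalized?(:nfc) # false: base letters plus combining marks
p b.unicode_normalized?(:nfd) # true

# Different byte sequences, same text after normalization.
p a == b                                                 # false
p a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc) # true
```

This suggests the text was in composed form in one place and decomposed form in the other, and that the browser percent-encoded whichever bytes it was given.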

@dentarg
Collaborator

dentarg commented Aug 23, 2024

I copied from your message here on GitHub when I got http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg

To be more clear, right click on the URL and "Copy Link Address" gave me http://ferrisson.com/wp-content/uploads/2014/04/2014-04-1-P.-P.-C%C3%B4t%C3%A9-2.0-M.jpg

@dentarg
Collaborator

dentarg commented Aug 23, 2024

Anyway, even if Chrome supports more representations, I'm not sure we can do that in Addressable (see the previous comments in the thread)
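
For anyone who needs to fetch files whose names were stored in decomposed form, one possible workaround outside Addressable is to build both encodings and try each. This is only a sketch under assumptions: `candidate_urls` and `percent_encode_non_ascii` are hypothetical helpers, the `example.com` URL is made up, and only non-ASCII bytes are escaped.

```ruby
# Hypothetical helper: percent-encode every non-ASCII byte of a UTF-8 string.
def percent_encode_non_ascii(str)
  str.each_byte.map { |b| b < 0x80 ? b.chr : format("%%%02X", b) }.join
end

# Hypothetical helper: return the percent-encoded URL in both composed (NFC)
# and decomposed (NFD) form, dropping duplicates for pure-ASCII input.
def candidate_urls(url)
  %i[nfc nfd].map { |form| percent_encode_non_ascii(url.unicode_normalize(form)) }.uniq
end

# Made-up URL; the accented letters are written as escapes to make the
# input form explicit (composed here).
urls = candidate_urls("http://example.com/C\u00F4t\u00E9.jpg")
puts urls
# http://example.com/C%C3%B4t%C3%A9.jpg
# http://example.com/Co%CC%82te%CC%81.jpg
```

A client could request the first candidate and fall back to the second on a 404. Which form a given server expects depends on how the file name was stored.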
