truncate_html does not respect Unicode #35

Open
adamflorin opened this issue Feb 13, 2013 · 17 comments

@adamflorin

Hi @hgmnz,

A client is running some content with Unicode characters (namely, an up arrow) through truncate_html and noticing that those characters are disappearing.

I've narrowed it down to the scan in TruncateHtml::HtmlString. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.

You can paste this code into an .rb file and run it to see what I mean:

# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."

# Regex copied from TruncateHtml::HtmlString
def regex
  /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end

# scan normally respects unicode.
puts unicode_string.scan(/.*/).join

# but this regex does not.
puts unicode_string.scan(regex).join

The result at the command line is

Up Arrow (↑) points up.
Up Arrow () points up.

Thanks!
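
For anyone else trying to read that regex, one rough way to see what it does is to split it into its five alternatives and ask which one claims each character. This is only a diagnostic sketch: the `PARTS` labels and the `classify` and `unmatched` helpers are made up for illustration and are not part of truncate_html. `String#match?` needs Ruby 2.4 or newer.

```ruby
# Illustrative only: the alternatives of the regex above, split out and labeled.
# The labels and helper names are hypothetical, not part of the gem.
PARTS = {
  # whole <script>...</script> blocks
  script: /(?:<script.*>.*<\/script>)+/,
  # any other opening or closing tag
  tag:    /<\/?[^>]+>/,
  # runs of letters, \w characters and the listed ASCII punctuation
  word:   /[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+/,
  # runs of whitespace
  space:  /\s+/,
  # a single punctuation character
  punct:  /[[:punct:]]/
}.freeze

# Name of the first alternative that matches a single character, or nil.
def classify(char)
  found = PARTS.find { |_name, pattern| char.match?(pattern) }
  found && found.first
end

# Characters that no alternative matches. String#scan skips these silently,
# which is why they vanish from the joined result.
def unmatched(string)
  string.chars.select { |char| classify(char).nil? }
end

p classify("U")   # :word
p classify(" ")   # :space
p unmatched("Up Arrow (↑) points up.")   # ["↑"], matching the output reported above
```

The arrow is a symbol rather than a letter, whitespace or punctuation, so nothing in the alternation picks it up.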

@hgmnz
Owner

hgmnz commented Feb 13, 2013

It's going to take me a little while to describe the regex, I'm afraid, but I'll take this as a bug report and try to fix it soon.

If you get to it sooner, please submit a pull request!

Thanks

@adamflorin
Author

OK, thanks!

@dmfrancisco
Contributor

I have the same problem using Ruby 2.0.0-p0. It does not happen (for me) with Ruby 1.9.3. Ruby 2.0 seems to use a new regexp engine, which probably isn't fully backward compatible. I replaced \w with \p{word} (in the regex method) and that looks like it solves this, but I'm not sure of the implications.
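
To make the `\w` versus `\p{Word}` difference concrete, here is a small sketch. It reflects how current Ruby versions behave as far as I know (`\w` is documented as `[a-zA-Z0-9_]`); it has not been checked against 1.9.3 or 2.0.0-p0 specifically, so it says nothing about the version difference described above.

```ruby
word = "caf\u00E9" # "café", with é as a single code point

ascii_word   = word.scan(/\w+/)          # \w covers only [a-zA-Z0-9_]
unicode_word = word.scan(/\p{Word}+/)    # Unicode letters, marks, numbers, connector punctuation
posix_alpha  = word.scan(/[[:alpha:]]+/) # POSIX bracket, Unicode-aware

p ascii_word    # ["caf"]
p unicode_word  # ["café"]
p posix_alpha   # ["café"]
```

Neither `\p{Word}` nor `[[:alpha:]]` covers symbols such as ↑, so this swap on its own would only help with accented letters.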

@dmfrancisco
Contributor

Oops. It seems this has been solved on master already 😃 Thanks for the hard work.

@hgmnz
Owner

hgmnz commented Mar 31, 2013

Thanks for verifying @dmfrancisco :)

hgmnz closed this as completed Mar 31, 2013

@dmfrancisco
Contributor

Sorry @hgmnz, I should have tested this better before commenting. My tests pass for Portuguese special characters, but when I tested the original string provided by @adamflorin it still seems to fail. Example:

truncate_html "café ↑ périferôl"
# => "café  périferôl"

In short, it seems the master branch fixes the issue for alphabets with special characters but not for Unicode symbols.
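
The split between "special characters" and "symbols" lines up with Unicode general categories: é and ô are letters (`\p{L}`), while ↑ is a symbol (`\p{S}`). One possible way to make a scan-based tokenizer lossless is to add a catch-all final alternative. The sketch below uses a deliberately simplified tokenizer, not the gem's real regex, and `TOKENS` and `LOSSLESS_TOKENS` are hypothetical names; it is not a description of what master does.

```ruby
# Simplified stand-in for the gem's tokenizer: tags, words, whitespace, punctuation.
TOKENS = /<\/?[^>]+>|[[:alpha:]\w]+|\s+|[[:punct:]]/

# Same alternatives plus a catch-all, so any single character that nothing
# else claims becomes its own token instead of being skipped by scan.
LOSSLESS_TOKENS = Regexp.union(TOKENS, /./m)

input = "café ↑ périferôl"

p "é".match?(/\p{L}/)   # true: a letter
p "↑".match?(/\p{S}/)   # true: a symbol
p "↑".match?(/\p{L}/)   # false

p input.scan(TOKENS).join == input           # false, the arrow is gone
p input.scan(LOSSLESS_TOKENS).join == input  # true
```

A catch-all treats every unknown character as its own one-character token, which may or may not be the right unit for word-based truncation; that trade-off would need checking against the gem's length options.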

@hgmnz
Owner

hgmnz commented Mar 31, 2013

ahhh, thanks. Reopening this then

hgmnz reopened this Mar 31, 2013

@halida
Contributor

halida commented Jun 5, 2013

Aha, truncate_html filters out all Chinese Unicode characters, so this bug still exists.

@halida
Contributor

halida commented Jun 6, 2013

Looks like it works on master, but not in the released gem?

@hgmnz
Owner

hgmnz commented Jun 6, 2013

> Looks like it works on master, but not in the released gem?

Is that the case? There don't seem to be any changes since 0.9.2 that would do that, but it could be accidental.

@halida
Contributor

halida commented Jun 6, 2013

@hgmnz Yes. I use truncate_html to implement the "More" link on this page: http://gurudigger.com/products/tuicool

@alex94040

This is broken in version 0.9.2 of the gem.

@afriqs

afriqs commented Nov 7, 2013

Confirmed: broken in version 0.9.2, and it works for me on the master branch. How about releasing a 0.9.3 gem? ;)

@aguynamedben

This is particularly painful in HTML use cases (e.g. truncating content from TinyMCE), where spaces are dropped seemingly at random because the &nbsp; character is not respected.

In the example below, the second space is the 2-byte Unicode character for &nbsp; (U+00A0):

[34] pry(main)> truncate_html("what about this: ↑")
=> "what aboutthis:"

Using 0.9.2
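
The non-breaking space case seems to have a similar explanation: in Ruby, `\s` only covers ASCII whitespace, so U+00A0 would not be picked up by a `\s+` alternative, while the POSIX bracket `[[:space:]]` is Unicode-aware. A minimal check, using a toy two-alternative regex rather than the gem's (variable names are just for illustration):

```ruby
nbsp = "\u00A0" # the character that &nbsp; decodes to

p nbsp.bytesize                # 2 (bytes in UTF-8)
p nbsp.match?(/\s/)            # false
p nbsp.match?(/[[:space:]]/)   # true

text = "what about\u00A0this"
p text.scan(/\w+|\s+/).join                   # "what aboutthis"
p text.scan(/\w+|[[:space:]]+/).join == text  # true
```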

@aguynamedben

I found this library that does not drop Unicode characters. https://github.com/nono/HTML-Truncator

Time for a beer!

@lachlanjc

This is still an issue: emoji disappear 😢

@togiberlin

I confirm, version 0.9.3 removes Euro (€) and UK Pound Sterling (£) symbols.
