truncate_html does not respect Unicode #35

adamflorin · 2013-02-13T02:36:15Z

A client is running some content with Unicode characters (namely, an up arrow) through truncate_html and noticing that those characters are disappearing.

I've narrowed it down to the scan in TruncateHtml::HtmlString. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.

You can paste this code into an .rb file and run it to see what I mean:

# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."

# From TruncateHtml::HtmlString
# 
def regex
  /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end

# scan normally respects unicode.
puts unicode_string.scan(/.*/).join

# but this regex does not.
puts unicode_string.scan(regex).join

The result at the command line is

Up Arrow (↑) points up.
Up Arrow () points up.

Thanks!
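For anyone reproducing this: a quick way to see exactly which characters a scan pattern loses is to delete every match and look at what remains. The sketch below is only an illustration. `dropped_chars`, `SIMPLE_TOKENS`, and `WITH_CATCH_ALL` are hypothetical names, and the pattern is a simplified stand-in for the gem's regex, not the regex itself.

```ruby
# encoding: utf-8

# Hypothetical helper, not part of truncate_html: returns the characters
# that `pattern` fails to match, i.e. what scan(pattern).join would lose.
def dropped_chars(string, pattern)
  # gsub walks the string the same way scan does, so deleting every
  # match leaves only the characters that no alternative accepted.
  string.gsub(pattern, "").chars
end

# Simplified stand-in for the gem's tokenizer (an assumption, for illustration).
SIMPLE_TOKENS = /[[:alpha:]]+|\s+|[().]/

# The same pattern with a catch-all alternative appended at the end.
WITH_CATCH_ALL = Regexp.union(SIMPLE_TOKENS, /./m)

text = "Up Arrow (\u2191) points up."

p dropped_chars(text, SIMPLE_TOKENS)       # => ["↑"]
p dropped_chars(text, WITH_CATCH_ALL)      # => []
p text.scan(WITH_CATCH_ALL).join == text   # => true
```

A trailing catch-all alternative is one possible way to guarantee that scan never silently discards input. Whether it suits the gem's word-counting logic is a separate question.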

hgmnz · 2013-02-13T19:35:51Z

It's going to take me a little while to describe the regex, I'm afraid, but I'll take this as a bug report and try to fix it soon.

If you get to it sooner, please submit a pull request!

Thanks

adamflorin · 2013-02-13T19:43:44Z

OK, thanks!

dmfrancisco · 2013-03-30T19:50:53Z

I have the same problem using ruby 2.0.0-p0. It does not happen (to me) with ruby 1.9.3. It seems 2.0 uses a new regexp engine, which probably isn't fully backward compatible. I replaced \w with \p{word} (in the regex method) and it looks like that solves this, but I'm not sure of the implications.
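As a rough check of what that substitution changes: in Ruby, `\w` matches only ASCII word characters, while `\p{Word}` is Unicode-aware. The sketch below compares the two on a few sample characters. `matches?` and the constants are hypothetical names used only for this illustration.

```ruby
# encoding: utf-8

E_ACUTE  = "\u00E9" # é, a letter
HAN      = "\u4E2D" # 中, a letter
UP_ARROW = "\u2191" # ↑, a math symbol

# Hypothetical helper: true if `pattern` matches anywhere in `char`.
def matches?(pattern, char)
  !(pattern =~ char).nil?
end

[E_ACUTE, HAN, UP_ARROW].each do |char|
  puts format("%s  \\w: %-5s  \\p{Word}: %-5s",
              char, matches?(/\w/, char), matches?(/\p{Word}/, char))
end
```

Letters such as é and 中 match `\p{Word}` but not `\w`. The up arrow matches neither, so this substitution alone would not be expected to preserve symbols.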

dmfrancisco · 2013-03-31T13:21:04Z

Oops. It seems this has been solved on master already 😃 Thanks for the hard work.

hgmnz · 2013-03-31T16:12:15Z

Thanks for verifying @dmfrancisco :)

dmfrancisco · 2013-03-31T16:35:44Z

Sorry @hgmnz, I should have tested this better before commenting. My tests pass for Portuguese special characters, but I tested the original string provided by @adamflorin and it seems to fail. Example:

truncate_html "café ↑ périferôl"
# => "café  périferôl"

In short, it seems the master branch fixes the issue for alphabets with special characters but not for Unicode symbols.
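The letters-versus-symbols split can be reproduced with Unicode general categories. The sketch below uses two hypothetical patterns (neither is the gem's actual regex): one accepts letters, marks, and numbers, the other also accepts the symbol category `\p{S}`, which covers math, currency, and other symbols.

```ruby
# encoding: utf-8

# Hypothetical tokenizers for illustration only, not the gem's regex.
LETTERS_ONLY = /[\p{L}\p{M}\p{N}]+|\s+|\p{P}/
WITH_SYMBOLS = /[\p{L}\p{M}\p{N}\p{S}]+|\s+|\p{P}/

accented = "caf\u00E9 \u2191 p\u00E9rifer\u00F4l" # "café ↑ périferôl"
currency = "\u20AC5 or \u00A33"                   # "€5 or £3"

p accented.scan(LETTERS_ONLY).join              # => "café  périferôl"
p accented.scan(WITH_SYMBOLS).join == accented  # => true
p currency.scan(LETTERS_ONLY).join              # => "5 or 3"
p currency.scan(WITH_SYMBOLS).join == currency  # => true
```

Adding `\p{S}` to the word class is one possible direction. It treats symbols as part of a word for counting purposes, which may or may not be the behavior the gem wants.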

hgmnz · 2013-03-31T18:08:41Z

ahhh, thanks. Reopening this then

halida · 2013-06-05T12:24:58Z

Aha, truncate_html filters out all Chinese Unicode words, so this bug still exists.

halida · 2013-06-06T00:16:20Z

Looks like it works on master, but not in the released gem?

hgmnz · 2013-06-06T03:41:17Z

> Looks like it works on master, but not in the released gem?

Is that the case? There don't seem to be any changes since 0.9.2 that would do that, but it could be accidental.

halida · 2013-06-06T03:44:26Z

@hgmnz Yes. I use truncate_html to implement "More" on this page: http://gurudigger.com/products/tuicool

alex94040 · 2013-10-15T23:29:38Z

This is broken in version 0.9.2 of the gem.

afriqs · 2013-11-07T10:20:33Z

I confirm: broken in version 0.9.2, and it works for me using the master branch. What about a new 0.9.3 gem? ;)

aguynamedben · 2014-07-17T00:43:47Z

This is particularly painful in HTML use cases (e.g. truncating content from TinyMCE), where random spaces are dropped because the &nbsp; character is not respected.

In the example below, the second space is the 2-byte Unicode character for &nbsp; (U+00A0).

[34] pry(main)> truncate_html("what about this: ↑")
=> "what aboutthis:"

Using 0.9.2
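The dropped space is consistent with how Ruby's `\s` works: it matches only ASCII whitespace, whereas the POSIX bracket `[[:space:]]` is Unicode-aware and includes U+00A0. The sketch below shows the difference with a simplified, hypothetical word/space tokenizer (not the gem's regex).

```ruby
# encoding: utf-8

NBSP = "\u00A0" # non-breaking space
text = "what about#{NBSP}this"

p NBSP.bytesize                               # => 2 (when encoded as UTF-8)
p text.scan(/\w+|\s+/).join                   # => "what aboutthis"
p text.scan(/\w+|[[:space:]]+/).join == text  # => true
```

Swapping `\s` for `[[:space:]]` in a tokenizer is one possible way to keep non-breaking spaces. This is an assumption about a fix, not something confirmed against the gem's source.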

aguynamedben · 2014-07-17T01:54:08Z

I found this library that does not drop Unicode characters. https://github.com/nono/HTML-Truncator

Time for a beer!

lachlanjc · 2015-11-23T04:12:12Z

This is still an issue: emoji disappear 😢

togiberlin · 2016-02-05T09:07:43Z

I confirm, version 0.9.3 removes Euro (€) and UK Pound Sterling (£) symbols.

hgmnz closed this as completed Mar 31, 2013

hgmnz reopened this Mar 31, 2013

wwcline mentioned this issue Mar 17, 2016

Support for non alphabetical unicode characters #58

Open
