Text artifacts on sites not using UTF-8 (#7) #87

necaran · 2024-08-30T23:11:33Z

When the encoding is not UTF-8, the page breaks.
It not only garbles the text but sometimes also breaks the page content.

Website: https://www.jalan.net
Encoding: Shift_JIS

Full screenshots

Original	UltimaDark

SagXD · 2024-08-31T07:59:20Z

#7

necaran · 2024-09-10T21:23:51Z

I just found that the broken/incomplete content might be a different issue.
It breaks scripts on some websites regardless of the encoding.
For example, https://hub.docker.com (UTF-8) was stuck at the spinning icon and completely unusable.

Version: v1.5.54
Error at https://www.jalan.net

Uncaught TypeError: c is undefined
    rentacarChk https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:2051
    setPeriod https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1205
    setTimeout handler*setPeriod https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1176
    refreshStayLength https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1073
    setTimeout handler*refreshStayLength https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1072
    returnMonthOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1033
    setTimeout handler*returnMonthOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1014
    deptDateOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:992
    setTimeout handler*deptDateOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:980
    deptMonthOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:938
    init https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:172
    jQuery 11
dp2_d_home.js:2051:16

Error at https://hub.docker.com/

Uncaught TypeError: S is null
    rZ https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    rZ https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    Rl https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    no https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    no https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    iL https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:37
    o https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-KYJ4F7QJ.js:1
    aL https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:37
    p https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-KYJ4F7QJ.js:1
    ee https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:37
    p https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-KYJ4F7QJ.js:1
    <anonymous> https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-NB7LPVG4.js:4
chunk-27NNUPQV.js:33:23374

ThomazPom · 2024-09-11T11:54:28Z

Thank you for taking the time to test version 1.5.54, which has been released but not yet published. You identified the issue with hub.docker.com that was introduced in this version before it could be published.

chmichael · 2024-10-08T22:53:09Z

Broken Text (Greek)
https://www.hyperhosting.gr/grdomains/
https://www.e-shop.gr/

ThomazPom · 2024-10-09T20:07:28Z

This bug is one of the oldest in UltimaDark, but I've recently made a promising breakthrough in understanding its root cause. All affected sites share a common factor: their pages are edited by JavaScript, and both these JavaScript files and the pages themselves are encoded using a charset other than UTF-8.

To modify the page, UltimaDark decodes it from the source's detected charset and re-encodes it to UTF-8, which is the only charset supported by the TextEncoder. The problem arises when the page, now assuming everything is UTF-8, tries to decode included scripts as UTF-8, leading to encoding issues.
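
A minimal sketch of the mismatch described above, assuming a Shift_JIS page and using only the standard TextDecoder and TextEncoder APIs. The byte values and variable names are illustrative and are not taken from UltimaDark's code.

```javascript
// Shift_JIS bytes for "東京", standing in for content served in a legacy charset.
const sjisBytes = Uint8Array.of(0x93, 0x8c, 0x8b, 0x9e);

// Step 1: decode the page from its detected charset so it can be edited as text.
const pageText = new TextDecoder("shift_jis").decode(sjisBytes);

// Step 2: re-encode the edited text. TextEncoder can only produce UTF-8.
const pageUtf8 = new TextEncoder().encode(pageText);

// The page is now UTF-8, but an included script is still Shift_JIS on the wire.
// Decoding those untouched bytes as UTF-8 yields U+FFFD replacement characters.
const scriptMisread = new TextDecoder("utf-8").decode(sjisBytes);

console.log(pageText, pageUtf8.length, scriptMisread.includes("\uFFFD"));
```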

I think there are two valid solutions here:

1.	Intercept the scripts and re-encode them to UTF-8 before they are executed, exactly as we already do for the webpage.
2.	Alternatively, restore the original charset of each script in the script tag, as outlined here: Script charset attribute.

chmichael · 2024-10-09T21:35:37Z

I think the 2nd is the fastest?

ThomazPom · 2024-10-09T22:12:06Z

That is true, but the charset attribute is deprecated. Using it would require searching through pages with incorrect base encodings and applying this deprecated attribute to their scripts. Additionally, we would need to monitor the page for the dozens of different methods a web developer could use to add a script, such as createElement, cloneNode, and insertAfter. UltimaDark already has similar mechanisms in place, but it's always a pain to cover them all.

Solution 1, implemented in version 1.5.70, is triggered only when encoding mismatches are detected, significantly reducing the need for re-encoding. The re-encoding process itself is not resource-intensive enough to be noticeable in terms of performance.

Take a look at the commit.
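
A rough sketch of what "re-encode only on mismatch" could look like. This is not the code from the commit; the function names and the idea of comparing canonical TextDecoder names are assumptions made for illustration.

```javascript
// Hypothetical helper: normalize a charset label ("UTF8", "sjis", ...) to the
// canonical name reported by TextDecoder, so that aliases compare equal.
function canonicalCharset(label) {
  return new TextDecoder(label).encoding;
}

// Hypothetical helper: return the resource bytes unchanged when its charset
// already matches the document's, otherwise transcode them to UTF-8.
function toUtf8IfMismatched(bytes, resourceCharset, documentCharset) {
  if (canonicalCharset(resourceCharset) === canonicalCharset(documentCharset)) {
    return bytes; // no mismatch, nothing to re-encode
  }
  const text = new TextDecoder(resourceCharset).decode(bytes);
  return new TextEncoder().encode(text); // TextEncoder always emits UTF-8
}

// A Shift_JIS script ("東京") included by a page that has been rewritten as UTF-8.
const scriptBytes = Uint8Array.of(0x93, 0x8c, 0x8b, 0x9e);
const fixedBytes = toUtf8IfMismatched(scriptBytes, "shift_jis", "utf-8");
console.log(new TextDecoder("utf-8").decode(fixedBytes));
```

Skipping the work when the labels already agree keeps the common all-UTF-8 case free of any extra decoding.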

chmichael · 2024-10-09T22:59:12Z

Well, I just tested 1.5.70 and the fix didn't work.

necaran · 2024-10-10T05:11:44Z

With solution 1, the page sends requests back to the server (via HTML form or XHR) in the wrong encoding.
For example, if I search for 東京 (Tokyo), it returns the results for 譚ｱ莠ｬ.
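
The garbled query above is consistent with the page sending UTF-8 bytes to a server that still expects Shift_JIS. A small sketch that reproduces it with the standard encoding APIs (the variable names are illustrative):

```javascript
// What the user typed into the search box.
const query = "東京";

// The page, now treated as UTF-8, submits the query as UTF-8 bytes.
const utf8Bytes = new TextEncoder().encode(query);

// The server still assumes Shift_JIS and decodes those bytes accordingly.
const asServerSeesIt = new TextDecoder("shift_jis").decode(utf8Bytes);

console.log(asServerSeesIt);
```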

ThomazPom · 2024-10-10T06:56:10Z

The same result would have occurred with either solution.
I wonder whether I also need to re-encode XHR calls, and whether that would give the same result.
I feel there might be dozens of edge cases. I'm looking forward to solutions.

necaran · 2024-10-10T08:31:30Z

I have not dug into the code, but uBlock Origin (scriptlet injection, resource redirects) and userscript loaders such as Violentmonkey inject scripts without messing up the charset.
At first glance, both uBlock Origin and Violentmonkey use TextEncoder, but they do not touch the charset of the original page.

ThomazPom · 2024-10-16T21:48:42Z

1.5.71:
I’ve fixed the issue with pages using non-UTF-8 charsets. Instead of forcing the page into UTF-8, which caused all kinds of issues, I now just let the page keep its original charset. I’m using a hacky but effective method to still write in UTF-8 without messing with the page’s encoding. This avoids the trouble of re-encoding into incompatible charsets. There’s still a potential issue if the page explicitly uses the charset keyword in CSS (not verified), or the content property in CSS (not verified), but this should be rare enough not to worry about.

Let me know if the result is acceptable or completely off 😂😂😂.

Vintagemotors · 2024-10-17T02:51:20Z

jalan.net has a JavaScript error when you click one of the suggested words/phrases to search, and it is also very slow to load. hyperhosting.gr/grdomains is blank (1.5.72).

necaran · 2024-10-17T08:25:28Z

Found some pages that are still garbled as of 1.5.72:

http://charset.7jp.net/ (Shift_JIS)
http://charset.7jp.net/euc.html (EUC-JP)
http://charset.7jp.net/jis.html (ISO-2022-JP)

p.s. There is a UTF-8 page, http://charset.7jp.net/jis0212.html, on the same site, and it is rendered correctly.

ThomazPom · 2024-10-17T16:20:50Z

I've identified and resolved some bugs related to the new parsing method, which is now available in version 1.5.73.

This update addresses templating issues for the following sites:

hyperhosting.gr
jalan.net

It does not resolve the charset issues, which will require more in-depth investigation.

ThomazPom · 2024-10-18T17:07:38Z

Release v1.5.74 is available now. It fixes more templating issues and more charset issues, including:
http://charset.7jp.net/ (Shift_JIS)
http://charset.7jp.net/euc.html (EUC-JP)
http://charset.7jp.net/jis.html (ISO-2022-JP)

necaran · 2024-10-18T19:03:26Z

Thank you for the hard work! Now they are readable.
But there are still some glitches.

https://www.jalan.net/
Some scripts are not working.
If you click a keyword from the popup panel, the keyword should be added to the input, but with UltimaDark it is not.
In addition, scripts to load banners are not working (uBlock and Tracking Protection disabled)

<script type="text/template" id="jsi-template-feed-banner">
  <a href="<%= link %>" class="dp_long_bnr3">
    <img src="<%= images[0] %>" alt="">
  </a>
  <a href="<%= link %>" class="dp_long_bnr2">
    <img src="<%= images[1] %>" alt="">
  </a>
</script>

https://www.4gamer.net/
It uses @charset in CSS and pseudo-elements for text content.
記事 X Postランキング becomes 鐃緒申鐃緒申 X Post鐃緒申鵐⑤鐃� on the page.

@charset "EUC-JP";

...

.tweet_ranking_container:before,
.tweet_ranking_container.normal:before{
	width:160px;
	margin-left:-80px;
	content:"記事 X Postランキング";
}

https://www.mediafire.com/file/lytq9t6wxaypsyl/DwarvenTodcraft.ttf/file
The download page of mediafire is blank, probably because of templating issues.

ThomazPom · 2024-10-20T16:25:03Z

Since the workaround does not fully work as expected, we will have to find a way to re-encode properly into every charset that TextDecoder supports, which can be any of the values listed below.

That's a lot of work.

https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/encoding
The recommended encoding for the Web: 'utf-8'.
The legacy single-byte encodings: 'ibm866', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-8i', 'iso-8859-10', 'iso-8859-13', 'iso-8859-14', 'iso-8859-15', 'iso-8859-16', 'koi8-r', 'koi8-u', 'macintosh', 'windows-874', 'windows-1250', 'windows-1251', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', or 'x-mac-cyrillic'.
The legacy multi-byte Chinese (simplified) encodings: 'gbk', 'gb18030'.
The legacy multi-byte Chinese (traditional) encoding: 'big5'.
The legacy multi-byte Japanese encodings: 'euc-jp', 'iso-2022-jp', 'shift-jis'.
The legacy multi-byte Korean encodings: 'euc-kr'.
The legacy miscellaneous encodings: 'utf-16be', 'utf-16le', 'x-user-defined'.
A special encoding, 'replacement'. This decodes empty input into empty output and any other arbitrary-length input into a single replacement character. It is used to prevent attacks that mismatch encodings between the client and server. The following encodings also map to the replacement encoding: ISO-2022-CN, ISO-2022-CN-ext, 'iso-2022-kr', and 'hz-gb-2312'.
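
The asymmetry behind the list above can be checked directly: TextDecoder accepts many labels, while TextEncoder only ever reports 'utf-8'. A small sketch, where isDecodable is a hypothetical helper name:

```javascript
// Hypothetical helper: report whether this runtime's TextDecoder accepts a label.
// The constructor throws a RangeError for labels it does not support.
function isDecodable(label) {
  try {
    new TextDecoder(label);
    return true;
  } catch (err) {
    if (err instanceof RangeError) return false;
    throw err; // anything else is unexpected
  }
}

for (const label of ["utf-8", "shift_jis", "euc-jp", "no-such-charset"]) {
  console.log(label, isDecodable(label));
}

// TextEncoder has no charset parameter at all; its encoding is always UTF-8.
console.log(new TextEncoder().encoding);
```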

ThomazPom · 2024-10-26T22:04:10Z

I've dedicated significant effort to this issue and found a way to re-encode into each of these encodings. Since the browser's TextEncoder only supports UTF-8, I created an algorithm that observes decoded values, learns from them, and maps characters to their byte values from the original data.

During re-encoding, the algorithm places the mapped byte values for each charset back into the data stream, ensuring accuracy for each character. You can try version 1.5.75.
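
The following is only a sketch of the general idea, not UltimaDark's actual algorithm. It assumes a stateless, ASCII-compatible charset such as Shift_JIS, and the names learnByteMap and reencode are hypothetical. It feeds the original bytes to a streaming TextDecoder one at a time and records which bytes produced each character, then reuses those bytes when writing the edited text back.

```javascript
// Hypothetical helper: learn "character -> original bytes" from one document.
function learnByteMap(bytes, charset) {
  const decoder = new TextDecoder(charset);
  const map = new Map();
  let pending = [];
  for (const byte of bytes) {
    pending.push(byte);
    // In streaming mode the decoder returns "" until a full character is available.
    const out = decoder.decode(Uint8Array.of(byte), { stream: true });
    if (out.length > 0) {
      if (!map.has(out)) map.set(out, pending);
      pending = [];
    }
  }
  return map;
}

// Hypothetical helper: write text back using the learned bytes.
function reencode(text, map) {
  const out = [];
  for (const ch of text) {
    const known = map.get(ch);
    const codePoint = ch.codePointAt(0);
    if (known) {
      out.push(...known); // a character seen in the original data
    } else if (codePoint < 0x80) {
      out.push(codePoint); // assumption: the charset is ASCII-compatible
    } else {
      // Never seen in the original: fall back to an HTML numeric character reference.
      for (const c of `&#${codePoint};`) out.push(c.charCodeAt(0));
    }
  }
  return Uint8Array.from(out);
}

// Usage: decode, edit the markup, then write it back in the original charset.
const original = Uint8Array.of(0x3c, 0x70, 0x3e, 0x93, 0x8c, 0x8b, 0x9e); // "<p>東京"
const byteMap = learnByteMap(original, "shift_jis");
const editedText = new TextDecoder("shift_jis").decode(original).replace("<p>", '<p class="ud">');
console.log(reencode(editedText, byteMap));
```

A stateful charset such as ISO-2022-JP would need extra care, because its escape sequences produce no output of their own and would be attributed to the following character.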

Some charsets might still be broken, but this is a good proof of concept.

ThomazPom · 2024-10-27T08:03:32Z

jalan.net has a JavaScript error when you click one of the suggested words/phrases to search, and it is also very slow to load. hyperhosting.gr/grdomains is blank (1.5.72).

The new re-encoding algorithm correctly renders jalan.net. As for hyperhosting.gr/grdomains, the issue was related to a templating quirk I’ve just fixed in version 1.5.76.

Vintagemotors · 2024-11-03T20:27:39Z

1.5.71 to 1.5.76 (current) were dedicated to fixing non utf8 websites. Fonts and CSSes are now loaded in their original encoding, eliminating the content keyword issue.

Does this mean we can consider the non-UTF-8 encoding issues resolved?

necaran · 2024-11-04T07:42:53Z

The CSS issues described in #87 (comment) are still present (https://www.4gamer.net/).
There are also glitches on https://www.jalan.net/yad348331/ where a custom font and escape codes are used to show icons.

https://www.jalan.net/assets/jalan-iconfont/jalan-iconfont.css

.ji-activity::before{content:"\F001";color:#FFF}

With UltimaDark (as shown by devtools)

.ji-activity::before { content: "��"; color: rgb(255, 255, 255); }
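
One possible way to keep characters like U+F001 intact, sketched here as an assumption rather than as what UltimaDark does: when serializing edited CSS, write every non-ASCII character as a CSS hex escape so the output is pure ASCII and no longer depends on the stylesheet's charset. The name escapeNonAscii is hypothetical.

```javascript
// Hypothetical helper: replace each non-ASCII character with a CSS escape.
// A CSS escape is a backslash, 1 to 6 hex digits, and an optional single
// whitespace that terminates it, so a trailing space is added to keep the
// escape from swallowing any hex-looking character that follows.
function escapeNonAscii(cssText) {
  let out = "";
  for (const ch of cssText) {
    const codePoint = ch.codePointAt(0);
    out += codePoint < 0x80 ? ch : "\\" + codePoint.toString(16).toUpperCase() + " ";
  }
  return out;
}

// The parsed rule holds the real private-use character; escape it again on output.
console.log(escapeNonAscii('.ji-activity::before{content:"\uF001";color:#FFF}'));
```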

Original	UltimaDark

p.s.
This is off topic, but I found that the HTML4 font color is not showing correctly.
The variable names for [ud-html4-support] elements do not match.

https://www.quackit.com/html/html_editors/scratchpad/?example=/html/html_4/tags/html_font_tag:
The variable name in element.style is --ud-html4-color, while the one in the injected CSS is --ud-fg--ud-html4-color.

<font face="cursive,serif" size="4" style="--ud-html4-color: #ffb84d;" ud-html4-support="true">The HTML font tag is now deprecated. You should use <a href="/css/properties/css_font.cfm" target="_blank">CSS font</a> to set font properties instead.</font>

[ud-html4-support] { color: var(--ud-fg--ud-html4-color,rgba(255,255,255,1)) !important; }

ThomazPom · 2024-11-05T16:51:35Z

Hello @Vintagemotors, @chmichael, @necaran, and @SagXD,

Note: The following information applies specifically to non-UTF-8 pages. While these pages are less common, they still exist on the web.

@Vintagemotors, the UltimaDark embedded re-encoder relies on its built-in decoder's ability to learn. In its standard mode, this capability requires a configuration for each charset (which you can find here) that maps encoding byte counts to character code point ranges. If there is a misalignment in this map, adjustments are needed, and we might notice diamond question marks or incorrect characters in the HTML output.

The standard decoding mode typically takes 7 to 15 ms to process the heaviest pages.

Additionally, the UltimaDark decoder has a dynamic mode, which operates without needing specific configurations. However, it requires 20 to 30 ms for decoding the heaviest non-UTF-8 pages. I think it would be beneficial to enable this mode temporarily for your feedback. You can activate it until the browser restarts by pasting uDarkDecodeSimple=dynamicDecoderCharacterCounter into the UltimaDark Inspect console under about:debugging#/runtime/this-firefox.

As @necaran pointed out, decoding issues can sometimes stem from the less common @charset CSS rule, which lets a stylesheet declare a charset of its own that may differ from the page's. I have identified a technique to handle these cases, but it hasn't been implemented yet.
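
For reference, detecting such a declaration is fairly mechanical, because CSS only honors @charset when it is the very first thing in the file, written exactly as @charset "name"; and within the first 1024 bytes. The sketch below is not the technique mentioned above (which is not described in this thread); sniffCssCharset is a hypothetical name.

```javascript
// Hypothetical helper: return the label declared by a leading @charset rule,
// or null when the stylesheet does not start with one.
function sniffCssCharset(bytes) {
  // Only the first 1024 bytes are inspected. Each byte is mapped to one
  // character so that the ASCII prefix can be matched with a regular expression.
  const head = String.fromCharCode(...bytes.subarray(0, 1024));
  const match = /^@charset "([^"]*)";/.exec(head);
  return match ? match[1] : null;
}

const sheet = new TextEncoder().encode('@charset "EUC-JP";\n.tweet_ranking_container:before{width:160px}');
console.log(sniffCssCharset(sheet));
```

The returned label could then be passed to TextDecoder in place of the page's charset when decoding that stylesheet.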

@necaran, thank you for catching that the HTML4 support mode broke! I'll fix it soon. Looks like a mistake.

@Vintagemotors, I've made significant advancements in darkening techniques, which required me to recode large and core sections of UltimaDark. The new structure aims to make the code more accessible for other developers, without compromising on quality. Due to the scale of these changes, there may be some minor, resolvable regressions on a few websites.

Thank you all for your continued support and feedback!

Vintagemotors · 2024-11-06T00:21:50Z

Your new dynamic decoder mode does not appear to correct the remaining question mark characters, at least in this case. I will, however, continue investigating how hard it would be to set up test site(s) to validate these alternate encodings more thoroughly than by relying on organically encountering issues. Also, the issue with UltimaDark tampering with the site cache, which then fails to be cleared by Ctrl+F5, is still present: it happened on jalan.net (the corruption remains even with the extension disabled, and after installing a different version and refreshing).

1st example

Original (Corrupted cache)	UltimaDark with and without dynamic decoder	Original

Refreshing the console before or after sending uDarkDecodeSimple=dynamicDecoderCharacterCounter does not appear to influence these results at least, though as necaran mentioned, these cases are probably CSS related.

2nd example

ThomazPom · 2024-11-06T08:30:57Z

I forgot to mention that clicking the refresh button of the UltimaDark console is mandatory before overriding the mode to dynamic, as the decoder could have been polluted by what the normal mode wrongly learned from its configurations.
This refresh button is also handy because it restores standard mode as the decoding method without restarting the browser, until the dynamic override is pasted again.

The issues can also arise from CSS using @charset, especially for floating text like here, since that is typical of the CSS content keyword. As said, these cases are not covered yet.
Typically, this happens if there is an @charset declaration before a rule like this one, which @necaran noticed:
.ji-activity::before{content:"\F001";color:#FFF}

I think the dynamic mode works on https://www.4gamer.net/ in the places where it does not use the CSS content keyword.

About testing a lot of URLs, my method is as follows:

1.	Decode (step 1)
2.	Re-encode
3.	Decode again (step 2)
4.	Compare the results from step 1 and step 2.

If the re-encoding is incorrect, the automatic comparison will detect it. I can also provide a test function if needed.
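
The four steps above can be sketched as follows. This is not the test function offered above; roundTripOk and its reencode parameter are hypothetical names, and the re-encoder under test is passed in as a callback.

```javascript
// Hypothetical helper: true when decode -> re-encode -> decode reproduces the same text.
function roundTripOk(originalBytes, charset, reencode) {
  const first = new TextDecoder(charset).decode(originalBytes); // 1. decode
  const reencoded = reencode(first);                            // 2. re-encode
  const second = new TextDecoder(charset).decode(reencoded);    // 3. decode again
  return first === second;                                      // 4. compare
}

// A correct re-encoder for UTF-8 input passes the check.
const toUtf8 = (text) => new TextEncoder().encode(text);
console.log(roundTripOk(toUtf8("東京"), "utf-8", toUtf8));

// The same UTF-8 re-encoder is wrong for Shift_JIS input, and the check catches it.
const sjisSample = Uint8Array.of(0x93, 0x8c, 0x8b, 0x9e);
console.log(roundTripOk(sjisSample, "shift_jis", toUtf8));
```

Because both decoding passes use the same decoder, this comparison only catches mistakes in the re-encoder, not in the decoding itself.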

Also nex

Thank you!

Vintagemotors changed the title ~~Page broken when encoding is not UTF-8~~ Text artifacts on sites not using UTF-8 (#7) Sep 2, 2024

Vintagemotors added bug Something isn't working Critical labels Sep 11, 2024
