Text artifacts on sites not using UTF-8 (#7) #87

Open
necaran opened this issue Aug 30, 2024 · 25 comments
Labels
bug Something isn't working Critical

Comments

@necaran

necaran commented Aug 30, 2024

When a page's encoding is not UTF-8, the page breaks.
Not only is the text garbled, but the page content is sometimes broken as well.

Website: https://www.jalan.net
Encoding: Shift_JIS

[Screenshots: cropped view; full screenshots of Original vs. UltimaDark]
@SagXD

SagXD commented Aug 31, 2024

#7

@Vintagemotors Vintagemotors changed the title Page broken when encoding is not UTF-8 Text artifacts on sites not using UTF-8 (#7) Sep 2, 2024
@necaran
Author

necaran commented Sep 10, 2024

I just found that the broken/incomplete content might be a different issue.
It breaks scripts on some websites regardless of the encoding.
For example, https://hub.docker.com (UTF-8) was stuck at the spinning icon and completely unusable.

Uncaught TypeError: c is undefined
    rentacarChk https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:2051
    setPeriod https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1205
    setTimeout handler*setPeriod https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1176
    refreshStayLength https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1073
    setTimeout handler*refreshStayLength https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1072
    returnMonthOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1033
    setTimeout handler*returnMonthOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:1014
    deptDateOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:992
    setTimeout handler*deptDateOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:980
    deptMonthOnChange https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:938
    init https://www.jalan.net/jalan/doc/top/js/dp2_d_home.js?update=20191105:172
    jQuery 11
dp2_d_home.js:2051:16
Uncaught TypeError: S is null
    rZ https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    rZ https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    Rl https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    no https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    no https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:33
    iL https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:37
    o https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-KYJ4F7QJ.js:1
    aL https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:37
    p https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-KYJ4F7QJ.js:1
    ee https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-27NNUPQV.js:37
    p https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-KYJ4F7QJ.js:1
    <anonymous> https://d36jcksde1wxzq.cloudfront.net/hub-ui/_shared/chunk-NB7LPVG4.js:4
chunk-27NNUPQV.js:33:23374

@ThomazPom
Owner

Thank you for taking the time to test version 1.5.54, which has been released but not yet published. You identified the issue with hub.docker.com that was introduced in this version before it could be published.

@Vintagemotors Vintagemotors added bug Something isn't working Critical labels Sep 11, 2024
@chmichael

chmichael commented Oct 8, 2024

@ThomazPom
Owner

This bug is one of the oldest in UltimaDark, but I've recently made a promising breakthrough in understanding its root cause. All affected sites share a common factor: their pages are being edited by JavaScript. However, these JavaScript files and the pages themselves are encoded using a charset other than UTF-8.

To modify the page, UltimaDark decodes it from the source's detected charset and re-encodes it to UTF-8, which is the only charset supported by the TextEncoder. The problem arises when the page, now assuming everything is UTF-8, tries to decode included scripts as UTF-8, leading to encoding issues.
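
To illustrate the mismatch described above, here is a minimal sketch. It is not UltimaDark's actual code: the helper name `transcodeToUtf8` and the sample bytes are made up for illustration. It converts Shift_JIS content to UTF-8 the way the paragraph describes, then shows what happens when an untouched Shift_JIS script is decoded as UTF-8.

```javascript
// Hypothetical helper: decode bytes from the detected charset, re-encode as UTF-8.
// TextEncoder can only produce UTF-8, so this direction is the easy one.
function transcodeToUtf8(bytes, charset) {
  const text = new TextDecoder(charset).decode(bytes);
  return new TextEncoder().encode(text);
}

// Shift_JIS bytes for the two characters 東京.
const sjisTokyo = Uint8Array.from([0x93, 0x8c, 0x8b, 0x9e]);

// The page itself survives the conversion...
const pageAsUtf8 = transcodeToUtf8(sjisTokyo, 'shift_jis');
const pageText = new TextDecoder('utf-8').decode(pageAsUtf8);

// ...but a script that is still Shift_JIS now gets decoded as UTF-8,
// and every byte that is not valid UTF-8 becomes U+FFFD.
const scriptText = new TextDecoder('utf-8').decode(sjisTokyo);

console.log(pageText);
console.log(scriptText.includes('\uFFFD'));
```

The sketch assumes a runtime whose TextDecoder supports legacy encodings (browsers, or Node.js built with full ICU).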

I think there are two valid solutions here:

  1. A potential solution would be to intercept the scripts and re-encode them to UTF-8 before they are executed, exactly as we already do for the webpage.

  2. Alternatively, we could restore the original charset of each script in the script tag, as outlined here: Script charset attribute.

@chmichael

I think the 2nd is the fastest?

@ThomazPom
Owner

It is true, but the charset attribute is deprecated. Using it would require searching through pages with incorrect base encodings and applying this deprecated attribute to their scripts. Additionally, we would need to monitor the page for the dozens of different methods a web developer could use to add a script, such as createElement, cloneNode, and insertAfter. UltimaDark already has similar mechanisms in place, but it's always a pain to cover them all.

Solution 1, implemented in version 1.5.70, is triggered only when encoding mismatches are detected, significantly reducing the need for re-encoding. The re-encoding process itself is not resource-intensive enough to be noticeable in terms of performance.
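
How the mismatch is detected is not shown in this thread. As a rough sketch under that caveat, one way to decide whether a resource needs re-encoding is to compare the canonical name of its declared charset with the target encoding. All function names below are hypothetical, and reading the charset from the Content-Type header is an assumption.

```javascript
// Hypothetical helpers; the names are illustrative, not UltimaDark's API.

// Extract the charset label from a Content-Type header value, if any.
function charsetFromContentType(header) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(header || '');
  return match ? match[1] : null;
}

// Map a known label ("Shift_JIS", "UTF8", ...) to its canonical encoding name.
// TextDecoder throws a RangeError for labels it does not recognize.
function canonicalEncoding(label) {
  try {
    return new TextDecoder(label).encoding;
  } catch {
    return null;
  }
}

// Re-encode only when the declared encoding is known and differs from the target.
function needsReencoding(contentType, target = 'utf-8') {
  const label = charsetFromContentType(contentType);
  const declared = label ? canonicalEncoding(label) : null;
  return declared !== null && declared !== canonicalEncoding(target);
}

console.log(needsReencoding('text/javascript; charset=Shift_JIS'));
console.log(needsReencoding('text/javascript; charset=UTF8'));
```

A resource without a declared charset is treated here as not needing re-encoding; a real implementation would probably fall back to the page's original charset instead.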

Take a look at
Commit

@chmichael

Well just tested 1.5.70 and the fix didn't work.

@necaran
Author

necaran commented Oct 10, 2024

With solution 1, the page sends requests back to the server in the wrong encoding (via HTML form or XHR).
For example, if I search for 東京 (Tokyo), it returns the result for 譚ア莠ャ.
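
The garbled query is what one would expect if the page now submits UTF-8 bytes to a server that still decodes them as Shift_JIS. The sketch below illustrates that presumed mechanism; it is not code from UltimaDark or jalan.net.

```javascript
const query = '東京';

// What a page forced into UTF-8 would send: the UTF-8 bytes of the query.
const sentBytes = new TextEncoder().encode(query);

// What a server expecting Shift_JIS would make of those bytes.
const receivedText = new TextDecoder('shift_jis').decode(sentBytes);

// Prints garbled text instead of the original query.
console.log(receivedText);
```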

@ThomazPom
Owner

The same result would have occurred with either solution.
I wonder whether I also need to re-encode XHR calls, and whether that would give the same result.
I feel there might be dozens of edge cases. I'm looking forward to solutions.

@necaran
Author

necaran commented Oct 10, 2024

I have not dug into the code, but uBlock Origin (scriptlet injection, resource redirects) and userscript loaders such as Violentmonkey inject scripts without messing up the charset.
At first glance, both uBlock Origin and Violentmonkey use TextEncoder, but they do not touch the charset of the original page.

@ThomazPom
Owner

1.5.71:
I’ve fixed the issue with pages using non-UTF-8 charsets. Instead of forcing the page into UTF-8, which caused all kinds of issues, I now just let the page keep its original charset. I’m using a hacky but effective method to still write in UTF-8 without messing with the page’s encoding. This avoids the trouble of re-encoding into incompatible charsets. There’s still a potential issue if the page explicitly uses the charset keyword in CSS (not verified), or the content property in CSS (not verified), but this should be rare enough not to worry about.

Let me know if the result is acceptable or completely off 😂😂😂.

@Vintagemotors
Collaborator

jalan.net throws a JavaScript error when you click one of the suggested search words/phrases, and it is also very slow to load. hyperhosting.gr/grdomains is blank (1.5.72).

@necaran
Author

necaran commented Oct 17, 2024

Found some pages that are still garbled as of 1.5.72:

p.s. There is a UTF-8 page, http://charset.7jp.net/jis0212.html, on the same site, and it is rendered correctly.

@ThomazPom
Owner

ThomazPom commented Oct 17, 2024

I've identified and resolved some bugs related to the new parsing method, which is now available in version 1.5.73.

While this update addresses templating issues for the following sites:

hyperhosting.gr
jalan.net

It does not resolve charset issues, which will require more in-depth investigation at this time.

@ThomazPom
Owner

Release v1.5.74 is available now. It fixes more templating issues and more charset issues:
http://charset.7jp.net/ (Shift_JIS)
http://charset.7jp.net/euc.html (EUC-JP)
http://charset.7jp.net/jis.html (ISO-2022-JP)

@necaran
Author

necaran commented Oct 18, 2024

Thank you for the hard work! Now they are readable.
But there are still some glitches.

  • https://www.jalan.net/
    Some scripts are not working.
    If you click a keyword from the popup panel, the keyword should be added to the input, but with UltimaDark it is not.
    In addition, scripts to load banners are not working (uBlock and Tracking Protection disabled)
<script type="text/template" id="jsi-template-feed-banner">
  <a href="<%= link %>" class="dp_long_bnr3">
    <img src="<%= images[0] %>" alt="">
  </a>
  <a href="<%= link %>" class="dp_long_bnr2">
    <img src="<%= images[1] %>" alt="">
  </a>
</script>

[Screenshot: jalan.net]

  • https://www.4gamer.net/
    It uses @charset in CSS and pseudo-elements for text content.
    記事 X Postランキング becomes 鐃緒申鐃緒申 X Post鐃緒申鵐⑤鐃� on the page.
@charset "EUC-JP";

...

.tweet_ranking_container:before,
.tweet_ranking_container.normal:before{
	width:160px;
	margin-left:-80px;
	content:"記事 X Postランキング";
}

[Screenshot: 4gamer.net]
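
The 鐃緒申 pattern is consistent with the stylesheet's EUC-JP bytes being decoded as UTF-8 (each invalid byte becomes U+FFFD), written back out as UTF-8, and then decoded as EUC-JP because of the @charset rule. That is a guess at the mechanism, not a confirmed diagnosis; the sketch below only reproduces the symptom for the first two characters, 記事.

```javascript
// EUC-JP bytes for 記事 (two double-byte characters).
const eucBytes = Uint8Array.from([0xb5, 0xad, 0xbb, 0xf6]);

// Step 1: the bytes are wrongly decoded as UTF-8. None of them is valid
// UTF-8 here, so each one becomes U+FFFD.
const wronglyDecoded = new TextDecoder('utf-8').decode(eucBytes);

// Step 2: the damaged text is written back as UTF-8 (EF BF BD per U+FFFD).
const rewritten = new TextEncoder().encode(wronglyDecoded);

// Step 3: the bytes are decoded as EUC-JP, as @charset "EUC-JP" requests.
const displayed = new TextDecoder('euc-jp').decode(rewritten);

console.log(displayed);
```

Once step 1 has happened, the original characters are gone: all four bytes collapse into the same replacement character, so no later re-encoding can recover them.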

@ThomazPom
Owner

ThomazPom commented Oct 20, 2024

Since the workaround does not fully work as expected, we will have to find a way to re-encode properly into every charset TextDecoder supports, which can be any of the values listed below.

That's a lot of work.

https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/encoding
The recommended encoding for the Web: 'utf-8'.
The legacy single-byte encodings: 'ibm866', 'iso-8859-2', 'iso-8859-3', 'iso-8859-4', 'iso-8859-5', 'iso-8859-6', 'iso-8859-7', 'iso-8859-8', 'iso-8859-8i', 'iso-8859-10', 'iso-8859-13', 'iso-8859-14', 'iso-8859-15', 'iso-8859-16', 'koi8-r', 'koi8-u', 'macintosh', 'windows-874', 'windows-1250', 'windows-1251', 'windows-1252', 'windows-1253', 'windows-1254', 'windows-1255', 'windows-1256', 'windows-1257', 'windows-1258', or 'x-mac-cyrillic'.
The legacy multi-byte Chinese (simplified) encodings: 'gbk', 'gb18030'.
The legacy multi-byte Chinese (traditional) encoding: 'big5'.
The legacy multi-byte Japanese encodings: 'euc-jp', 'iso-2022-jp', 'shift-jis'.
The legacy multi-byte Korean encodings: 'euc-kr'.
The legacy miscellaneous encodings: 'utf-16be', 'utf-16le', 'x-user-defined'.
A special encoding, 'replacement'. This decodes empty input into empty output and any other arbitrary-length input into a single replacement character. It is used to prevent attacks that mismatch encodings between the client and server. The following encodings also map to the replacement encoding: ISO-2022-CN, ISO-2022-CN-ext, 'iso-2022-kr', and 'hz-gb-2312'.

@ThomazPom
Owner

ThomazPom commented Oct 26, 2024

I've dedicated significant effort to this issue and discovered a method to re-encode each of these encodings. Since browsers only support re-encoding to UTF-8, I created an algorithm that observes decoded values, learns from them, and maps characters to their byte values from the original data.

During re-encoding, the algorithm places the mapped byte values for each charset back into the data stream, ensuring accuracy for each character. You can try version 1.5.75.

Some charsets might still be broken, but this is a good proof of concept.
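
The idea of learning a character-to-bytes map from the original data can be sketched roughly as follows. This is a hypothetical illustration, not UltimaDark's implementation: the function names are invented, and it assumes a stateless charset in which each character decodes cleanly on its own (so it would not handle ISO-2022-JP escape sequences).

```javascript
// Hypothetical sketch of a "learned" re-encoder; not UltimaDark's implementation.

// Decode a byte slice strictly; return null if the bytes do not decode cleanly.
function decodeStrict(slice, charset) {
  try {
    const text = new TextDecoder(charset, { fatal: true }).decode(slice);
    return text.length > 0 && !text.includes('\uFFFD') ? text : null;
  } catch {
    return null; // incomplete or invalid byte sequence
  }
}

// Walk the original bytes and remember which bytes produced each character.
function learnByteMap(bytes, charset, maxBytesPerChar = 4) {
  const byteMap = new Map();
  let i = 0;
  while (i < bytes.length) {
    let consumed = 1; // skip one byte if nothing decodes at this position
    for (let len = 1; len <= maxBytesPerChar && i + len <= bytes.length; len++) {
      const slice = bytes.subarray(i, i + len);
      const text = decodeStrict(slice, charset);
      if (text !== null) {
        byteMap.set(text, Array.from(slice));
        consumed = len;
        break;
      }
    }
    i += consumed;
  }
  return byteMap;
}

// Re-encode text by replaying the learned bytes for each character.
function encodeWithMap(text, byteMap) {
  const out = [];
  for (const ch of text) {
    const learned = byteMap.get(ch);
    if (!learned) throw new Error(`No learned bytes for ${JSON.stringify(ch)}`);
    out.push(...learned);
  }
  return Uint8Array.from(out);
}

// Learn from Shift_JIS bytes for "東京A", then re-encode a reordered string.
const learnedMap = learnByteMap(Uint8Array.from([0x93, 0x8c, 0x8b, 0x9e, 0x41]), 'shift_jis');
console.log(encodeWithMap('A京東', learnedMap));
```

A map built this way can only re-encode characters that already appeared in the original data, which may be acceptable when the edited page mostly reuses its own text.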

@ThomazPom
Owner

ThomazPom commented Oct 27, 2024

Quoting @Vintagemotors: "jalan.net has a javascript error when trying to click one of the words/ phrases that it is suggesting you click to search and is also very slow to load. hyperhosting.gr/grdomains is blank (1.5.72)"

The new re-encoding algorithm correctly renders jalan.net . About hyperhosting.gr/grdomains, the issue was related to a templating quirk I’ve just fixed in version 1.5.76.

@Vintagemotors
Collaborator

1.5.71 to 1.5.76 (current) were dedicated to fixing non-UTF-8 websites. Fonts and CSS files are now loaded in their original encoding, eliminating the content keyword issue.

Does this mean we can consider non UTF-8 related encoding issues resolved?

@necaran
Author

necaran commented Nov 4, 2024

The CSS issues described in #87 (comment) are still present (https://www.4gamer.net/).
There are also glitches on https://www.jalan.net/yad348331/ where a custom font and escape codes are used to show icons.

.ji-activity::before{content:"\F001";color:#FFF}
  • With UltimaDark (as shown by devtools)
.ji-activity::before { content: "��"; color: rgb(255, 255, 255); }
[Screenshots: Original vs. UltimaDark]

p.s.
This is off topic, but I found that HTML4 font color is not showing correctly.
The variable names for [ud-html4-support] elements do not match: the element sets --ud-html4-color, but the rule reads --ud-fg--ud-html4-color.

<font face="cursive,serif" size="4" style="--ud-html4-color: #ffb84d;" ud-html4-support="true">The HTML font tag is now deprecated. You should use <a href="/css/properties/css_font.cfm" target="_blank">CSS font</a> to set font properties instead.</font>
[ud-html4-support] { color: var(--ud-fg--ud-html4-color,rgba(255,255,255,1)) !important; }

@ThomazPom
Owner

ThomazPom commented Nov 5, 2024

Hello @Vintagemotors, @chmichael, @necaran, and @SagXD,

Note: The following information applies specifically to non-UTF-8 pages. While these pages are less common, they still exist on the web.

@Vintagemotors, the UltimaDark embedded re-encoder relies on its built-in decoder's ability to learn. In its standard mode, this capability requires a configuration for each charset, which you can find here and which maps encoding byte counts to character code point ranges. If this map is misaligned, adjustments are needed, and we might notice diamond question marks or incorrect characters in the HTML output.

Example

The standard decoding mode typically takes 7 to 15 ms to process the heaviest pages.

Additionally, the UltimaDark decoder has a dynamic mode, which operates without needing specific configurations. However, it requires 20 to 30 ms for decoding the heaviest non-UTF-8 pages. I think it would be beneficial to enable this mode temporarily for your feedback. You can activate it until the browser restarts by pasting uDarkDecodeSimple=dynamicDecoderCharacterCounter into the UltimaDark Inspect console under about:debugging#/runtime/this-firefox.

As @necaran pointed out, decoding issues can sometimes stem from the less common CSS @charset rule, which lets a stylesheet declare a charset different from the page's. I have identified a technique to handle these cases, but it hasn't been implemented yet.

@necaran, thank you for catching that the HTML4 support mode broke! I'll fix it soon. Looks like a mistake.

@Vintagemotors, I've made significant advancements in darkening techniques, which required me to recode large and core sections of UltimaDark. The new structure aims to make the code more accessible for other developers, without compromising on quality. Due to the scale of these changes, there may be some minor, resolvable regressions on a few websites.

Thank you all for your continued support and feedback!

@Vintagemotors
Collaborator

Vintagemotors commented Nov 6, 2024

Your new dynamic decoder mode does not appear to correct the remaining question mark characters, at least in this case. I will, however, continue investigating how hard it would be to set up test site(s) to validate these alternate encodings more thoroughly than by relying on organically encountering issues. Also, the issue of UltimaDark tampering with the site cache, which Ctrl+F5 then fails to clear, is still present: it happened on jalan.net (the corruption remains even with the extension disabled, and after installing a different version and refreshing).

1st example
[Screenshots: Original (corrupted cache) | UltimaDark with and without dynamic decoder | Original]

Refreshing the console before or after sending uDarkDecodeSimple=dynamicDecoderCharacterCounter does not appear to influence these results at least, though, as necaran mentioned, these cases are probably CSS related.

2nd example

[Two screenshots]

@ThomazPom
Owner

ThomazPom commented Nov 6, 2024

I forgot to mention that clicking the refresh button of the UltimaDark console is mandatory before overriding the mode to dynamic, as the decoder could have been polluted by what the normal mode wrongly learned from its configurations.
This refresh button is also handy because it restores standard mode as the decoding method without restarting the browser, until the dynamic override is pasted again.

The issues can also arise from CSS using @charset, especially for floating text like here, since that is typical of the CSS content keyword. As said, these cases are not covered yet.
Typically, this happens if there is a @charset declaration before this:
.ji-activity::before{content:"\F001";color:#FFF} as @necaran noticed.

I think the dynamic mode works on https://www.4gamer.net/ wherever it does not use the CSS content keyword.

About testing a lot of URLs, my method is as follows:

1.	Decode (step 1)
2.	Re-encode
3.	Decode again (step 2)
4.	Compare the results from step 1 and step 2.

If the re-encoding is incorrect, the automatic comparison will detect it. I can also provide a test function if needed.
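
The four steps above could be sketched like this. It is only an illustration: `reencode` stands in for whatever re-encoder is under test, and its `(text, charset) => Uint8Array` signature is an assumption rather than UltimaDark's real interface.

```javascript
// Round-trip check: decode, re-encode, decode again, compare.
// `reencode` is a hypothetical callback: (text, charset) => Uint8Array.
function roundTripMatches(originalBytes, charset, reencode) {
  const first = new TextDecoder(charset).decode(originalBytes); // 1. decode
  const reencoded = reencode(first, charset);                   // 2. re-encode
  const second = new TextDecoder(charset).decode(reencoded);    // 3. decode again
  return first === second;                                      // 4. compare
}

// Shift_JIS bytes for 東京.
const sampleBytes = Uint8Array.from([0x93, 0x8c, 0x8b, 0x9e]);

// A re-encoder that returns the original bytes passes the check.
console.log(roundTripMatches(sampleBytes, 'shift_jis', () => sampleBytes)); // true

// A re-encoder that always emits UTF-8 fails it for a Shift_JIS page.
console.log(roundTripMatches(sampleBytes, 'shift_jis', (text) => new TextEncoder().encode(text))); // false
```

Note that this check cannot notice characters that were already lost in the first decode, since both decodes would then agree on the same damaged text.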

Also nex

Thank you!
