Text artifacts on sites not using UTF-8 (#7) #87
I just found that the broken/incomplete content might be a different issue.
Thank you for taking the time to test version 1.5.54, which has been released but not yet published. You identified the issue with hub.docker.com that was introduced in this version before it could be published.
Broken Text (Greek)
This bug is one of the oldest in UltimaDark, but I've recently made a promising breakthrough in understanding its root cause. All affected sites share a common factor: their pages are being edited by JavaScript, but those JavaScript files and the pages themselves are encoded in a charset other than UTF-8. To modify the page, UltimaDark decodes it from the detected source charset and re-encodes it to UTF-8, the only charset TextEncoder supports. The problem arises when the page, now assumed to be entirely UTF-8, causes included scripts to be decoded as UTF-8 as well, leading to encoding issues. I think there are two valid solutions here:
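To make the mismatch concrete, here is a minimal sketch (illustrative only, not UltimaDark's actual code) of what happens when a Shift_JIS page is decoded, re-encoded to UTF-8, and its still-Shift_JIS scripts are then read as UTF-8:

```javascript
// "日本" encoded in Shift_JIS -- the page's (and its scripts') real charset.
const sjisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b]);

// 1. Decode the page from the detected source charset.
const text = new TextDecoder('shift_jis').decode(sjisBytes);

// 2. Re-encode after modification: TextEncoder can only emit UTF-8.
const utf8Bytes = new TextEncoder().encode(text);

// 3. An external script still encoded in Shift_JIS is now mis-decoded,
//    because the browser assumes UTF-8 for everything on the page.
const garbled = new TextDecoder('utf-8', { fatal: false }).decode(sjisBytes);
console.log(text);    // "日本"
console.log(garbled); // replacement characters instead of "日本"
```

The invalid UTF-8 byte sequences turn into U+FFFD replacement characters, which is exactly the kind of text artifact reported here.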
I think the 2nd is the fastest?
That's true, but the charset attribute is deprecated. Using it would require searching through pages with incorrect base encodings and applying this deprecated attribute to their scripts. Additionally, we would need to monitor the page for the dozen different methods a web developer could use to add a script, such as createElement, cloneNode, and insertAfter. UltimaDark already has similar mechanisms in place, but it's always a pain to cover them all. Solution 1, implemented in version 1.5.70, is triggered only when encoding mismatches are detected, significantly reducing the need for re-encoding. The re-encoding process itself is not resource-intensive enough to be noticeable in terms of performance. Take a look at
Well, I just tested 1.5.70 and the fix didn't work.
The same result would have occurred with either solution.
I have not dug into the code, but uBlock Origin (scriptlet injection, resource redirects) and userscript loaders such as Violentmonkey inject scripts without messing up the charset.
1.5.71: Let me know if the result is acceptable or completely off 😂😂😂.
jalan.net throws a JavaScript error when you click one of the words/phrases it suggests you click to search, and it is also very slow to load. hyperhosting.gr/grdomains is blank (1.5.72).
Found some pages that are still garbled as of 1.5.72:
P.S. There is a UTF-8 page, http://charset.7jp.net/jis0212.html, on the same site, and it is rendered correctly.
I've identified and resolved some bugs related to the new parsing method, which is now available in version 1.5.73. While this update addresses templating issues for the following sites: it does not resolve the charset issues, which will require more in-depth investigation.
Release v1.5.74, available now, fixes more templating issues and more charset issues.
Thank you for the hard work! Now they are readable.
<script type="text/template" id="jsi-template-feed-banner">
<a href="<%= link %>" class="dp_long_bnr3">
<img src="<%= images[0] %>" alt="">
</a>
<a href="<%= link %>" class="dp_long_bnr2">
<img src="<%= images[1] %>" alt="">
</a>
</script>
@charset "EUC-JP";
...
.tweet_ranking_container:before,
.tweet_ranking_container.normal:before{
width:160px;
margin-left:-80px;
content:"記事 X Postランキング";
}
We will have to find a way to re-encode properly in all TextDecoder-supported charsets, which can be any of the values listed at https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder/encoding, since the workaround does not fully work as expected. That's a lot of work.
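As a first step, TextDecoder itself can normalize the many legacy labels from that list to canonical encoding names, which gives a cheap way to enumerate what actually has to be handled. A small sketch (illustrative, not part of UltimaDark):

```javascript
// Map any charset label to the canonical name TextDecoder resolves it to,
// or null if the label is not recognized at all.
function canonicalEncoding(label) {
  try {
    return new TextDecoder(label).encoding; // canonical lowercase name
  } catch {
    return null; // RangeError: unknown label
  }
}

console.log(canonicalEncoding('Shift_JIS'));     // "shift_jis"
console.log(canonicalEncoding('latin1'));        // "windows-1252"
console.log(canonicalEncoding('not-a-charset')); // null
```

Grouping labels by their canonical name shrinks the problem from dozens of labels down to the much smaller set of distinct decoders.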
I've dedicated significant effort to this issue and discovered methods to re-encode each of these encodings. Since browsers only support re-encoding to UTF-8, I created an algorithm that observes decoded values, learns from them, and maps characters to their byte values from the original data. During re-encoding, the algorithm places the mapped byte values for each charset back into the data stream, ensuring accuracy for each character. You can try version 1.5.75. Some charsets might be broken, but this is a good proof of concept.
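The learning idea described above can be sketched roughly as follows. This is a simplification under my own assumptions, not UltimaDark's actual implementation: it feeds the original bytes through a streaming TextDecoder to learn which byte run produced each character, then replays those runs to rebuild the original charset from the modified string.

```javascript
// Learn a character -> original-byte-run map by streaming bytes one at a
// time through a TextDecoder for the source charset.
function learnByteMap(bytes, charset) {
  const dec = new TextDecoder(charset);
  const map = new Map();
  let run = [];
  for (const b of bytes) {
    run.push(b);
    const out = dec.decode(new Uint8Array([b]), { stream: true });
    if (out) { // the accumulated run completed a character
      map.set(out, Uint8Array.from(run));
      run = [];
    }
  }
  return map;
}

// Re-encode a string back to the source charset using the learned runs.
function reencode(text, map) {
  const chunks = [];
  for (const ch of text) {
    const bytes = map.get(ch);
    if (bytes) chunks.push(...bytes); // known character: emit learned bytes
    // A real implementation would need a fallback for characters that
    // never appeared in the original data.
  }
  return Uint8Array.from(chunks);
}

const original = new Uint8Array([0x93, 0xfa, 0x96, 0x7b]); // "日本" in Shift_JIS
const map = learnByteMap(original, 'shift_jis');
const text = new TextDecoder('shift_jis').decode(original);
const roundTrip = reencode(text, map); // equals the original Shift_JIS bytes
```

The key point is that no per-charset encoding table is needed: the original bytes themselves teach the re-encoder the byte sequence for every character it will later have to emit.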
The new re-encoding algorithm correctly renders jalan.net. About hyperhosting.gr/grdomains, the issue was related to a templating quirk I've just fixed in version 1.5.76.
Does this mean we can consider the non-UTF-8-related encoding issues resolved?
CSS issues as addressed in #87 (comment) are still present (https://www.4gamer.net/). The source rule
.ji-activity::before { content: "\F001"; color: #FFF }
ends up computed as
.ji-activity::before { content: "��"; color: rgb(255, 255, 255); }
P.S.
<font face="cursive,serif" size="4" style="--ud-html4-color: #ffb84d;" ud-html4-support="true">The HTML font tag is now deprecated. You should use <a href="/css/properties/css_font.cfm" target="_blank">CSS font</a> to set font properties instead.</font>
[ud-html4-support] { color: var(--ud-fg--ud-html4-color,rgba(255,255,255,1)) !important; }
Hello @Vintagemotors, @chmichael, @necaran, and @SagXD,

Note: the following applies specifically to non-UTF-8 pages. While these pages are less common, they still exist on the web.

@Vintagemotors, the UltimaDark embedded re-encoder relies on its built-in decoder's ability to learn. In its standard mode, this capability requires a configuration for each charset, which you can find here; it maps encoding byte counts to character codepoint ranges. If there is a misalignment in this map, adjustments are needed, and we might notice diamond question marks or incorrect characters in the HTML output. The standard decoding mode typically takes 7 to 15 ms to process the heaviest pages. Additionally, the UltimaDark decoder has a dynamic mode, which operates without needing specific configurations but requires 20 to 30 ms for the heaviest non-UTF-8 pages. I think it would be beneficial to enable this mode temporarily for your feedback. You can activate it until the browser restarts by pasting

As @necaran pointed out, decoding issues can sometimes stem from the less common

@necaran, thank you for catching that the HTML4 support mode broke! Looks like a mistake; I'll fix it soon.

@Vintagemotors, I've made significant advancements in darkening techniques, which required me to recode large, core sections of UltimaDark. The new structure aims to make the code more accessible to other developers without compromising on quality. Due to the scale of these changes, there may be some minor, resolvable regressions on a few websites.

Thank you all for your continued support and feedback!
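For readers unfamiliar with the idea, a byte-count-to-codepoint-range configuration of the kind described above might look roughly like this. The shape, names, and ranges here are entirely hypothetical, chosen only to illustrate the concept; they are not UltimaDark's actual configuration:

```javascript
// Hypothetical per-charset hints: for each byte length a sequence can
// have in this charset, which Unicode codepoint ranges it may decode to.
// A mismatch between learned characters and these ranges would surface
// as diamond question marks or wrong characters in the output.
const charsetHints = {
  shift_jis: [
    { bytes: 1, ranges: [[0x0000, 0x007f], [0xff61, 0xff9f]] }, // ASCII, halfwidth kana
    { bytes: 2, ranges: [[0x3000, 0x9fff], [0xff01, 0xff5e]] }, // kana, kanji, fullwidth
  ],
  'euc-jp': [
    { bytes: 1, ranges: [[0x0000, 0x007f]] },                   // ASCII
    { bytes: 2, ranges: [[0x3000, 0x9fff]] },                   // JIS X 0208
    { bytes: 3, ranges: [[0x0080, 0xffff]] },                   // JIS X 0212 plane
  ],
};
```

The dynamic mode, as described, would skip such a table and infer everything from the observed bytes, at the cost of the extra 20 to 30 ms.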
Your new dynamic decoder mode does not appear to correct the remaining question-mark characters, at least in this case. I will, however, continue investigating how hard it would be to implement test site(s) to validate these alternate encodings more thoroughly than by relying on organically encountering issues. Also, the issue with UltimaDark tampering with the site cache, which then fails to be cleared by Ctrl+F5, is still present, as it happened on jalan.net (it remains even with the extension disabled and after installing a different version and refreshing). Refreshing the console before or after sending
I forgot to mention that clicking the refresh button in the UltimaDark console is mandatory before overriding the mode to dynamic, as it could have been polluted by the normal mode's bad learning from the configurations. The issues can also arise from CSS using @charset, especially for floating text like here, since it's typical of the

I think the dynamic mode works on https://www.4gamer.net/ where it does not use CSS. About testing a lot of URLs, my method is as follows:
If the re-encoding is incorrect, the automatic comparison will detect it. I can also provide a test function if needed. Thank you!
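An automatic comparison of that kind could be sketched as a simple round-trip check. This is an assumed shape, not the author's actual test function: it decodes both the original bytes and the re-encoded bytes with the same charset and flags any difference.

```javascript
// Return true when re-encoded bytes decode to the same text as the
// original bytes; false means the re-encoding corrupted something.
function checkRoundTrip(originalBytes, charset, reencodedBytes) {
  const dec = new TextDecoder(charset);
  const expected = dec.decode(originalBytes);
  const actual = dec.decode(reencodedBytes);
  return expected === actual;
}

const bytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b]); // "日本" in Shift_JIS
console.log(checkRoundTrip(bytes, 'shift_jis', bytes));                    // true
console.log(checkRoundTrip(bytes, 'shift_jis', new Uint8Array([0x3f])));   // false
```

Run against a list of URLs per charset, a check like this would catch regressions without waiting to stumble on broken sites organically.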
When the encoding is not UTF-8, the page goes wrong.
It not only breaks the text but sometimes also the page content.
Website: https://www.jalan.net
Encoding: Shift_JIS
Full screenshots