
interpret filenames as UTF-8 even without general purpose bit 11 to workaround Mac bug #84

Open
imcuttle opened this issue May 4, 2018 · 15 comments


@imcuttle

imcuttle commented May 4, 2018

File: 中文测试.zip

The zip file contains 中文测试.md. When I pass decodeStrings: true, the result is:
[image]

When I pass decodeStrings: false, the error 'The "path" argument must be of type string' is thrown.

@thejoshwolfe
Owner

thejoshwolfe commented May 6, 2018

The problem seems to be that the filenames are encoded in UTF8, but general purpose bit 11 is not set. The zipfile claims the filenames are encoded with CP437, and in that encoding, the filename you're seeing is the correct interpretation. The zip file is expecting zipfile readers (like yauzl) to interpret the filename as UTF8 without being instructed to do so.

In other words, yauzl is behaving correctly, and the zipfile is malformed.
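To make the two interpretations concrete, here is a minimal sketch of how the same bytes read under each encoding. The names `decodeCp437` and `decodeFileName` are made up for illustration and this is not yauzl's actual code; bytes below 0x80 are treated as plain ASCII for brevity.

```typescript
// Upper half (0x80-0xFF) of the CP437 table, 16 characters per row.
const CP437_HIGH =
  "ÇüéâäàåçêëèïîìÄÅ" +
  "ÉæÆôöòûùÿÖÜ¢£¥₧ƒ" +
  "áíóúñÑªº¿⌐¬½¼¡«»" +
  "░▒▓│┤╡╢╖╕╣║╗╝╜╛┐" +
  "└┴┬├─┼╞╟╚╔╩╦╠═╬╧" +
  "╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀" +
  "αßΓπΣσµτΦΘΩδ∞φε∩" +
  "≡±≥≤⌠⌡÷≈°∙·√ⁿ²■\u00A0";

// General purpose bit 11: "file name is UTF-8".
const UTF8_FLAG = 0x800;

// Simplified CP437 decoder (assumption: low bytes are plain ASCII).
function decodeCp437(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    out += b < 0x80 ? String.fromCharCode(b) : CP437_HIGH.charAt(b - 0x80);
  }
  return out;
}

// Pick the decoder the way the spec describes: UTF-8 only if bit 11 is set.
function decodeFileName(flags: number, bytes: Uint8Array): string {
  return (flags & UTF8_FLAG) !== 0
    ? new TextDecoder("utf-8").decode(bytes)
    : decodeCp437(bytes);
}

const raw = new TextEncoder().encode("中文测试.md");
console.log(decodeFileName(UTF8_FLAG, raw)); // 中文测试.md
console.log(decodeFileName(0, raw));         // garbled: UTF-8 bytes read as CP437
```

With bit 11 clear, the three UTF-8 bytes of 中 come out as three unrelated CP437 characters, which is the kind of garbling described in this issue.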

Do you know what program created this zip file?

@thejoshwolfe
Owner

Is it Archive Utility on Mac?

@rossj

rossj commented May 6, 2018

@imcuttle I have a need to handle similar not-so-standard .zip files in my application, and I wanted to share my heuristic solution.

If you only need to deal with this file and similar files that are always UTF-8 (even if they don't indicate this), you can use the decodeStrings: false option and convert the Buffers to strings yourself. The 'The "path" argument must be of type string' error is likely coming from code downstream that expects a string, so you probably need to do the Buffer -> string conversion before that point.
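A minimal sketch of doing that conversion by hand, assuming the archive was opened with decodeStrings: false so that entry.fileName arrives as a Buffer. The helper name `fileNameToString` is hypothetical, and it simply assumes every name is UTF-8.

```typescript
// Hypothetical helper: turn a raw file name into a string, assuming UTF-8.
// A Node Buffer is a Uint8Array subclass, so it can be passed in directly.
function fileNameToString(fileName: Uint8Array | string): string {
  if (typeof fileName === "string") return fileName; // already decoded
  return new TextDecoder("utf-8").decode(fileName);
}

// Usage inside a yauzl "entry" handler (not run here):
//   zipfile.on("entry", (entry) => {
//     const name = fileNameToString(entry.fileName);
//     // pass `name`, not the Buffer, on to path-handling code
//   });

const bytes = new TextEncoder().encode("中文测试.md");
console.log(fileNameToString(bytes)); // 中文测试.md
```

Since the name is no longer decoded by the library, it seems prudent to check the resulting string (absolute paths, `..` segments) before using it as a path.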

In my case, it is a bit more complicated, as I need to simultaneously handle zip files that are UTF-8 (with and without the proper bit being set), as well as files that are CP437 encoded. My solution is to use decodeStrings: false, collect all of the ZipEntries and fileName Buffers, and then to inspect these name Buffers to try and guess the proper encoding.

Specifically, I use the code in this gist to get some information on the name Buffers, followed by this logic:

const aggs = checkStringBufs(entries.map(entry => entry.fileName as Buffer));

let encoding: string;
if (aggs.allAsciiChar) {
    // utf8 is backwards compatible with ascii
    encoding = 'utf8';
} else if (aggs.all7Bit) {
    // Hmmm, no high bits but some control chars, probably cp437
    encoding = 'cp437';
} else if (aggs.validUtf8) {
    // Some high bits set, but seems to be UTF-8
    encoding = 'utf8';
} else {
    // Some high bits set, but not UTF-8!
    encoding = 'cp437';
}

This has been working well for the .zip files that I deal with.
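The gist itself is not reproduced in this thread. As a rough, hypothetical sketch of what a `checkStringBufs` helper with those three fields could look like (the field meanings are inferred from the comments in the snippet above, not taken from the gist):

```typescript
interface NameStats {
  allAsciiChar: boolean; // every byte is printable ASCII (0x20-0x7E)
  all7Bit: boolean;      // no byte has the high bit set
  validUtf8: boolean;    // every name decodes as strict UTF-8
}

// Strict UTF-8 check: a fatal TextDecoder throws on malformed input.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// Aggregate the three properties over all file name buffers.
function checkStringBufs(bufs: Uint8Array[]): NameStats {
  const stats: NameStats = { allAsciiChar: true, all7Bit: true, validUtf8: true };
  for (let n = 0; n < bufs.length; n++) {
    const buf = bufs[n];
    for (let i = 0; i < buf.length; i++) {
      const b = buf[i];
      if (b >= 0x80) stats.all7Bit = false;
      if (b < 0x20 || b > 0x7e) stats.allAsciiChar = false;
    }
    if (!isValidUtf8(buf)) stats.validUtf8 = false;
  }
  return stats;
}
```

Aggregating over every entry rather than deciding per name gives the heuristic more evidence: a single name that is not valid UTF-8 rules out UTF-8 for the whole archive.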

@imcuttle
Author

imcuttle commented May 7, 2018

@thejoshwolfe

> Is it Archive Utility on Mac?

Yep, the zip file was created by the Mac system. It's puzzling that the zip file is malformed.

[image]

@thejoshwolfe
Owner

You'd think that Apple would be better at writing software, but their Archive Utility really sucks at zip files. I've been working around bugs in that software for years.

If this issue is as simple as "Archive Utility always forgets to set bit 11", then maybe yauzl should have better support for this situation. I'll think about this.

@linYeeTracy

I've run into this problem too. Has it been resolved?

@thejoshwolfe
Owner

Sorry, I haven't been working on this project lately. I'll revisit this issue next week.

@imcuttle
Author

I guess yauzl isn't at fault.

When I passed the option decodeStrings: false, the file names in the zip file created by macOS were received correctly.
See the PR.

@thejoshwolfe
Owner

> When I passed the option decodeStrings: false, the file names in the zip file created by macOS were received correctly.
> See the PR.

I believe the toString() call is using "utf8" encoding by default, which is the encoding intended by the zip file creator (Mac Archive Utility). On principle this isn't necessarily safe or correct, but in practice it's probably fine.

@thejoshwolfe
Owner

I did some research into Info-ZIP's charset detection code, and in the absence of General Purpose Bit 11, Info-ZIP uses a different charset depending on the operating system. It will only use CP437 as required by the spec on some platforms, presumably DOS. However, on Linux and Mac, Info-ZIP will simply always use UTF-8 for decoding file paths, because UTF-8 is the "native" charset on those platforms, whatever that means. This suggests it's safe for yauzl to drop support for CP437 and just use UTF-8 in all situations as well. 🤔
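As a simplified model of the behavior described above (this is not Info-ZIP's code; the function name and the platform strings, borrowed from Node's `process.platform` values, are assumptions for illustration):

```typescript
type Charset = "utf8" | "cp437";

// General purpose bit 11: "file name is UTF-8".
const UTF8_BIT = 0x800;

// Sketch: honor bit 11 when set, otherwise fall back to the platform's
// "native" charset, as described for Linux and Mac above.
function pickCharset(flags: number, platform: string): Charset {
  if ((flags & UTF8_BIT) !== 0) return "utf8";
  return platform === "linux" || platform === "darwin" ? "utf8" : "cp437";
}

console.log(pickCharset(0, "darwin")); // utf8
console.log(pickCharset(0, "win32"));  // cp437
```

The consequence is that the same archive can list different file names depending on where it is opened, which is the tradeoff under discussion here.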

@wizardpisces

The PR was rejected! I have no clue how to better deal with this issue. Any progress?

@avallete

Any news on this issue? The problem still exists with OSX archives.

@Musicminion

Musicminion commented Jun 22, 2023

Recently I ran into this issue as well, mainly because Overleaf has long used yauzl to handle zip archives. For cases like this, I believe the current workarounds are:

  • Upload your project files to GitHub and download them as a zip archive (joking, but this really is how I found that yauzl works fine again in that case!)
  • Use the Keka archiving tool. Official site: https://www.keka.io/zh-cn/
  • Use my personal zip-compression website: https://musicminion.github.io/musicminion-tool/

Considering that the author no longer seems to be actively maintaining the project, this long-standing issue can only be addressed by future developers. MacOS does produce zip files that are not quite standard, so for now we have to go with the author's statement that this is not a problem with yauzl but rather an issue with non-standard zip files.

@thejoshwolfe changed the title from "Chinese filename decode" to "interpret filenames as UTF-8 even without general purpose bit 11 to workaround Mac bug" on Mar 7, 2024
@thejoshwolfe
Owner

Hello everyone. Sorry for the long silence. Whenever Mac Archive Utility has a bug I'm supposed to work around, it's really demotivating. I suggest everyone who is interested in this issue getting addressed please file a bug report against Apple's Archive Utility to "Set General Purpose Bit 11 to indicate UTF-8 encoded file names". If you can't figure out how to file a bug report against Archive Utility, then you now understand some of my demotivation for working around their broken trashware.

If yauzl were to offer a workaround for this, it would mean that the authors of code calling into yauzl would need to know whether a zipfile was created by Apple or whether it was a conformant zipfile. Given that I, the human author of yauzl, while using a zip file analyzer tool that I made, still cannot reliably tell whether any given zip file was created by Archive Utility vs something else, I don't think it makes any sense for yauzl to offer a configuration option that requires this determination be made by programmers who are relying on yauzl to handle the quirks of the zip file format so they don't have to think about it.

The ideal support for working around this issue would be to have everything "just work" all the time, which means changing the default interpretation from CP437 to UTF-8, and then seeing how that breaks everyone's zip files. Presumably, this better supports OSX, and worse supports DOS. Seems like a reasonable tradeoff, but I'm a stickler for following the spec (it's literally the number 1 design principle of yauzl), and allowing a megacorporate bully to shape de facto standards through incompetence feels really bad.

Good feelings are literally the funding keeping this volunteer project going, and currently this issue, issue #84, has no available funding behind it. Again, please file a bug report against Archive Utility if you can. And if you figure out how to do that, please do report back here. It would be very encouraging to find out that Apple is willing to listen to reports of the damage they are causing in the software world.

@vyv03354

vyv03354 commented Jan 8, 2025

> The ideal support for working around this issue would be to have everything "just work" all the time, which means changing the default interpretation from CP437 to UTF-8, and then seeing how that breaks everyone's zip files. Presumably, this better supports OSX, and worse supports DOS. Seems like a reasonable tradeoff, but I'm a stickler for following the spec (it's literally the number 1 design principle of yauzl), and allowing a megacorporate bully to shape de facto standards through incompetence feels really bad.

Please don't. It will break gazillions of zip filenames from Windows. Windows' built-in "Compressed zip folder" still uses the OEM code page (437 on en_US, but it is not even 437 on some other locales). Windows had no GUI to create a zip file whose filename encoding is UTF-8 until very recently. Windows 11 24H2 finally switched to libarchive and started using UTF-8 filenames for new zip archives. But if you add a file to an existing zip archive, the older zip folder code will run and the OEM code page will be used. Most Windows software (including libarchive) follows the older zip folder's behavior (i.e. assumes the OEM code page if EFS is not set).

> The problem seems to be that the filenames are encoded in UTF8, but general purpose bit 11 is not set. The zipfile claims the filenames are encoded with CP437, and in that encoding, the filename you're seeing is the correct interpretation. The zip file is expecting zipfile readers (like yauzl) to interpret the filename as UTF8 without being instructed to do so.

Although the latest zip spec says that the filename encoding is code page 437 if EFS is not set, many apps do not obey that, because the older specs (before 6.3.0) said nothing about the filename encoding and apps are required to be backward-compatible. If a user updated an app and filenames in existing zip archives suddenly broke, the user would very likely blame the app developer. So unfortunately you can't assume that the filename encoding is code page 437 if EFS is not set, especially when the "version made by" field is less than 63.
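One way to read that rule as code, offered only as a sketch: the function name and the "unknown" outcome are illustrative assumptions, and the lower byte of "version made by" is taken as the spec version (63 meaning 6.3).

```typescript
type NameEncoding = "utf8" | "cp437" | "unknown";

// EFS, the language encoding flag, is general purpose bit 11.
const EFS_FLAG = 0x800;

// Sketch: what the archive itself tells us about the file name encoding.
function declaredNameEncoding(flags: number, versionMadeBy: number): NameEncoding {
  if ((flags & EFS_FLAG) !== 0) return "utf8";
  // Lower byte: zip spec version times 10; upper byte: host system.
  const specVersion = versionMadeBy & 0xff;
  // Specs before 6.3.0 said nothing about the encoding, so CP437 is only
  // a defensible assumption for writers claiming 6.3 or later.
  return specVersion >= 63 ? "cp437" : "unknown";
}

console.log(declaredNameEncoding(0x800, 20)); // utf8
console.log(declaredNameEncoding(0, 20));     // unknown
console.log(declaredNameEncoding(0, 63));     // cp437
```

For the "unknown" case a reader would still need some fallback, such as a locale's OEM code page or a byte-level heuristic like the one posted earlier in this thread.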

For clarity, I'm not defending Apple. Rather, Windows users (especially on CJK locales) suffer from broken filenames from Mac because Archive Utility uses UTF-8 without setting EFS and Windows apps assume that the filename encoding is the OEM code page if EFS is not set. You're absolutely right that we should file a bug against Apple.
