zimdump fails on long URLs #213

Open
rgaudin opened this issue Jan 13, 2021 · 8 comments

@rgaudin
Member

rgaudin commented Jan 13, 2021

Here's the tail of the output of zimdump dump --dir /data/mqc somezim.zim:

Error writing file to errors dir. /data/mqc/_exceptions/H%2fs0.wp.com%2f_static%2f??-eJylUu1uwyAMfKERr9VaLT+mPQsEj9HwJTCL8vZzEk3NGi2KtD+Iw3fmfABDEl0MhIHAV5FcNTYUGFIXvSjeOhwfUNOV8gQrmXLR3IUxa6kLGBeVdMe4NHBp5Lr4Om8UK1PO9ljghpRk14sZbeg%2fXFMZKsyGKxmhba7NGVS1Tk8eZrnKMo9QaHT4%2fzb0if5Im1m1GkKOsZIw2erDTh5aZEk2mPKHfBXfNAGf+yRp~3
Exception: Error writing file to errors dir. /data/mqc/_exceptions/H%2fs0.wp.com%2f_static%2f??-eJylUu1uwyAMfKERr9VaLT+mPQsEj9HwJTCL8vZzEk3NGi2KtD+Iw3fmfABDEl0MhIHAV5FcNTYUGFIXvSjeOhwfUNOV8gQrmXLR3IUxa6kLGBeVdMe4NHBp5Lr4Om8UK1PO9ljghpRk14sZbeg%2fXFMZKsyGKxmhba7NGVS1Tk8eZrnKMo9QaHT4%2fzb0if5Im1m1GkKOsZIw2erDTh5aZEk2mPKHfBXfNAGf+yRp~3

Touching that file fails as well:

touch: /data/mqc/_exceptions/H%2fs0.wp.com%2f_static%2f??-eJylUu1uwyAMfKERr9VaLT+mPQsEj9HwJTCL8vZzEk3NGi2KtD+Iw3fmfABDEl0MhIHAV5FcNTYUGFIXvSjeOhwfUNOV8gQrmXLR3IUxa6kLGBeVdMe4NHBp5Lr4Om8UK1PO9ljghpRk14sZbeg%2fXFMZKsyGKxmhba7NGVS1Tk8eZrnKMo9QaHT4%2fzb0if5Im1m1GkKOsZIw2erDTh5aZEk2mPKHfBXfNAGf+yRp~3: File name too long
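For illustration, "File name too long" usually means that a single path component exceeds the filesystem's per-name limit. The sketch below assumes a 255-byte limit (common, but not universal) and uses a hypothetical helper, `overlong_components`, to find the components that would trigger the error. It is not part of zimdump.

```python
# Assumed limit: many filesystems cap one path component at 255 bytes.
NAME_MAX_BYTES = 255


def overlong_components(path, limit=NAME_MAX_BYTES):
    """Return the components of `path` whose UTF-8 encoding exceeds `limit` bytes."""
    return [
        part
        for part in path.split("/")
        if len(part.encode("utf-8")) > limit
    ]


if __name__ == "__main__":
    # Synthetic example: one component of 300 bytes under a short directory.
    print(overlong_components("/data/mqc/_exceptions/" + "a" * 300))
```

Note that the limit applies per component, so a URL whose slashes were escaped (as with `%2f` above) ends up as one very long file name.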

Very long URLs seem like a common use case, and I believe this calls for a design change in the way they are written to disk.

Might be related to #190

Note: this is zimdump 2.1.0

@kelson42
Contributor

kelson42 commented Mar 2, 2021

I propose to:

@mgautierfr
Collaborator

mgautierfr commented Mar 2, 2021

truncate exception files/directory if they are too long for the filesystem

Care must be taken when truncating directories.
We may have 3 entries:

  • "long_directory_xxxxxxxxxxyyyyyy/foo.html"
  • "long_directory_xxxxxxxxxxyyyyyy/bar.html"
  • "long_directory_xxxxxxxxxxzzzzzz/foo.html"

The first two entries share a directory, which must always be truncated to the same "short name" ("long_directory~1"), while the directory of the third entry must get a different one ("long_directory~2").
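A minimal sketch of that rule, assuming a hypothetical `PathShortener` class (not zimdump code): each original component is remembered together with its short name, so the same long directory always maps to the same result, and a numeric `~N` suffix is bumped until the short name is unused in its parent. The limit of 16 bytes in the example is artificial, chosen only to reproduce the names above.

```python
class PathShortener:
    """Truncate overlong path components consistently (illustrative only)."""

    def __init__(self, limit=255):
        self.limit = limit   # assumed per-component byte limit
        self.assigned = {}   # (short parent, original name) -> short name
        self.used = {}       # short parent -> set of names already taken

    def _shorten(self, parent, name):
        key = (parent, name)
        if key in self.assigned:
            return self.assigned[key]
        used = self.used.setdefault(parent, set())
        raw = name.encode("utf-8")
        candidate = name
        counter = 0
        # Truncate when too long, or pick another suffix on a collision.
        while len(candidate.encode("utf-8")) > self.limit or candidate in used:
            counter += 1
            suffix = f"~{counter}"
            # Cut on bytes; "ignore" drops a multi-byte character split by the cut.
            prefix = raw[: self.limit - len(suffix)].decode("utf-8", "ignore")
            candidate = prefix + suffix
        used.add(candidate)
        self.assigned[key] = candidate
        return candidate

    def shorten(self, path):
        out = []
        for part in path.split("/"):
            out.append(self._shorten("/".join(out), part))
        return "/".join(out)


if __name__ == "__main__":
    shortener = PathShortener(limit=16)
    for entry in (
        "long_directory_xxxxxxxxxxyyyyyy/foo.html",
        "long_directory_xxxxxxxxxxyyyyyy/bar.html",
        "long_directory_xxxxxxxxxxzzzzzz/foo.html",
    ):
        print(shortener.shorten(entry))
```

One design caveat: the number assigned to a directory depends on the order in which entries are processed, so two dumps only agree if they iterate the entries in the same order.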

@kelson42 kelson42 added this to the 3.2.0 milestone Sep 25, 2022
@adamlamar

Is there an example zim file available that has this problem?

@kelson42 kelson42 modified the milestones: 3.2.0, 3.3.0 Mar 22, 2023
@kelson42 kelson42 changed the title from "dump fails on long URLs" to "zimdump fails on long URLs" Aug 13, 2023
@kelson42 kelson42 modified the milestones: 3.3.0, 3.4.0 Sep 26, 2023
@2600box

2600box commented Oct 11, 2023

Is there an example zim file available that has this problem?

Please go ahead with this one that is 3GB: https://www.transfernow.net/dl/20231009UhHnE3Sy

@benoit74
Contributor

benoit74 commented Apr 4, 2024

At the same time, we should probably consider ignoring or replacing all characters that are not allowed, or are interpreted differently, on the target filesystem. This is causing many files not to be dumped.
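As a rough sketch of the "replace" option, the hypothetical helper below percent-encodes characters from an assumed forbidden set (the characters Windows reserves in file names, plus control characters). The set itself is an assumption and would need to be adjusted per target filesystem. `%` is escaped too so that the mapping stays reversible.

```python
# Assumed set of problematic characters; adjust per target filesystem.
FORBIDDEN = set('<>:"/\\|?*')


def sanitize_component(name):
    """Percent-encode characters that are unsafe in a single file name."""
    out = []
    for ch in name:
        if ch in FORBIDDEN or ch == "%" or ord(ch) < 32:
            # Encode each UTF-8 byte of the character as %xx.
            out.extend(f"%{byte:02x}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(sanitize_component("s0.wp.com/_static/??-abc"))
```

Encoding makes names longer, so a step like this would have to run before any length check or truncation.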

@benoit74
Contributor

benoit74 commented Apr 4, 2024

Building a very small ZIM with many "strange" ZIM paths is probably the way to go, and is quite easy to do with python-libzim or python-scraperlib. This would make testing the change on many filesystems much easier.
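As an illustration of what such a test set might contain, here is a hypothetical list of "strange" paths. The specific cases are assumptions about what tends to break on common filesystems, and actually writing them into a ZIM (with python-libzim or python-scraperlib) is not shown here.

```python
def strange_paths():
    """Return illustrative ZIM paths that are awkward to write to disk."""
    return [
        "x" * 300 + ".html",                          # overlong file name
        "long_directory_" + "y" * 300 + "/foo.html",  # overlong directory name
        "static/??-eJyl" + "A" * 300,                 # long query-like name
        'quotes"and<angle>brackets.html',             # reserved on some systems
        "colon:and|pipe*star.html",
        "back\\slash.html",
        "trailing-dot./index.html",
        "trailing-space /index.html",
        "CaseCollision.html",                         # clashes with the next entry
        "casecollision.html",                         # on case-insensitive systems
        "unicode/" + "é" * 200 + ".html",             # 205 characters, 405 bytes
    ]


if __name__ == "__main__":
    for path in strange_paths():
        print(len(path.encode("utf-8")), path[:60])
```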

@rgaudin
Member Author

rgaudin commented Apr 4, 2024

Indeed, but I'd like to mention that filesystem limitations are all properly documented. The solution should be designed with those limitations in mind, as testing on various filesystems is cumbersome.
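On POSIX systems the per-name limit can also be queried for the target directory rather than hard-coded. The sketch below uses Python's `os.pathconf` with `PC_NAME_MAX`; the helper name `name_max` and the fallback of 255 are assumptions for illustration.

```python
import os

# Assumed default for platforms where the limit cannot be queried.
FALLBACK_NAME_MAX = 255


def name_max(directory="."):
    """Return the maximum file name length for `directory`, if known."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        # os.pathconf is unavailable (e.g. Windows) or the query failed.
        return FALLBACK_NAME_MAX


if __name__ == "__main__":
    print(name_max("."))
```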

@kelson42 kelson42 modified the milestones: 3.4.1, 3.5.0 May 16, 2024
@kelson42 kelson42 modified the milestones: 3.4.2, 3.5.0 Jul 8, 2024
@kelson42 kelson42 modified the milestones: 3.5.0, 3.6.0 Aug 29, 2024
@ThisIsFineTM

One option to work around this would be to hash the URLs, write the hash as the output filename, and maintain a mapping of hashes to URLs in something like a SQLite database or a text file. This would have the added benefit of abstracting out filesystem edge cases, but would add an extra layer of indirection in the code.
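A minimal sketch of this idea, assuming SHA-256 as the hash and a single SQLite table. The function names, table name, and schema are hypothetical and only illustrate the indirection.

```python
import hashlib
import sqlite3


def open_mapping(db_path=":memory:"):
    """Open (or create) the hash -> URL mapping database."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "hash TEXT PRIMARY KEY, url TEXT NOT NULL)"
    )
    return conn


def hashed_name(conn, url):
    """Record `url` and return the fixed-length name to use on disk."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO entries (hash, url) VALUES (?, ?)",
        (digest, url),
    )
    conn.commit()
    return digest


def lookup(conn, digest):
    """Return the original URL for a hashed name, or None if unknown."""
    row = conn.execute(
        "SELECT url FROM entries WHERE hash = ?", (digest,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_mapping()
    name = hashed_name(conn, "s0.wp.com/_static/??-" + "A" * 300)
    print(name, len(name))
```

Every output name is then 64 hexadecimal characters regardless of URL length, at the cost of a dump that is no longer browsable by file name alone.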
