File Handling for Non-English Alphabets #1029

Open
elbre opened this issue Dec 4, 2023 · 7 comments

@elbre
elbre commented Dec 4, 2023

Hello,

I would like to bring attention to an issue I encountered while working with Droid, specifically when dealing with files containing characters from alphabets other than English.

Initially, we suspected that the problem might be related to using *.zip files. However, after further investigation, we observed similar issues when generating *.7z files and attempting different export methods.

To assist in resolving this matter, I am attaching the original files, the export, and a screenshot of the application to provide a comprehensive overview.

I am curious to know if there are plans to address these issues in the near future or if there is already a known solution?

czechfilesfail

aaa.txt
Jindřich Šťovíček.zip

@sparkhi
Collaborator

sparkhi commented Dec 4, 2023

Hi @elbre
Thanks for raising the issue and attaching the supporting documents.
Unfortunately, there is no known solution for it as of now. I'll investigate further and post an update.

Regards,

@ross-spencer

Just to note, a workaround might be to unzip the contents here:

image

Which might point to there being something in the archive handling process in general causing the issue. I'm not sure. It's actually the same pattern in Siegfried too cc. @richardlehane.

e.g.

Without extracting:

---
filename : 'Jindrich.Stovicek.zip#Jind²ich µ£ovíƒek/ⁿíƒansk∞ k²iτ£ál/ⁿí'
filesize : 0
modified : 2023-11-29T21:59:26Z
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    class   : 
    basis   : 
    warning : 'no match'
---
filename : 'Jindrich.Stovicek.zip#Jind²ich µ£ovíƒek/µt╪σátko/µ'
filesize : 0
modified : 2023-11-29T22:00:50Z
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'UNKNOWN'
    format  : 
    version : 
    mime    : 
    class   : 
    basis   : 
    warning : 'no match'

With extracting:

---
filename : 'Říčanský křišťál/Říčanský křišťál.txt'
filesize : 0
modified : 2023-11-29T21:59:26+01:00
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/111'
    format  : 'Plain Text File'
    version : 
    mime    : 'text/plain'
    class   : 
    basis   : 'extension match txt'
    warning : 'match on extension only'
---
filename : 'Štěňátko/Štěňátko.txt'
filesize : 0
modified : 2023-11-29T22:00:50+01:00
errors   : 'empty source'
matches  :
  - ns      : 'pronom'
    id      : 'x-fmt/111'
    format  : 'Plain Text File'
    version : 
    mime    : 'text/plain'
    class   : 
    basis   : 'extension match txt'
    warning : 'match on extension only'

Was just interested to take a look at this as we had problems with earlier DROID releases with the Māori language character set, but I had thought they were resolved. I guess we didn't process a lot of zips back in the day!

@elbre
Author

elbre commented Dec 4, 2023

Thank you for the workaround. Unfortunately, we are working on a workflow where ZIP files should also be acceptable.

@richardlehane

I did a little testing with this today. It looks like the file names within the zip file aren't UTF-8 or IBM437 (the default in the zip spec), but rather have the character encoding IBM852. I'm not really sure how you'd go about reliably detecting this during unzipping (though tools like 7-zip and WinZip seem to manage it so perhaps it is possible?):

image
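
The garbling in the "without extracting" output is consistent with that diagnosis. A minimal Python sketch, assuming the names were written as IBM852 and then read as IBM437 (`cp852` and `cp437` are Python's codec names; the two helper functions are hypothetical and not part of DROID or Siegfried):

```python
# Sketch: IBM852 bytes decoded as IBM437 give the garbled names seen above.

def misread_as_cp437(name: str, actual: str = "cp852") -> str:
    """Encode with the archive's assumed real code page, then decode as cp437."""
    return name.encode(actual).decode("cp437")


def repair_from_cp437(garbled: str, actual: str = "cp852") -> str:
    """Reverse the misreading. Only meaningful if `actual` is the true code page."""
    return garbled.encode("cp437").decode(actual)


original = "Jindřich Šťovíček"
garbled = misread_as_cp437(original)    # 'Jind²ich µ£ovíƒek', as in the report
restored = repair_from_cp437(garbled)   # 'Jindřich Šťovíček'
```

The reversal works only because cp437 maps every byte to a character, so no information is lost in the wrong decode. It does not solve the detection problem: the code page still has to be known or guessed.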

@elbre
Author

elbre commented Dec 6, 2023

I can also provide material created from the command line:
7z a -tzip -scsUTF-8 archiv.zip "Jindřich Šťovíček"
7-Zip 23.01 (x64) for Windows

archiv.zip

@richardlehane

> I can also provide material made in command line:
> 7z a -tzip -scsUTF-8 archiv.zip "Jindřich Šťovíček"
> 7-Zip 23.01 (x64) for Windows
>
> archiv.zip

archiv.zip still contains non-UTF-8 filenames. Try the -mcu flag instead...

7z a -tzip -mcu archiv.zip "Jindřich Šťovíček"
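
One way to check whether an archive's entries are marked as UTF-8 is to inspect general purpose bit 11 (the language encoding flag in the zip spec) on each entry. A minimal Python sketch using the standard `zipfile` module; the function name `names_flagged_utf8` and the in-memory test archive are illustrative assumptions, and Python's own writer is used here rather than 7-Zip:

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: entry name is stored as UTF-8


def names_flagged_utf8(zip_source) -> dict:
    """Map each entry name to whether its UTF-8 flag is set."""
    with zipfile.ZipFile(zip_source) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }


# Build a small in-memory archive to exercise the check.
# Python's zipfile sets the flag itself when a name is not pure ASCII.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Štěňátko/Štěňátko.txt", "")
    zf.writestr("aaa.txt", "")
buf.seek(0)

report = names_flagged_utf8(buf)
```

An unset flag is only a concern for names with non-ASCII characters; pure ASCII names read the same either way. To check a file on disk, pass its path instead of the buffer, for example `names_flagged_utf8("archiv.zip")`.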

@elbre
Author

elbre commented Dec 11, 2023

Good day once again.
I would like to thank you for your comments on this matter, especially regarding the -mcu flag. I was informed that this parameter is missing in the documentation. At this point, I have been told that we can proceed with our project using the information you have provided so far, and from our perspective, the issue can be considered closed.

However, as mentioned earlier, the originally provided source was made with the default method for creating zip files, and we are highly likely to encounter such files in our region. Therefore, I would prefer to keep the issue open.
