Mimetype detected by libmagic is inaccurate #545
Comments
Okay, this turns out to be an interesting question, but perhaps more about The thing is, a As far as I can tell, a zip archive has no mechanism for specifying the content type of the overall archive. The question remains though: "How can I think the answer lies in exploring how file does what it does in this case. I expect it's sensitive to the |
And to add another interesting twist to this:
So I'll try to find a chance to figure this out. |
Here's the relevant source file: https://github.com/threatstack/libmagic/blob/1249b5cd02c3b6fb9b917d16c76bc76c862932b6/magic/Magdir/msooxml Apparently the file order is important. You can test this by renaming the first 4 characters of the third file (the first ascii text after the 3rd |
Good insight @xsduan, looks to me that you've put your finger right on it. Here's the code that would need to be changed: The calling code is a few lines above. Like Word, we write the content types part first, followed by the package rels file, but after that it's first-come-first-served for the remaining parts. I suppose you could do a sort on parts by name, descending, if it meant a lot to you, maybe something like:
    ordered_parts = sorted(parts, key=lambda p: p.partname, reverse=True)
    for part in ordered_parts:
        ...
That would produce an ordering something like:
The other approach that might be more like what Word does is to have parts be a set (instead of the list they are now) and pop out Seems like a lot of trouble to go through though :) |
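For illustration, here is a minimal sketch of the descending sort suggested above. `Part` and `order_parts` are hypothetical stand-ins rather than python-docx names, and the part names are made-up examples, not read from a real package:

```python
from collections import namedtuple

# Hypothetical stand-in for a package part; only `partname` matters here.
Part = namedtuple("Part", "partname")


def order_parts(parts):
    """Sort parts by partname, descending, so /word/... sorts before /docProps/..."""
    return sorted(parts, key=lambda p: p.partname, reverse=True)


parts = [
    Part("/docProps/core.xml"),
    Part("/word/document.xml"),
    Part("/word/styles.xml"),
    Part("/docProps/app.xml"),
]

for part in order_parts(parts):
    print(part.partname)
```

Note that `key=` has to be passed by keyword in Python 3. Whether a plain descending sort is enough to satisfy libmagic would still need checking against the `file` command.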
I'm having this same issue: ahupp/python-magic#208 |
Hi, any chance of this being fixed? We submit our generated files to a third party and they keep rejecting our files because they don't recognize them as word documents. :-( |
@pbzrpa I think you're going to be on your own for this for the moment. As a workaround you can reorder the members in the resulting Zip archive (.docx file) with something like this suggested by ChatGPT. Note this just arranges the "files" (members in Zip parlance) in alphabetical order. You'll need to figure out what order they need to be in to suit libmagic. It looks like it wants the
    import zipfile

    # Input zip file and output zip file
    input_zip_filename = 'input.zip'
    output_zip_filename = 'sorted_output.zip'

    # Open the input zip file
    with zipfile.ZipFile(input_zip_filename, 'r') as input_zip:
        # Get a list of member names and sort them alphabetically
        member_names = input_zip.namelist()
        sorted_member_names = sorted(member_names)
        # Create a new output zip file
        with zipfile.ZipFile(output_zip_filename, 'w', zipfile.ZIP_DEFLATED) as output_zip:
            for member_name in sorted_member_names:
                # Get the content of the member from the input zip
                member_content = input_zip.read(member_name)
                # Add the member to the output zip with the same content
                output_zip.writestr(member_name, member_content)

    print(f'Sorted zip archive saved as {output_zip_filename}')
Please post a working version on the thread once you've worked it out so others can benefit :) |
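A variation on the same idea, as a rough sketch only: instead of sorting alphabetically, write a chosen few members first and leave the rest in their original order. The `PREFERRED_FIRST` tuple and the `reorder_members` helper are assumptions made up for this example. In particular, putting `word/document.xml` third is a guess based on the "third file" observation earlier in the thread, so check the result with `file` before relying on it.

```python
import zipfile

# Assumed priority order (a guess, not confirmed against libmagic);
# adjust after inspecting a few Word-created files.
PREFERRED_FIRST = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def reorder_members(src, dst, preferred=PREFERRED_FIRST):
    """Copy the zip at `src` to `dst`, writing the `preferred` members first.

    `src` and `dst` may be file paths or binary file-like objects.
    Members not named in `preferred` keep their original relative order.
    """
    with zipfile.ZipFile(src, "r") as zin:
        names = zin.namelist()
        first = [name for name in preferred if name in names]
        rest = [name for name in names if name not in first]
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for name in first + rest:
                # Re-compress each member's bytes under the same name.
                zout.writestr(name, zin.read(name))
```

Usage would be something like `reorder_members("input.docx", "reordered.docx")`.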
Btw, here's the libmagic code/config for this filetype: https://github.com/file/file/blob/master/magic/Magdir/msooxml Looks to me like all you have to do is make sure the first item is one of:
But I'd inspect a few MS Word-created files (
|
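For inspecting files, a small sketch like the following may help. It lists members in the order their local headers appear in the file, which is presumably the order that matters to libmagic since it scans from the start of the file. `members_in_file_order` is a hypothetical helper name; the `zipfile` calls are standard library.

```python
import zipfile


def members_in_file_order(path_or_file):
    """Return member names sorted by where each local header sits in the file."""
    with zipfile.ZipFile(path_or_file) as archive:
        infos = sorted(archive.infolist(), key=lambda info: info.header_offset)
    return [info.filename for info in infos]
```

Running this on a Word-created file and on a python-docx-created file side by side should show where the two orderings differ.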
@scanny In my case, the file that shows up as
|
Yeah, since writing that I've had occasion to look at a few different .docx files of various provenance, in particular ones produced by LibreOffice when converting from My conclusion is it's going to be unreliable to count on the ordering of the elements and that testing for the presence of certain members in any position is required. In particular I'm looking at one such converted document that Word opens just fine, and I think I would go with something like this:
    def is_docx(file: IO[bytes]) -> bool:
        file.seek(0)
        if not is_zip_archive():  # -- however that's efficiently managed --
            return False
        members = zip.members  # -- however that's efficiently done --
        if "[Content_Types].xml" not in members:
            # -- then it's not an Open Packaging Convention (OPC) package --
            return False
        if "word/document.xml" in members:
            return True
        return False
There could be some additional tests, but there is precious little else that a Word file must have, probably |
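One way the placeholders in that outline could be filled in, as a sketch using only the standard-library `zipfile` module. This is an assumption about how to do the zip check and member listing, not necessarily the most efficient way, and it applies only the two membership tests described above.

```python
import zipfile
from typing import IO


def is_docx(file: IO[bytes]) -> bool:
    """Guess whether `file` is a Word document by its zip members, in any order."""
    file.seek(0)
    if not zipfile.is_zipfile(file):
        return False
    file.seek(0)
    try:
        with zipfile.ZipFile(file) as archive:
            members = set(archive.namelist())
    except zipfile.BadZipFile:
        # Looked like a zip at first glance but the directory is unreadable.
        return False
    if "[Content_Types].xml" not in members:
        # Without the content-types part it is not an OPC package.
        return False
    return "word/document.xml" in members
```

Because the function takes a binary file object, it can be called as `is_docx(open("some.docx", "rb"))` or on an in-memory `io.BytesIO` in tests.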
If I create a file in Microsoft Word (Mac, version 16.17) and then check its mimetype using libmagic on the command line, I get the following output on Mac
However, files created by python-docx get a different mimetype
This affected me in some testing code that wanted to check that the document that was output was the correct type.
Can anyone give insight into this?