Mimetype detected by libmagic is inaccurate #545

Open
doronlh opened this issue Sep 12, 2018 · 10 comments

@doronlh

doronlh commented Sep 12, 2018

If I create a file in Microsoft Word (Mac, version 16.17) and then check its MIME type using libmagic on the command line, I get the following output on Mac:

$ file --mime file_created_in_word.docx
application/vnd.openxmlformats-officedocument.wordprocessingml.document

However, files created by python-docx get a different MIME type:

$ file --mime file_created_by_python-docx.docx
application/octet-stream

This affected me in some testing code that wanted to check that the document that was output was the correct type.
Can anyone give insight into this?

@scanny
Contributor

scanny commented Sep 12, 2018

Okay, this turns out to be an interesting question, but perhaps more about file.

The thing is, a .docx file is a zip archive with extension .docx instead of .zip. application/octet-stream is a reasonable MIME-type for a zip archive and maybe the default as far as file is concerned.

As far as I can tell, a zip archive has no mechanism for specifying the content type of the overall archive. The question remains though: "How can file distinguish a Word-created .docx file from another zip archive also containing a valid Word document?"

I think the answer lies in exploring how file does what it does in this case. I expect it's sensitive to the .docx extension, but then perhaps looks for an additional magic cookie somewhere or something. If you can discover what that is I'll consider adding it as a feature of python-docx for this purpose.

@doronlh
Author

doronlh commented Sep 13, 2018

And to add another interesting twist to this:
if you run the command without the --mime switch you get the following:

$ file file_created_by_word.docx
Microsoft Word 2007+
$ file file_created_by_python-docx.docx
Microsoft OOXML

So file does recognise a python-docx produced file as a Microsoft Office file, effectively looking inside the contents of the zip.

I'll try to find a chance to figure this out.

@xsduan

xsduan commented Sep 15, 2018

Here's the relevant source file: https://github.com/threatstack/libmagic/blob/1249b5cd02c3b6fb9b917d16c76bc76c862932b6/magic/Magdir/msooxml

Apparently the file order is important. You can test this by overwriting the first 5 characters of the third file's name (the first ASCII text after the 3rd PK\x03\x04 in the file) with word/, and it'll register correctly as Microsoft Word 2007+. python-docx seems to put docProps/core.xml as the 3rd file (because it puts all the files in alphabetical order).
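
To check which member actually comes third in a given file, a minimal sketch along these lines may help. It lists members by the position of their local header in the file, which is the order a scanner reading from the start of the file would see. The function name members_in_file_order is made up for illustration; only the standard-library zipfile module is used.

```python
import zipfile


def members_in_file_order(path_or_file):
    """Return member names sorted by where their local header sits in the file."""
    with zipfile.ZipFile(path_or_file) as zf:
        # header_offset is the byte offset of each member's local file header
        infos = sorted(zf.infolist(), key=lambda info: info.header_offset)
    return [info.filename for info in infos]
```

Calling members_in_file_order("file_created_by_python-docx.docx") and looking at the third entry should show whether it starts with word/ or not.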

@scanny
Contributor

scanny commented Sep 15, 2018

Good insight @xsduan, looks to me that you've put your finger right on it.

Here's the code that would need to be changed:
https://github.com/python-openxml/python-docx/blob/master/docx/opc/pkgwriter.py#L47-L56

The calling code is a few lines above. Like Word, we write the content types part first, followed by the package rels file, but after that it's first-come-first-served for the remaining parts.

I suppose you could do a sort on parts by name, descending if it meant a lot to you, maybe something like:

ordered_parts = sorted(parts, key=lambda p: p.partname, reverse=True)
for part in ordered_parts:
    ...

That would produce an ordering something like:

  • word/webSettings.xml
  • word/theme/theme1.xml
  • word/stylesWithEffects.xml
  • word/styles.xml
  • word/settings.xml
  • word/header1.xml
  • word/footnotes.xml
  • word/footer1.xml
  • word/fontTable.xml
  • word/endnotes.xml
  • word/document.xml
  • word/_rels/document.xml.rels
  • docProps/thumbnail.jpeg
  • docProps/core.xml
  • docProps/app.xml

The other approach that might be more like what Word does is to have parts be a set (instead of the list they are now) and pop out word/_rels/document.xml.rels first and then word/document.xml and then let the rest be just as they come.
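
A minimal sketch of that second ordering, working on plain member-name strings rather than on python-docx part objects. The names PRIORITY and word_like_order are hypothetical and not part of python-docx; this only illustrates the "pull two members to the front, leave the rest as they come" idea.

```python
# Members to emit first, in this order (taken from the comment above).
PRIORITY = ("word/_rels/document.xml.rels", "word/document.xml")


def word_like_order(names):
    """Return names with the PRIORITY members first and the rest unchanged."""
    names = list(names)
    first = [name for name in PRIORITY if name in names]
    rest = [name for name in names if name not in PRIORITY]
    return first + rest
```

Members missing from the input are simply skipped, so the function is safe to call on packages that lack one of the priority names.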

Seems like a lot of trouble to go through though :)

@kyprifog

kyprifog commented Sep 1, 2020

I'm having this same issue: ahupp/python-magic#208

@pbzrpa

pbzrpa commented Sep 5, 2023

Hi, any chance of this being fixed? We submit our generated files to a third party and they keep rejecting our files because they don't recognize them as word documents. :-(

@scanny
Contributor

scanny commented Sep 6, 2023

@pbzrpa I think you're going to be on your own for this for the moment. As a workaround you can reorder the members in the resulting Zip archive (.docx file) with something like this suggested by ChatGPT. Note this just arranges the "files" (members in Zip parlance) in alphabetical order. You'll need to figure out what order they need to be in to suit libmagic. It looks like it wants the [Content_Types].xml to appear first, but I would inspect a few MS-Word-created documents and see if you can discern the desired pattern:

import zipfile

# Input zip file and output zip file
input_zip_filename = 'input.zip'
output_zip_filename = 'sorted_output.zip'

# Open the input zip file
with zipfile.ZipFile(input_zip_filename, 'r') as input_zip:
    # Get a list of member names and sort them alphabetically
    member_names = input_zip.namelist()
    sorted_member_names = sorted(member_names)

    # Create a new output zip file
    with zipfile.ZipFile(output_zip_filename, 'w', zipfile.ZIP_DEFLATED) as output_zip:
        for member_name in sorted_member_names:
            # Get the content of the member from the input zip
            member_content = input_zip.read(member_name)
            
            # Add the member to the output zip with the same content
            output_zip.writestr(member_name, member_content)

print(f'Sorted zip archive saved as {output_zip_filename}')

Please post a working version on the thread once you've worked it out so others can benefit :)

@scanny
Contributor

scanny commented Sep 6, 2023

Btw, here's the libmagic code/config for this filetype: https://github.com/file/file/blob/master/magic/Magdir/msooxml

Looks to me like all you have to do is make sure the first item is one of:

  • [Content_Types].xml
  • _rels/.rels
  • docProps
  • customXml

But I'd inspect a few MS Word-created files (unzip -l document.docx on mac or Linux) and see if you can determine the pattern and copy that. Looks like it might be just [Content_Types].xml comes first, here's an example:

$ unzip -l blk-containing-table.docx
Archive:  blk-containing-table.docx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1474  01-01-1980 00:00   [Content_Types].xml
      735  01-01-1980 00:00   _rels/.rels
      953  01-01-1980 00:00   word/_rels/document.xml.rels
     2491  01-01-1980 00:00   word/document.xml
     7643  01-01-1980 00:00   word/theme/theme1.xml
    11616  01-01-1980 00:00   docProps/thumbnail.jpeg
     2380  01-01-1980 00:00   word/settings.xml
      431  01-01-1980 00:00   word/webSettings.xml
    16511  01-01-1980 00:00   word/stylesWithEffects.xml
      749  01-01-1980 00:00   docProps/core.xml
    15645  01-01-1980 00:00   word/styles.xml
     2023  01-01-1980 00:00   word/fontTable.xml
      731  01-01-1980 00:00   docProps/app.xml
---------                     -------
    63382                     13 files

@FeeeeK

FeeeeK commented Oct 3, 2023

@scanny In my case, the file that shows up as Microsoft Word 2007+ has this structure:

[Content_Types].xml
_rels/.rels
word/document.xml
word/_rels/document.xml.rels
word/media/image1.jpeg
.....

word/_rels/document.xml.rels comes after word/document.xml, so maybe only [Content_Types].xml and _rels/.rels are needed to get the right type.
(A file with a completely different order is displayed as a zip archive, btw.)
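
Based on the structure reported here, a post-processing workaround might look like the sketch below. It rewrites a .docx so that [Content_Types].xml, _rels/.rels and word/document.xml come first and everything else keeps its existing order. The names LEADING and reorder_docx are made up for illustration, and the assumption that this ordering satisfies libmagic comes only from the observations in this thread; it has not been verified against file itself.

```python
import zipfile

# Assumed leading order, taken from the Word-created file described above.
LEADING = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def reorder_docx(src, dst):
    """Copy the zip at src to dst, writing the LEADING members first."""
    with zipfile.ZipFile(src) as zin:
        names = zin.namelist()
        ordered = [name for name in LEADING if name in names]
        ordered += [name for name in names if name not in LEADING]
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for name in ordered:
                # Member bytes are copied unchanged; only the order differs.
                zout.writestr(name, zin.read(name))
```

Both src and dst can be paths or binary file-like objects. Running file on the result is the only way to confirm whether the reported type changes.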

@scanny
Contributor

scanny commented Oct 3, 2023

Yeah, since writing that I've had occasion to look at a few different .docx files of various provenance, in particular ones produced by LibreOffice when converting from .odt and .doc format.

My conclusion is it's going to be unreliable to count on the ordering of the elements and that testing for the presence of certain members in any position is required.

In particular I'm looking at one such converted document that Word opens just fine, and [Content_Types].xml appears last among the members.

I think I would go with something like this:

import zipfile
from typing import IO

def is_docx(file: IO[bytes]) -> bool:
    """Return True if `file` looks like a Word (.docx) package, whatever the member order."""
    file.seek(0)
    if not zipfile.is_zipfile(file):
        return False
    file.seek(0)
    with zipfile.ZipFile(file) as zf:
        members = set(zf.namelist())
    if "[Content_Types].xml" not in members:
        # -- then it's not an Open Packaging Convention (OPC) package --
        return False
    # -- main document part present in any position, so treat it as a Word file --
    return "word/document.xml" in members

There could be some additional tests, but there is precious little else that a Word file must have, probably _rels/.rels and maybe word/_rels/document.xml.rels.

@scanny scanny added the opc label Oct 3, 2023