docx created with word online #16

burbma · 2017-10-05T17:33:33Z

Line 87 in f71c423

doc_xml = 'word/document.xml'

If I create a docx in SharePoint it takes me to Word Online. I add some text and it saves automatically. Then I download the file.

Now I do the following:

import zipfile
zip = zipfile.ZipFile('path/to/file.docx')
xml = zip.read('word/document.xml')

This fails with KeyError: "There is no item named 'word/document.xml' in the archive"

There is, however, a 'word/document2.xml' which contains (at least for my one trial case) the same as 'word/document.xml'. I discovered this by opening 'path/to/file.docx' in actual Microsoft Word on my local machine and then saving the file. NOW when I do zip.read('word/document.xml') the xml file is there as expected.

I really don't know much about this stuff or why creating a file with Word Online appears to create something different than local Word does. Thus I don't know what the best solution is. It seems hack-ish to just put a line in the code that says "if you can't find 'word/document.xml', look for 'word/document2.xml'", but maybe that's all we need. Let me know.

SamMorrowDrums · 2017-11-22T10:32:14Z

@mmb90 the only alternative I can see is to look for an item in the list of documents in the zipfile that matches word/document\d?.xml, but frankly I think what I've done is enough. Stupid inconsistent format. There's probably some spec-defined way too, but I really don't care enough to put more time in.
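For readers who want to try the name-matching idea, here is a minimal sketch. It is not the code from the PR; the pattern `word/document\d*\.xml` and the helper names `find_document_part` and `make_fake_docx` are assumptions made for illustration, and the demo archive is a bare zip rather than a valid docx.

```python
import io
import re
import zipfile

# Assumed pattern: word/document.xml, word/document2.xml, word/document13.xml, ...
DOCUMENT_PART = re.compile(r"word/document\d*\.xml")


def find_document_part(archive):
    """Return the first archive member that looks like the main document part."""
    for name in archive.namelist():
        if DOCUMENT_PART.fullmatch(name):
            return name
    raise KeyError("no word/document*.xml member found in the archive")


def make_fake_docx(member_name):
    """Hypothetical helper: a tiny in-memory zip for the demo, not a real docx."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, "<w:document/>")
    buffer.seek(0)
    return buffer


with zipfile.ZipFile(make_fake_docx("word/document2.xml")) as archive:
    print(find_document_part(archive))
```

Matching on names is a guess about how Word Online numbers the file, so it can pick the wrong member if an archive ever contains more than one candidate.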

fjouault · 2018-01-11T13:29:59Z

Actually, it seems that _rels/.rels (an XML file zipped in the docx) contains the name of the document file as the Target attribute of the Relationship element with Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument".

For instance:
<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="/word/document2.xml"/>
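A minimal sketch of reading that relationship with the standard library, assuming the .rels file uses the usual package relationships namespace. The function name `main_document_part` and the helper `build_sample_package` are made up for this example, and the sample archive only mimics the Word Online layout described in this thread.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the elements inside a .rels part (assumed to be the standard one).
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Relationship type quoted in the comment above.
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def main_document_part(archive):
    """Return the zip member name of the main document part."""
    root = ET.fromstring(archive.read("_rels/.rels"))
    for rel in root.iter("{%s}Relationship" % RELS_NS):
        if rel.get("Type") == OFFICE_DOCUMENT:
            # Zip member names carry no leading slash, so strip it.
            return rel.get("Target").lstrip("/")
    raise KeyError("no officeDocument relationship in _rels/.rels")


def build_sample_package():
    """Hypothetical helper: an in-memory zip shaped like the Word Online file."""
    rels = (
        '<Relationships xmlns="%s">'
        '<Relationship Id="rId1" Type="%s" Target="/word/document2.xml"/>'
        "</Relationships>" % (RELS_NS, OFFICE_DOCUMENT)
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("_rels/.rels", rels)
        archive.writestr("word/document2.xml", "<document/>")
    buffer.seek(0)
    return buffer


with zipfile.ZipFile(build_sample_package()) as archive:
    print(main_document_part(archive))
```

Unlike matching on file names, this follows whatever name the package itself declares, so it does not depend on how the file happens to be numbered.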

SamMorrowDrums · 2018-03-19T12:39:11Z

Yep, @fjouault I've updated my PR to reflect this. As far as I can tell this solution is robust. I'd appreciate any help with manually testing it, but hopefully this is ready to merge (or at least very close).

ShayHill · 2019-07-18T16:07:49Z

For what it's worth, I've come across a document where the rels file is wrong.

_rels/.rels correctly identifies officeDocument as "word/document.xml"

HOWEVER

word/_rels/document.xml.rels identifies header and footer as "word/header.xml" and "word/footer.xml" (not "header.xml" and "footer.xml" as they should be). I don't know the history of this file, and it's only 1 of 6000 I'm testing with, but the potential appears to be there. I would like to post the file, but removing proprietary information then re-saving corrects the problem.
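One way to tolerate a file like this is sketched below. It is an assumption about how a reader might cope, not something docx2txt is known to do: resolve the target relative to the folder of the source part first, and fall back to treating it as relative to the package root. The function name `resolve_target` and the sample member names are hypothetical.

```python
import posixpath


def resolve_target(source_part, target, members):
    """Map a relationship Target to a zip member name, with a lenient fallback."""
    if target.startswith("/"):
        # Absolute target: relative to the package root.
        candidates = [target.lstrip("/")]
    else:
        base_dir = posixpath.dirname(source_part)
        candidates = [
            # Normal case: relative to the folder holding the source part.
            posixpath.normpath(posixpath.join(base_dir, target)),
            # Lenient fallback: the target already includes the folder name.
            posixpath.normpath(target),
        ]
    for candidate in candidates:
        if candidate in members:
            return candidate
    raise KeyError("cannot resolve %r from %r" % (target, source_part))


members = {"word/document.xml", "word/header.xml", "word/footer.xml"}
print(resolve_target("word/document.xml", "header.xml", members))
print(resolve_target("word/document.xml", "word/footer.xml", members))
```

Both calls resolve to members under word/, because the fallback is only tried when the normal resolution finds nothing in the archive.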

wendywangwwt · 2020-09-04T21:18:52Z

I'm getting this issue and a temporary fix I created is as follows:

path = '/opt/conda/envs/Python-3.6-WMLCE/lib/python3.6/site-packages/docx2txt/docx2txt.py'
with open(path,'r') as f:
    script = f.readlines()

with open(path,'w') as f:
    for i,line in enumerate(script):
        if i == 86:
            line = "    doc_xml = [re.findall('(word\/document.*)',fn)[0] for fn in filelist if len(re.findall('(word\/document.*)',fn)) > 0][0]\n"
        f.write(line)

Basically, I replace the hard-coded XML path (line 87) with a regular expression search, as shown above. We do it this way because our environment is containerized, so every time the package is reinstalled we have to change this line again. For those who run this in their own long-lived environment, simply replace line 87 with the following:

doc_xml = [re.findall('(word\/document.*)',fn)[0] for fn in filelist if len(re.findall('(word\/document.*)',fn)) > 0][0]

It's definitely not perfect and can be improved.. but anyway it solves my problem :)

SamMorrowDrums mentioned this issue Nov 20, 2017

hack around inconsistent format #18

Closed

schmamps mentioned this issue Aug 17, 2018

big ball of mud addressing several issues #22

Open

danielrbrowne mentioned this issue Feb 6, 2019

Trivial docx file fails to be parsed with 'couldn't parse docx file' error jgm/pandoc#5277

Closed

markfullmer mentioned this issue Feb 26, 2020

Error converting .docx files with rel writecrow/corpus_text_processor#10

Open
