big ball of mud addressing several issues #22

schmamps · 2018-08-17T22:13:53Z

Yeah, there's a whole lot happening here. This started as just wanting the document properties, but the process ended up addressing issues and wrapping everything in a class to maintain backwards compatibility.

Issues/PRs Addressed

py3 support #11 - encode() is only called in Python < 3
How to differentiate between header text vs paragraph text? #15 and Image Paths in generated Text #21 - class DocxFile has .header, .footer, .main and .images properties
docx created with word online #16 - header/main/footer text is discovered via XML parsing instead of file name matching
hack around inconsistent format #18 - incorporate lstrip() (although I haven't seen the problem)
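
The Python 3 item (#11) comes down to a version guard around the encode() call. A minimal sketch, assuming UTF-8 output; the helper name prepare_output is hypothetical and not part of docx2txt:

```python
import sys


def prepare_output(text):
    """Return text ready for writing to stdout.

    Python 2 needs explicit bytes to avoid implicit ASCII encoding errors;
    Python 3 strings are already Unicode and are passed through unchanged.
    """
    if sys.version_info[0] < 3:
        # Assumption: UTF-8 is the desired output encoding.
        return text.encode('utf-8')
    return text
```

On Python 3 the function is a no-op, which is the point of the guard.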

Bonus Features

Documentation

The README is thorough.

Performance

Additional XML parsing overhead is more than offset by reducing the complexity of xml2text().

Correctness

Errors are output to stderr instead of print()ed.
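
For illustration, writing to stderr rather than printing can look like the following. This is a sketch only; report_error and its message prefix are hypothetical names, not the functions in this PR:

```python
import sys


def report_error(message):
    """Write an error message to stderr so stdout carries only document text."""
    # sys.stderr is looked up at call time, so redirection still works.
    sys.stderr.write('error: {}\n'.format(message))
```

Keeping errors off stdout means the extracted text can be piped to another program without error messages mixed in.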

add -d/--details flag for all document data
reduce calls to qn() in xml2text
add DocxFile class for object-type access
add get_output() for script invocation

increment version no.

schmamps · 2018-08-17T22:15:55Z

docx2txt/docx2txt.py

-            text += '\t'
-        elif child.tag in (qn('w:br'), qn('w:cr')):
-            text += '\n'
-        elif child.tag == qn("w:p"):


qn() can be called as many as four times per node!
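
One way to avoid the repeated calls is to resolve each qualified name once, outside the loop. The sketch below is an illustration of that idea, not the code in this PR; the TAG_TEXT table, the W_T constant, and the '\n\n' emitted for w:p are assumptions:

```python
import xml.etree.ElementTree as ET

NSMAP = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}


def qn(tag):
    """Turn a prefixed tag such as 'w:p' into ElementTree's '{uri}p' form."""
    prefix, tagroot = tag.split(':')
    return '{{{}}}{}'.format(NSMAP[prefix], tagroot)


# Qualified names are computed once at import time, not once per node.
W_T = qn('w:t')
TAG_TEXT = {
    qn('w:tab'): '\t',
    qn('w:br'): '\n',
    qn('w:cr'): '\n',
    qn('w:p'): '\n\n',  # assumption: paragraphs start with a blank line
}


def xml2text(xml):
    """Collect the text of a WordprocessingML part in document order."""
    parts = []
    for child in ET.fromstring(xml).iter():
        if child.tag == W_T:
            parts.append(child.text or '')
        else:
            # Tags that contribute no text map to the empty string.
            parts.append(TAG_TEXT.get(child.tag, ''))
    return ''.join(parts)
```

Each node then costs one comparison and at most one dictionary lookup.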

agusmba · 2019-02-06T16:40:44Z

docx2txt/docx_file.py

+
+    zipf = zipfile.ZipFile(path)
+    paths = {}
+    for fname in ['_rels/.rels', 'word/_rels/document.xml.rels']:


At least in my Office 365 docx, the second file is named document2.xml.rels, so it might not be a good idea to hardcode this. Its name could instead be deduced from the content of the _rels/.rels file.

schmamps · 2019-02-07

None of my files have this structure. I can't even fake it because there is no XML reference to word/_rels/document.xml.rels.

I could guess where that file is, but I would prefer to know. A sample file would be very helpful because I don't have a 365 subscription.

agusmba · 2019-02-08

If it helps, here is a simple Word Online document.

SimpleDocument.zip

schmamps · 2019-02-08

It was extremely helpful. Looks like 365 does this for no particular reason.

schmamps · 2019-02-13T22:19:28Z

OK, I get it now. That works but it's wrong, and so help us if Microsoft ever goes to document3.xml, slartibartfast.xml, or whatever. To simplify, The Right Way to find that second file is more like:

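from os.path import basename, dirname

src_path = '_rels/.rels'
# XPath attribute tests take the form [@name="value"]
src_selector = '/Relationships/Relationship[@Type="http://schemas…officeDocument"]'
src_attr = 'Target'

# some_xml_function is a placeholder: read src_path from the archive and
# return the src_attr value of the node matching src_selector
doc_path = some_xml_function(src_path, src_selector, src_attr)
doc_rels = '{}/_rels/{}.rels'.format(dirname(doc_path), basename(doc_path))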

I'll push that fix after accounting for another strange possibility: multiple nodes of Type officeDocument.
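
A runnable version of that idea might look like this. It is a sketch under stated assumptions, not the fix pushed to this branch: find_document_paths is a hypothetical name, the namespace and relationship type URIs are the ones I understand Open Packaging Conventions to define, and when several officeDocument nodes exist it simply takes the first:

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace and relationship type as defined by Open Packaging Conventions.
PKG_RELS_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'
OFFICE_DOC_TYPE = (
    'http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument'
)


def find_document_paths(zipf):
    """Return (main document path, path of its .rels file) for an open zip."""
    root = ET.fromstring(zipf.read('_rels/.rels'))
    for rel in root.iter('{%s}Relationship' % PKG_RELS_NS):
        if rel.get('Type') == OFFICE_DOC_TYPE:
            # Targets may carry a leading slash; zip member names do not.
            doc_path = rel.get('Target').lstrip('/')
            rels_path = posixpath.join(
                posixpath.dirname(doc_path),
                '_rels',
                posixpath.basename(doc_path) + '.rels',
            )
            # Assumption: the first match wins if several nodes qualify.
            return doc_path, rels_path
    raise ValueError('no officeDocument relationship in _rels/.rels')


# Hypothetical in-memory package imitating the Word Online layout.
RELS_XML = (
    '<Relationships xmlns="%s">'
    '<Relationship Id="rId1" Type="%s" Target="word/document2.xml"/>'
    '</Relationships>' % (PKG_RELS_NS, OFFICE_DOC_TYPE)
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zipf:
    zipf.writestr('_rels/.rels', RELS_XML)
buffer.seek(0)
with zipfile.ZipFile(buffer) as zipf:
    doc_path, doc_rels = find_document_paths(zipf)
```

posixpath is used instead of os.path because zip member names always use forward slashes, regardless of the host platform.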

correctly locate document relationships
nomenclature fixes
refactor utility functions
limit imports
protect document properties
remove header and footer from document text (it's different from page to page)

nomenclature

schmamps · 2019-02-15T20:44:59Z

After a shallow dive into Open Packaging Conventions, this looks pretty faithful to the standard.

JMBurley · 2019-06-20T15:24:37Z

This is pretty great & a significant functionality upgrade. @ankushshah89 Any chance of merging this into master?

Failing that, I'm willing to go and repackage the updated version and get it on PyPI. But I'd prefer to see the original authors (incl. @schmamps) get direct credit.

SamMorrowDrums · 2019-08-15T14:00:50Z

Yes, I've had to use my fork (and open PR) version for years because Office 365 support is not optional for me. I see there was one change recently by @ankushshah89, with a version bump for a compatibility bug, but I'd love to know why various attempts to fix the support issue have not been merged.

If the maintainer(s) would like somebody else to get involved please do let us know because the compatibility issue is now so frequent that this repo is effectively obsolete, unless the issue is addressed.

It's almost 2 years since I attempted to fix the issue - and it's sad that there have been a few commits since, but the issue remains ignored.

We want to help and make this work. It's great having things like this, and nobody wants to be rude or impatient, but also nobody wants the upstream version of a module to be dysfunctional for years without some kind of action being taken.

If we don't hear back from this then I certainly would back forking and publishing - out of desperation, rather than choice.

ShayHill · 2019-08-16T15:18:50Z

My 'fork' (docx2python) should work with these. If you have files that do not work, please forward them to me.

schmamps added 11 commits August 17, 2018 13:04

clear linter messages

a54c859

write errors to stderr

b1ffc88

use get_output for invocation

ecf5d24

export get_output() for script invocation

bbc35a6

refactor DocxFile

2cb5a30

export DocxFile

ba3fd0b

get image filenames if not extracting

e140f48

full documentation

a221843

clarify unqualify namespace

4250f17

formatting

8272a65

clarify dict.get() mention

8593a38

schmamps added 8 commits August 17, 2018 15:36

update for GitHub anchors

4fa534b

update installed version

334b3f8

follow my own advice about dict.get()

f3078ea

simplify .text property

3da2e17

add get_path() to support addinfourl, TextIOWrapper classes

bea08ba

add HTTPResponse support

62c97c1

add HTTPResponse support

986c322

simple merge

0a76855

Office 365 support

e66b089

schmamps added 3 commits February 14, 2019 13:39

limit os import

12e71e4

add context management to zipfile

31eec56

documentation

045571a

schmamps changed the title ~~V080/pull request~~ big ball of mud addressing several issues Feb 15, 2019
