-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
big ball of mud addressing several issues #22
base: master
Are you sure you want to change the base?
Conversation
add -d/--details flag for all document data reduce calls to qn() in xml2text add DocxFile class for object-type access add get_output() for script invocation
increment version no.
text += '\t' | ||
elif child.tag in (qn('w:br'), qn('w:cr')): | ||
text += '\n' | ||
elif child.tag == qn("w:p"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
qn()
can be called as many as four times per node!
docx2txt/docx_file.py
Outdated
|
||
zipf = zipfile.ZipFile(path) | ||
paths = {} | ||
for fname in ['_rels/.rels', 'word/_rels/document.xml.rels']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at least in my office365 docx, the second file is named document2.xml.rels
so it might not be a good idea to hardcode this, but instead deduce its name from the content of the _rels/.rels
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of my files have this structure. I can't even fake it because there is no XML reference to word/_rels/document.xml.rels
.
I could guess where that file is, but would prefer to know. A sample file would be very helpful because I don't have a 365 subscription.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it helps, here you have a simple Word Online document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was extremely helpful. Looks like 365 does this for no particular reason.
OK, I get it now. That works but it's wrong, and so help us if Microsoft ever goes to from os.path import basename, dirname
src_path = '_rels/.rels'
src_selector = '/Relationships/Relationship[Type=http://schemas…officeDocument]'
src_attr = 'Target'
doc_path = some_xml_function(src_path, src_selector, src_attr)
doc_rels = '{}/_rels/{}.rels'.format(dirname(doc_path), basename(doc_path)) I'll push that fix after accounting for another strange possibility: multiple nodes of Type |
correctly locate document relationships nomenclature fixes refactor utility functions limit imports protect document properties remove header and footer from document text (it's different from page to page)
nomenclature
After a shallow dive into Open Packaging Conventions, this looks pretty faithful to the standard. |
This is pretty great & a significant functionality upgrade. @ankushshah89 Any chance of merging this into master? Failing that, I'm willing to go and repackage the updated version and get it on PyPI. But I'd prefer to see original authors (incl. @schmamps ) get direct credit. |
Yes I've had to use my fork (and open PR) version for years because office365 suport is not optional for me. I see there was one change recently by @ankushshah89 - with a version bump for a compatibility bug, but I'd love to know why various attempts to fix the support issue have not been merged. If the maintainer(s) would like somebody else to get involved please do let us know because the compatibility issue is now so frequent that this repo is effectively obsolete, unless the issue is addressed. It's almost 2 years since I attempted to fix the issue - and it's sad that there have been a few commits since, but the issue remains ignored. We want to help and make this work, it's great having things like this and nobody want's to be rude or impatient, but also nobody wants the upstream version of a module to be dysfunctional for years without some kind of action being taken. If we don't hear back from this then I certainly would back forking and publishing - out of desperation, rather than choice. |
My 'fork' (docx2python) should work with these. If you have files that do not work, please forward them to me.
…-------- Original message --------
From: Sam Morrow <[email protected]>
Date: 8/15/19 9:00 AM (GMT-06:00)
To: ankushshah89/python-docx2txt <[email protected]>
Cc: Subscribed <[email protected]>
Subject: Re: [ankushshah89/python-docx2txt] big ball of mud addressing several issues (#22)
Yes I've had to use my fork (and open PR) version for months because office365 suport is not optional for me. I see there was one change recently by @ankushshah89<https://github.com/ankushshah89> - with a version bump for a compatibility bug, but I'd love to know why various attempts to fix the support issue have not been merged.
If the maintainer(s) would like somebody else to get involved please do let us know because the compatibility issue is now so frequent that this repo is effectively obsolete, unless the issue is addressed.
It's almost 2 years since I attempted to fix the issue - and it's sad that there have been a few commits since, but the issue remains ignored.
We want to help and make this work, it's great having things like this and nobody want's to be rude or impatient, but also nobody wants the upstream version of a module to be dysfunctional for years without some kind of action being taken.
If we don't hear back from this then I certainly would back forking and publishing - out of desperation, rather than choice.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub<#22?email_source=notifications&email_token=ADAKIE6B7LTMTIBPECLX74LQEVORLA5CNFSM4FQJBZV2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4L4K6I#issuecomment-521651577>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ADAKIE6F4C3KITXDXKDUQ63QEVORLANCNFSM4FQJBZVQ>.
|
Yeah, there's a whole lot happening here. This started as just wanting the document properties, but the process ended up addressing issues and wrapping everything in a class to maintain backwards compatibility.
Issues/PRs Addressed
encode()
is only called in Python < 3.header
,.footer
,.main
and.images
propertieslstrip()
(although I haven't seen the problem)Bonus Features
Documentation
The README is thorough.
Performance
Additional XML parsing overhead is more than offset by reducing the complexity of
xml2text()
.Correctness
Errors are output to stderr instead of
print()
ed.