Layout preserving text extraction #1131
-
The URL is invalid.
-
This is great! Would it make sense to add it to
-
@bexnoss - it is already part of PyMuPDF itself:
-
Can we extract the text in layout preserving mode from another script without writing any files to disk?
-
@canklot Assuming you want to invoke the
>>> import sys
>>> from fitz.__main__ import main as fitz_command
>>> cmd = "clean input.pdf output.pdf -pages 1,N".split() # prepare command line
>>> saved_parms = sys.argv[1:] # save original command line
>>> sys.argv[1:] = cmd # store new command line
>>> fitz_command() # execute module
>>> sys.argv[1:] = saved_parms # restore original command line
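
The save / replace / restore steps above can also be wrapped in a context manager, so that `sys.argv` is restored even if the invoked command raises. This is only a minimal sketch: `patched_argv` and `fake_command` are hypothetical names that are not part of PyMuPDF, and `fake_command` merely stands in for `fitz.__main__.main` so the snippet runs without any PDF files.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(args):
    """Temporarily replace sys.argv[1:] with args and restore it on exit."""
    saved = sys.argv[1:]        # save original command line
    sys.argv[1:] = list(args)   # store new command line
    try:
        yield
    finally:
        sys.argv[1:] = saved    # restore, even if the command raised


def fake_command():
    """Hypothetical stand-in for fitz.__main__.main: returns the arguments it sees."""
    return sys.argv[1:]


with patched_argv("clean input.pdf output.pdf -pages 1,N".split()):
    seen = fake_command()       # the command sees only the patched arguments
```

In real use, the body of the `with` block would call the imported `fitz_command()` instead of `fake_command()`.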
-
Will this work for scanned PDFs? Can we apply OCR and then try this?
-
There is a new script, fitzcli.py, which extracts document text in a layout-preserving way.
While this is new and certainly not bug-free, it produces quite encouraging results already.
Give it a try.