HTML output throws away text block information #1229
Replies: 2 comments 1 reply
The (X)HTML and XML outputs are thin wrappers around MuPDF functions. I have no influence on the output.
To be honest, and to temper your expectations: I doubt that Artifex will accept this as a bug, because no data are missing, are they? They will probably (but I may be wrong) view this as purely cosmetic. Similar to MuPDF's logic for identifying blocks, you might deduce that information from the (vertical) distance between consecutive lines.
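A minimal sketch of that idea, not MuPDF's actual block detection: it assumes each line is available as a `(y0, y1, text)` tuple (top, bottom, content) and starts a new block whenever the gap to the previous line exceeds a fraction of that line's height. The function name `group_lines_into_blocks`, the tuple layout and the `gap_factor` threshold are all assumptions made for illustration; how the coordinates are obtained is left open.

```python
from typing import List, Tuple

# Hypothetical line representation: (top, bottom, text).
Line = Tuple[float, float, str]


def group_lines_into_blocks(lines: List[Line], gap_factor: float = 0.5) -> List[List[str]]:
    """Group lines into blocks based on the vertical gap between neighbours.

    A new block starts when the gap between the previous line's bottom and
    the current line's top is larger than gap_factor times the previous
    line's height. The threshold is a guess and may need tuning per document.
    """
    blocks: List[List[str]] = []
    prev_bottom = None
    prev_height = None
    # Process lines top to bottom.
    for y0, y1, text in sorted(lines, key=lambda line: line[0]):
        if prev_bottom is None or (y0 - prev_bottom) > gap_factor * prev_height:
            blocks.append([])  # first line, or gap large enough: open a new block
        blocks[-1].append(text)
        prev_bottom, prev_height = y1, y1 - y0
    return blocks
```

With 12pt-high lines spaced 2pt apart inside a paragraph and a 22pt gap between paragraphs, the default threshold of 6pt separates the two paragraphs. A multi-column page would need the horizontal position taken into account as well, which this sketch ignores.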
Describe the bug (mandatory)
Converting a PDF to HTML throws away part of the document structure.
Other output types (e.g. `dict`) produce the following structure:
But HTML has the following:
To Reproduce (mandatory)
This happens on any PDF, but I'm using lorem-two-para.pdf that looks like this:
Calling things like `get_text('dict')` and `get_text('blocks')` shows that PyMuPDF correctly interprets the document as having two text blocks/paragraphs:
But `get_text('html')` looks like this (note that I've removed the styling data for brevity):
The output flattens the lines (represented by `p` tags) and there's no way of knowing which lines belong to the same text block. This is a problem for me as I need to post-process the HTML output in a way that requires the block information.
Expected behavior (optional)
Ideally the HTML output would (at least by way of an optional argument) include the text block part of the structure as an element, e.g.
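Until such an option exists, one possible workaround is to build block-aware HTML from the `dict` output instead of post-processing `get_text('html')`. The sketch below is only an illustration, not PyMuPDF functionality: the helper name `dict_to_block_html`, the `div` wrapper and the `class="block"` attribute are assumptions, and all styling information is dropped. It relies only on the `blocks` / `lines` / `spans` / `text` keys of the dictionary returned by `get_text('dict')`.

```python
from html import escape


def dict_to_block_html(page_dict: dict) -> str:
    """Render a get_text('dict') result as HTML with one <div> per text block.

    Each line becomes a <p> inside its block's <div>. Image blocks and all
    styling (fonts, sizes, positions) are ignored in this sketch.
    """
    parts = []
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is a text block; skip others
            continue
        parts.append('<div class="block">')  # hypothetical wrapper element
        for line in block.get("lines", []):
            # A line's text is the concatenation of its spans.
            text = "".join(span.get("text", "") for span in line.get("spans", []))
            parts.append(f"  <p>{escape(text)}</p>")
        parts.append("</div>")
    return "\n".join(parts)
```

Usage would look like `html = dict_to_block_html(page.get_text('dict'))` for a loaded page. Because the result is generated from the same block structure that `dict` reports, the grouping of lines matches what `get_text('blocks')` shows.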
Your configuration (mandatory)