HTML output throws away text block information #1229

ejohb · 2021-08-24T11:03:09Z

Describe the bug (mandatory)

Converting a PDF to HTML throws away part of the document structure.

Other output types (e.g. dict) produce the following structure:

page
- text block
  - line
    - span

But HTML has the following:

page
- line
  - span

To Reproduce (mandatory)

This happens with any PDF, but I'm using lorem-two-para.pdf, which looks like this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Calling get_text('dict') or get_text('blocks') shows that PyMuPDF correctly interprets the document as having two text blocks/paragraphs. For example, get_text('blocks') returns:

[
    (56.76000213623047, 70.90341186523438, 524.3014526367188, 96.03419494628906,
     'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna \naliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. \n',
     0, 0),
    (56.76000213623047, 121.30341339111328, 503.4024658203125, 146.43418884277344,
     'Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint \noccaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n',
     1, 0)
]

But get_text('html') looks like this (note that I've removed the styling data for brevity):

<div id="page0">
    <p>
        <span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
    </p>
    <p>
        <span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
    </p>
    <p>
        <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
    </p>
    <p>
        <span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
    </p>
</div>

The output flattens the lines (represented as p tags), and there's no way of knowing which lines belong to the same text block. This is a problem for me, as I need to post-process the HTML output in a way that requires the block information.

Expected behavior (optional)

Ideally, the HTML output would (at least by way of an optional argument) include the text block level of the structure as an element, e.g.

<div id="page0">
    <div id="page0block0">
        <p>
            <span>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna </span>
        </p>
        <p>
            <span>aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. </span>
        </p>
    </div>
    <div id="page0block1">
        <p>
            <span>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint </span>
        </p>
        <p>
            <span>occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.</span>
        </p>
    </div>
</div>
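
As a possible workaround until the HTML output carries blocks, the nesting above could be built from the dict output instead. The following is only a minimal sketch: `dict_to_html` is a hypothetical helper, not part of PyMuPDF, and it assumes the dict has a "blocks" list whose text blocks (type 0) contain "lines", each with "spans" that have a "text" key. Styling is ignored entirely.

```python
from html import escape


def dict_to_html(page_dict, page_no=0):
    """Build HTML that keeps the block level from a get_text('dict')-style dict.

    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    out = [f'<div id="page{page_no}">']
    for block_no, block in enumerate(page_dict.get("blocks", [])):
        if block.get("type", 0) != 0:
            # Skip non-text (image) blocks; block ids may then have gaps.
            continue
        out.append(f'    <div id="page{page_no}block{block_no}">')
        for line in block.get("lines", []):
            out.append("        <p>")
            for span in line.get("spans", []):
                text = escape(span.get("text", ""))
                out.append(f"            <span>{text}</span>")
            out.append("        </p>")
        out.append("    </div>")
    out.append("</div>")
    return "\n".join(out)


# Hand-written stand-in for page.get_text('dict'); a real dict has more keys.
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Lorem ipsum "}]},
            {"spans": [{"text": "aliqua. Ut enim "}]},
        ]},
        {"type": 0, "lines": [
            {"spans": [{"text": "Duis aute "}]},
            {"spans": [{"text": "occaecat cupidatat"}]},
        ]},
    ]
}
html_out = dict_to_html(sample)
print(html_out)
```

With a real page, the input would presumably come from page.get_text('dict'), and font or position styling would have to be added from the span data if the post-processing needs it.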

Your configuration (mandatory)

3.9.4 (default, Apr 10 2021, 15:31:19)
[GCC 8.3.0]
 linux

PyMuPDF 1.18.16: Python bindings for the MuPDF 1.18.0 library.
Version date: 2021-08-05 00:00:01.
Built for Python 3.9 on linux (64-bit).

JorjMcKie · 2021-08-24T11:09:15Z

Maintainer

The (X)HTML and XML outputs are thin wrappers around MuPDF functions. I have no influence on the output.
Please report your issue to https://bugs.ghostscript.com/enter_bug.cgi.
I will convert this to a Discussion with tag "upstream bug".

JorjMcKie · 2021-08-24T11:12:08Z

Maintainer

To be honest, and to temper your expectations: I doubt that Artifex will accept this as a bug, because no data is missing, is it? They will probably (but I may be wrong) view this as purely cosmetic.

Similar to MuPDF's logic for identifying blocks, you might deduce that information from the (vertical) distance between consecutive lines.
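
A rough sketch of that idea, assuming a bounding box (x0, y0, x1, y1) is available for every line. This is a simple heuristic and not MuPDF's actual block detection; `group_lines_by_gap` and the `gap_factor` threshold are made up for illustration, and the coordinates below are illustrative values loosely based on the block rectangles in the report.

```python
def group_lines_by_gap(line_boxes, gap_factor=0.5):
    """Group line bboxes (x0, y0, x1, y1) into blocks by vertical distance.

    A new block starts when the gap between the previous line's bottom and
    the current line's top exceeds gap_factor times the previous line height.
    Heuristic only; not MuPDF's real logic.
    """
    blocks = []
    prev = None
    for box in sorted(line_boxes, key=lambda b: b[1]):  # top to bottom
        if prev is None:
            starts_block = True
        else:
            gap = box[1] - prev[3]
            prev_height = prev[3] - prev[1]
            starts_block = gap > gap_factor * prev_height
        if starts_block:
            blocks.append([])
        blocks[-1].append(box)
        prev = box
    return blocks


# Illustrative line rectangles (not taken from real output).
lines = [
    (56.76, 70.90, 524.30, 82.00),
    (56.76, 84.90, 520.00, 96.03),
    (56.76, 121.30, 503.40, 132.40),
    (56.76, 135.30, 480.00, 146.43),
]
grouped = group_lines_by_gap(lines)
print([len(b) for b in grouped])  # → [2, 2]
```

The threshold would need tuning per document, and multi-column layouts would require looking at horizontal positions as well.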

1 reply

JorjMcKie Aug 24, 2021
Maintainer

As a side note:
The XML output does contain blocks, lines and characters (no spans).
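
Recovering blocks from that output could look roughly like this. It is a sketch under assumptions: the element names block, line and char, and a `c` attribute holding the character, reflect my understanding of MuPDF's XML and should be checked against your version. `blocks_from_xml` is a hypothetical helper, and the sample string is hand-written rather than real output.

```python
import xml.etree.ElementTree as ET


def blocks_from_xml(xml_text):
    """Return a list of blocks, each a list of line strings.

    Assumes <block> elements containing <line> elements, with <char c="...">
    elements somewhere below each line. Hypothetical helper for illustration.
    """
    root = ET.fromstring(xml_text.strip())
    blocks = []
    for block in root.iter("block"):
        lines = []
        for line in block.iter("line"):
            # iter() also finds chars nested inside intermediate elements.
            lines.append("".join(ch.get("c", "") for ch in line.iter("char")))
        blocks.append(lines)
    return blocks


# Hand-written sample in the assumed layout; real output has more attributes.
sample_xml = """
<page id="page1">
<block>
<line><font name="F" size="11"><char c="H"/><char c="i"/></font></line>
<line><font name="F" size="11"><char c="y"/><char c="o"/></font></line>
</block>
<block>
<line><font name="F" size="11"><char c="O"/><char c="k"/></font></line>
</block>
</page>
"""
xml_blocks = blocks_from_xml(sample_xml)
print(xml_blocks)  # → [['Hi', 'yo'], ['Ok']]
```

Since this output has no spans, any span-level styling would have to be rebuilt from the per-character data.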
