Layout preserving text extraction #1131

JorjMcKie · 2021-07-08T11:24:06Z

JorjMcKie
Jul 8, 2021
Maintainer

There is a new script, fitzcli.py, which extracts document text in a layout-preserving way.
While this is new and certainly not bug-free, it produces quite encouraging results already.
Give it a try.

mlove4u · 2021-09-17T13:44:43Z

mlove4u
Sep 17, 2021

The URL is invalid.

1 reply

JorjMcKie Sep 18, 2021
Maintainer Author

thanks - corrected it.

bexnoss · 2021-10-22T00:43:23Z

bexnoss
Oct 22, 2021

This is great!

Would it make sense to add it to fitz.Page and add a clip rect? I'm currently adapting the page_layout like that because I already have rects of table cells and need to extract the text from them.

0 replies

JorjMcKie · 2021-10-22T06:58:27Z

JorjMcKie
Oct 22, 2021
Maintainer Author

@bexnoss - it is already part of PyMuPDF itself: python -m fitz gettext ... does the same thing.
It's in the documentation here.
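
For readers who want to drive that command from Python, here is a minimal sketch. It assumes the gettext subcommand accepts the -mode and -output options listed by python -m fitz gettext -h; build_gettext_cmd and run_gettext are hypothetical helper names, not part of PyMuPDF.

```python
import subprocess
import sys


def build_gettext_cmd(pdf_path, out_path, mode="layout"):
    """Build the argument list for `python -m fitz gettext` (hypothetical helper)."""
    return [sys.executable, "-m", "fitz", "gettext",
            "-mode", mode, "-output", out_path, pdf_path]


def run_gettext(pdf_path, out_path):
    """Run the extraction in a child process; raises CalledProcessError on failure."""
    subprocess.run(build_gettext_cmd(pdf_path, out_path), check=True)
```

Calling run_gettext("input.pdf", "output.txt") would then write the layout-preserving text of input.pdf to output.txt, provided PyMuPDF is installed in the same interpreter.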

5 replies

bexnoss Oct 22, 2021

Yeah, but that's for the whole page. My use case is that I've already found the location of a table cell on a page and am only interested in the text that is in that cell. I think it could be added as fitz.utils.get_text_layout.

JorjMcKie Oct 22, 2021
Maintainer Author

You can extract text from within a rectangle with text = page.get_textbox(rect), which is not layout-preserving, however.

bexnoss Oct 22, 2021

This is what I mean: (edit: this still ignores the x axis)

page_layout_clip

import bisect
import io
import statistics
from typing import Dict, List, Set

import fitz


def page_layout_clip(
    page: fitz.Page,
    clip: fitz.Rect,
    textout: io.BytesIO,
    GRID=2,
    fontsize=3,
    noformfeed=False,
    skip_empty=False,
    flags=fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE,
):
    eop = b"\n" if noformfeed else bytes([12])

    # --------------------------------------------------------------------
    def find_line_index(values: List[int], value: int) -> int:
        """Find the right row coordinate.

        Args:
            values: (list) y-coordinates of rows.
            value: (int) lookup for this value (y-origin of char).
        Returns:
            y-coordinate of appropriate line for value.
        """
        i = bisect.bisect_right(values, value)
        if i:
            return values[i - 1]
        raise RuntimeError("Line for %g not found in %s" % (value, values))

    # --------------------------------------------------------------------
    def curate_rows(rows: Set[int], GRID) -> List:
        rows = list(rows)
        rows.sort()  # sort ascending
        nrows = [rows[0]]
        for h in rows[1:]:
            if h >= nrows[-1] + GRID:  # only keep significant differences
                nrows.append(h)
        return nrows  # curated list of line bottom coordinates

    def process_blocks(blocks: List[Dict], page: fitz.Page):
        rows = set()
        page_width = clip.width
        page_height = clip.height
        rowheight = page_height
        left = page_width
        right = 0
        chars = []
        for block in blocks:
            for line in block["lines"]:
                if line["dir"] != (1, 0):  # ignore non-horizontal text
                    continue
                x0, y0, x1, y1 = line["bbox"]
                if y1 < clip.y0 or y0 > clip.y1:  # ignore if above or below the clip rect
                    continue
                # upd row height
                height = y1 - y0

                if rowheight > height:
                    rowheight = height
                for span in line["spans"]:
                    if span["size"] <= fontsize:
                        continue
                    for c in span["chars"]:
                        x0, _, x1, _ = c["bbox"]
                        cwidth = x1 - x0
                        ox, oy = c["origin"]
                        oy = int(round(oy))
                        rows.add(oy)
                        ch = c["c"]
                        if left > ox and ch != " ":
                            left = ox  # update left coordinate
                        if right < x1:
                            right = x1  # update right coordinate
                        # handle ligatures:
                        if cwidth == 0 and chars != []:  # potential ligature
                            old_ch, old_ox, old_oy, old_cwidth = chars[-1]
                            if old_oy == oy:  # ligature
                                if old_ch != chr(0xFB00):  # previous "ff" char lig?
                                    lig = joinligature(old_ch + ch)  # no
                                # convert to one of the 3-char ligatures:
                                elif ch == "i":
                                    lig = chr(0xFB03)  # "ffi"
                                elif ch == "l":
                                    lig = chr(0xFB04)  # "ffl"
                                else:  # something wrong, leave old char in place
                                    lig = old_ch
                                chars[-1] = (lig, old_ox, old_oy, old_cwidth)
                                continue
                        chars.append((ch, ox, oy, cwidth))  # all chars on page
        return chars, rows, left, right, rowheight

    def joinligature(lig: str) -> str:
        """Return ligature character for a given pair / triple of characters.

        Args:
            lig: (str) 2/3 characters, e.g. "ff"
        Returns:
            Ligature, e.g. "ff" -> chr(0xFB00)
        """

        if lig == "ff":
            return chr(0xFB00)
        elif lig == "fi":
            return chr(0xFB01)
        elif lig == "fl":
            return chr(0xFB02)
        elif lig == "ffi":
            return chr(0xFB03)
        elif lig == "ffl":
            return chr(0xFB04)
        elif lig == "ft":
            return chr(0xFB05)
        elif lig == "st":
            return chr(0xFB06)
        return lig

    # --------------------------------------------------------------------
    def make_textline(left, slot, minslot, lchars):
        """Produce the text of one output line.

        Args:
            left: (float) left most coordinate used on page
            slot: (float) avg width of one character in any font in use.
            minslot: (float) min width for the characters in this line.
            lchars: (list[tuple]) characters of this line.
        Returns:
            text: (str) text string for this line
        """
        text = ""  # we output this
        old_char = ""
        old_x1 = 0  # end coordinate of last char
        old_ox = 0  # x-origin of last char
        if minslot <= fitz.EPSILON:
            raise RuntimeError("program error: minslot too small = %g" % minslot)

        for c in lchars:  # loop over characters
            char, ox, _, cwidth = c
            ox = ox - left  # its (relative) start coordinate
            x1 = ox + cwidth  # ending coordinate

            # eliminate overprint effect
            if old_char == char and ox - old_ox <= cwidth * 0.2:
                continue

            # omit spaces overlapping previous char
            if char == " " and (old_x1 - ox) / cwidth > 0.8:
                continue

            old_char = char
            # close enough to previous?
            if ox < old_x1 + minslot:  # assume char adjacent to previous
                text += char  # append to output
                old_x1 = x1  # new end coord
                old_ox = ox  # new origin.x
                continue

            # else next char starts after some gap:
            # fill in right number of spaces, so char is positioned
            # in the right slot of the line
            if char == " ":  # rest relevant for non-space only
                continue
            delta = int(ox / slot) - len(text)
            if ox > old_x1 and delta > 1:
                text += " " * delta
            # now append char
            text += char
            old_x1 = x1  # new end coordinate
            old_ox = ox  # new origin
        return text.rstrip()

    # extract page text by single characters ("rawdict")
    blocks = page.get_text("rawdict", flags=flags)["blocks"]
    chars, rows, left, right, rowheight = process_blocks(blocks, page)

    if chars == []:
        if not skip_empty:
            textout.write(eop)  # write formfeed
        return
    # compute list of line coordinates - ignoring small (GRID) differences
    rows = curate_rows(rows, GRID)

    # sort all chars by x-coordinates, so every line will receive char info,
    # sorted from left to right.
    chars.sort(key=lambda c: c[1])

    # populate the lines with their char info
    lines = {}  # key: y1-coordinate, value: char list
    for c in chars:
        _, _, oy, _ = c
        y = find_line_index(rows, oy)  # y-coord of the right line
        lchars = lines.get(y, [])  # read line chars so far
        lchars.append(c)  # append this char
        lines[y] = lchars  # write back to line

    # ensure line coordinates are ascending
    keys = list(lines.keys())
    keys.sort()

    # -------------------------------------------------------------------------
    # Compute "char resolution" for the page: the char width corresponding to
    # 1 text char position on output - call it 'slot'.
    # For each line, compute median of its char widths. The minimum across all
    # lines is 'slot'.
    # The minimum char width of each line is used to determine if spaces must
    # be inserted in between two characters.
    # -------------------------------------------------------------------------
    slot = right - left
    minslots = {}
    for k in keys:
        lchars = lines[k]
        ccount = len(lchars)
        if ccount < 2:
            minslots[k] = 1
            continue
        widths = [c[3] for c in lchars]
        widths.sort()
        this_slot = statistics.median(widths)  # take median value
        if this_slot < slot:
            slot = this_slot
        minslots[k] = widths[0]

    # compute line advance in text output
    rowheight = rowheight * (rows[-1] - rows[0]) / (rowheight * len(rows)) * 1.2
    rowpos = rows[0]  # first line positioned here
    textout.write(b"\n")
    for k in keys:  # walk through the lines
        while rowpos < k:  # honor distance between lines
            textout.write(b"\n")
            rowpos += rowheight
        text = make_textline(left, slot, minslots[k], lines[k])
        textout.write((text + "\n").encode("utf8"))
        rowpos = k + rowheight

    textout.write(eop)  # write formfeed

This extracts the text with layout from a clip rect on the page. The only changes to the original are adding defaults to the arguments and adjusting process_blocks to use the clip rect instead of the whole page. I'm not familiar with the internals of this library, but the functionality is similar to the fitz.utils.get_text function, so I think it would be a good addition as fitz.utils.get_text_layout. Does that make sense? And if so, do you want a PR for that?
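
Regarding the x-axis caveat noted at the top of this comment: one possible way to honor it is to test each character's origin against all four clip edges before collecting it. The sketch below uses plain tuples in place of fitz.Rect so that it runs without PyMuPDF; origin_in_clip is a hypothetical helper and is not part of the function above.

```python
def origin_in_clip(origin, clip):
    """Return True if a character origin (x, y) lies inside clip (x0, y0, x1, y1)."""
    ox, oy = origin
    x0, y0, x1, y1 = clip
    return x0 <= ox <= x1 and y0 <= oy <= y1


clip = (100, 200, 300, 260)  # e.g. the rectangle of one table cell
chars = [
    {"c": "A", "origin": (120, 230)},  # inside the cell
    {"c": "B", "origin": (50, 230)},   # same row, but left of the cell
    {"c": "C", "origin": (120, 400)},  # same column, but below the cell
]
kept = [c["c"] for c in chars if origin_in_clip(c["origin"], clip)]
print(kept)
```

In process_blocks, such a check could sit right after the character origin is read, skipping characters that fall outside the clip.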

bexnoss Oct 22, 2021

You can extract text from within a rectangle with text = page.get_textbox(rect), which is not layout-preserving, however.

I haven't looked into what exactly is happening with get_textbox, but I'm seeing better results with get_text_blocks. get_textbox seems to skip text that get_text_blocks includes.

I also noticed that text must be contained, so I added a small padding to also get the text that only intersects. Is there a better way to get all text that intersects a rect?

JorjMcKie Oct 22, 2021
Maintainer Author

I also noticed that text must be contained, so I added a small padding to also get the text that only intersects. Is there a better way to get all text that intersects a rect?

You could use page.get_text("words") which delivers white-space-surrounded text pieces - each with its bbox: (x0, y0, x1, y1, "word", ...).
Filter that list for strings that intersect a given rectangle and you should be there.
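
A minimal sketch of that filtering step, using hand-made tuples shaped like the items of page.get_text("words") so it runs without a PDF. The helper names rects_intersect and words_intersecting are hypothetical, not PyMuPDF functions.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, each (x0, y0, x1, y1), share a non-empty area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_intersecting(words, rect):
    """Keep the text of word tuples whose bbox overlaps rect.

    words: tuples shaped like page.get_text("words") items,
           i.e. (x0, y0, x1, y1, "word", ...).
    """
    return [w[4] for w in words if rects_intersect(w[:4], rect)]


# Hand-made sample data standing in for page.get_text("words").
words = [
    (10, 10, 40, 20, "inside", 0, 0, 0),
    (45, 10, 80, 20, "straddling", 0, 0, 1),
    (90, 10, 120, 20, "outside", 0, 0, 2),
]
cell = (0, 0, 50, 30)
selected = words_intersecting(words, cell)
print(selected)
```

With real data, words would come from page.get_text("words") and cell from the coordinates of the rectangle of interest.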

The get_textbox() method doesn't give a damn whether or not it cuts right through a non-empty string - and indeed only includes fully contained characters.
On the other hand, you can influence this a bit by setting fitz.TOOLS.set_small_glyph_heights(True). This takes away extra space from above and below each character (many fonts like Helvetica have generous values for ascender and descender). Doing this makes the character (virtually) smaller for searching / inclusion checks, so chances are greater to see containments when using get_textbox().

canklot · 2023-02-02T21:19:06Z

canklot
Feb 2, 2023

Can we extract the text in layout preserving mode from another script without writing any files to disk?

1 reply

JorjMcKie Feb 3, 2023
Maintainer Author

You mean the fitz module, i.e. python -m fitz gettext ...?

JorjMcKie · 2023-02-03T09:50:01Z

JorjMcKie
Feb 3, 2023
Maintainer Author

@canklot Assuming you want to invoke the fitz module, please see the documentation:

How to use the module inside your script:

>>> import sys
>>> from fitz.__main__ import main as fitz_command
>>> cmd = "clean input.pdf output.pdf -pages 1,N".split()  # prepare command line
>>> saved_parms = sys.argv[1:]  # save original command line
>>> sys.argv[1:] = cmd  # store new command line
>>> fitz_command()  # execute module
>>> sys.argv[1:] = saved_parms  # restore original command line
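
The save / replace / restore pattern above can also be wrapped in a context manager, so the original command line is restored even if the command raises. This is only a sketch: patched_argv is a hypothetical helper, and fake_command stands in for the imported fitz_command so the example runs without PyMuPDF.

```python
import sys
from contextlib import contextmanager


@contextmanager
def patched_argv(args):
    """Temporarily replace sys.argv[1:] with args, restoring it afterwards."""
    saved = sys.argv[1:]
    sys.argv[1:] = list(args)
    try:
        yield
    finally:
        sys.argv[1:] = saved  # restored even if the command raises


def fake_command():
    """Stand-in for fitz.__main__.main: just reports the arguments it sees."""
    return sys.argv[1:]


before = sys.argv[1:]  # remember the real command line for comparison
with patched_argv("gettext -mode layout input.pdf".split()):
    seen = fake_command()
```

With PyMuPDF installed, fake_command() would be replaced by the fitz_command() call from the snippet above.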

0 replies

vignxs · 2023-12-21T14:26:59Z

vignxs
Dec 21, 2023

Will this work for scanned PDFs?

Can we apply OCR and then try this?
Thanks in advance

1 reply

JorjMcKie Dec 21, 2023
Maintainer Author

There is no built-in fallback to OCR in that module, nor any awareness of whether it is dealing with OCRed text or not.
So if you OCR before invoking this, you should be ok - apart from the usual mess that generally accompanies OCR.
