Getting Natural Reading Order from multi-column document #965

tristancatteeuw · 2021-03-23T16:24:00Z

Hello,

I am aware that there are probably a lot of discussions on this issue, but I didn't find exactly what I was searching for.

So I have a CV pdf file like the one below (I can't really upload the whole pdf as it has private data):

My goal is to split this document into different sections and gather the text from inside each one (I define sections based on the font size).

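For now I retrieve the text like this:

import fitz  # PyMuPDF

doc = fitz.open("cv.pdf")  # placeholder path
for page in doc:
    # flags=11 does not include image blocks, so every block has "lines"
    blocks = page.get_text("dict", flags=11)["blocks"]
    for b in blocks:
        for l in b["lines"]:
            for s in l["spans"]:
                # skip spans that contain neither a letter nor a digit
                if any(a.isdigit() or a.isalpha() for a in s["text"]):
                    print(s["text"])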

This method generally works well on most documents, but the problem here is that the header (Name, Email, Address) as well as the section names (Education and Experience) sit slightly further to the right than the rest of the text in the left column. So instead of getting the expected order, I get the main text of the Education and Experience sections, then the header and the section names, then the information in the right column.

I also tried adding the line blocks.sort(key=lambda block: block["bbox"][1]), which solves this problem, but then the right column of course gets mixed with the rest because the y axis is the only sorting criterion, and I don't want that either.

I'm having trouble finding an approach that keeps the best of both worlds and achieves my goal. Thanks in advance for your help!
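
The effect described above can be reproduced without a PDF. The following is only an illustrative sketch: the block dictionaries are hand-made stand-ins for the "blocks" list, the "name" key and the helper names are hypothetical, and split_x is an assumed column boundary that a real document may not offer.

```python
# Hand-made stand-ins for the "blocks" list; only "bbox" (x0, y0, x1, y1) matters.
blocks = [
    {"name": "education text", "bbox": (40, 120, 300, 200)},
    {"name": "header", "bbox": (50, 40, 300, 80)},
    {"name": "Education", "bbox": (50, 100, 300, 115)},
    {"name": "languages", "bbox": (350, 60, 550, 110)},
    {"name": "hobbies", "bbox": (350, 130, 550, 180)},
]


def sort_by_y(blocks):
    """Sort on the top edge only, like blocks.sort(key=lambda b: b["bbox"][1])."""
    return sorted(blocks, key=lambda b: b["bbox"][1])


def sort_by_column(blocks, split_x):
    """Left column first, then right column, each from top to bottom.

    split_x is an assumed x coordinate separating the two columns.
    """
    return sorted(blocks, key=lambda b: (b["bbox"][0] >= split_x, b["bbox"][1]))


y_order = [b["name"] for b in sort_by_y(blocks)]
column_order = [b["name"] for b in sort_by_column(blocks, split_x=320)]
print(y_order)       # right-column blocks are interleaved with the left column
print(column_order)  # each column is read from top to bottom
```

Sorting on y alone places "languages" between "header" and "Education", which is the mixing described above. The column-aware key only helps when a column boundary is known or can be estimated.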

JorjMcKie · 2021-03-23T23:51:31Z

Maintainer

As is typical in this type of situation, the problem is lack of information.
If you knew, for example, that everything herein

belongs to the same block, you wouldn't have a major problem, would you?

So you need to seek more information:
Maybe you know that all those headers like LANGUAGES, HOBBIES, PERSONALITY etc. can always be found in the same positions across all similar documents?
Or at least that they are always white text located inside rectangles of that blueish color?

You may not be aware that there is page.get_drawings(), which extracts things like those rectangles. So you may search the page for "LANGUAGES", ensure it is white text, and check that the list returned by page.get_drawings() contains a rectangle drawing item that encloses the bbox of "LANGUAGES" and has the right blue color.
If this can be confirmed, then that blue rectangle delivers the top, left and right borders for all listed languages.
There is also the clip parameter in method page.get_text(..., clip=rect), which delivers only the text inside clip.
You could also turn the logic around: walk through all blue rectangles of page.get_drawings() and extract any text they may contain via page.get_textbox(rect). If page.get_textbox(rect) == "LANGUAGES", then a rectangle containing "HOBBIES" would deliver the lower border of the languages information unit.
I hope I conveyed my point clearly enough ...
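
The search-based variant of this idea might look roughly like the sketch below. It assumes page is a PyMuPDF page, that each header sits inside its own drawn rectangle, and that the next header lies directly below in the same column. The helper names (contains, is_rectangle, header_box, section_text) are hypothetical, and the checks for white text and for the fill color are left out.

```python
def contains(outer, inner):
    """True if rectangle inner lies inside outer; both are (x0, y0, x1, y1)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def is_rectangle(path):
    """True if a get_drawings() entry contains at least one "re" item."""
    return any(item[0] == "re" for item in path["items"])


def header_box(page, header):
    """Return the bbox of the first drawn rectangle enclosing header, or None."""
    hits = page.search_for(header)  # one rectangle per occurrence of the text
    for path in page.get_drawings():
        if is_rectangle(path) and any(contains(path["rect"], h) for h in hits):
            return path["rect"]
    return None


def section_text(page, header, next_header):
    """Text between the box around header and the box around next_header."""
    top = header_box(page, header)
    bottom = header_box(page, next_header)
    if top is None or bottom is None:
        return None
    # Left, right and top borders come from the first box,
    # the lower border from the box of the following header.
    clip = (top[0], top[3], top[2], bottom[1])
    return page.get_text("text", clip=clip)
```

With a real page, section_text(page, "LANGUAGES", "HOBBIES") would return whatever text lies between the two header boxes, or None if either box cannot be found.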

4 replies

tristancatteeuw Mar 24, 2021
Author

Thank you for your help! I will definitely check out what I can do with the rectangles, as this seems useful. However, the documents are so different from one another that I don't think I can rely on color or placement. But generally a lot of documents have a big rectangle on the left or right column that would be useful to isolate.

I will experiment a bit and keep you updated.

JorjMcKie Mar 24, 2021
Maintainer

Good luck!
The rectangle fill color was just an idea - it would have helped to filter out the irrelevant ones.
In any case you can look at the sublist of page.get_drawings() that consists of rectangles only. Then look at the text each one contains - if any.
If, however, your documents can have an arbitrary format, things will stay very complex ...
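
As a rough sketch of that suggestion, assuming page is a PyMuPDF page: the function below keeps only the "re" (rectangle) items of page.get_drawings() and pairs each one with the text found inside it. The function name rectangles_with_text is hypothetical.

```python
def rectangles_with_text(page):
    """Pair each drawn rectangle with the text inside it, skipping empty ones."""
    result = []
    for path in page.get_drawings():
        for item in path["items"]:
            if item[0] != "re":  # keep rectangle items only
                continue
            rect = item[1]
            text = page.get_textbox(rect).strip()
            if text:
                result.append((rect, text))
    return result
```

The returned list can then be inspected to see which rectangles hold section headers and which enclose whole columns.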

tristancatteeuw Mar 24, 2021
Author

Any idea how I can avoid duplicates? When I get all the rectangles, there are a lot that seem to overlap (sometimes they contain exactly the same text and sometimes part of it) and thus I am getting the same section of text twice in the output.

JorjMcKie Mar 24, 2021
Maintainer

> Any idea how I can avoid duplicates?

No, that's up to your programming efforts. Document creators often do not focus on a slim document structure. So you will have to sort the rectangles yourself to make duplicates detectable, insert logic that keeps track of page areas from which text has already been extracted, and more of that kind ...
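
One simple way to start on this, shown only as a sketch: sort the rectangles by area and drop every rectangle that lies inside one already kept. This removes exact duplicates and fully nested rectangles. Partially overlapping rectangles are not handled and would need additional logic. The helper names (area, contains, drop_covered) are hypothetical, and rectangles are treated as (x0, y0, x1, y1) sequences.

```python
def area(r):
    """Area of a rectangle given as (x0, y0, x1, y1)."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])


def contains(outer, inner):
    """True if rectangle inner lies inside outer (equal rectangles count)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def drop_covered(rects):
    """Keep only rectangles that are not inside an already kept rectangle."""
    kept = []
    # Largest first, so a rectangle is always seen after any that contains it.
    for r in sorted(rects, key=area, reverse=True):
        if not any(contains(k, r) for k in kept):
            kept.append(r)
    # Back to a top-to-bottom, then left-to-right order.
    return sorted(kept, key=lambda r: (r[1], r[0]))
```

Extracting text only from the rectangles returned by drop_covered avoids reading the same page area twice in the nested and identical cases.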
