Getting Natural Reading Order from multi-column document #965
Asked by tristancatteeuw in Looking for help (unanswered)
Hello,
I am aware that there are probably a lot of discussions on this issue, but I didn't find exactly what I was looking for.
I have a CV in PDF format like the one below (I can't upload the whole PDF as it contains private data):
My goal is to split this document into different sections and gather the text from inside each one (I define sections based on the font size).
For now, I retrieve the text like this:
This method generally works well on most documents, but the problem here is that the header (Name, Email, Address) and the section names (Education and Experience) are positioned slightly further to the right than the rest of the text in the left column. So instead of getting the expected order, I get the main text of the Education and Experience sections first, then the header and the section names, then the information in the right column.
I also tried adding the line
blocks.sort(key=lambda block: block["bbox"][1])
which solves this problem, but then the right column of course gets mixed in with the rest, since the y coordinate is the only sorting criterion, and I don't want that either. I'm having trouble finding an approach that keeps the best of both worlds and achieves my goal. Thanks in advance for your help!
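To make the "best of both worlds" idea concrete, here is a minimal sketch of one possible approach, not a tested solution for every layout. It assumes each block is a dict whose "bbox" is (x0, y0, x1, y1), as in the sort line above. It first groups blocks into columns by their left edge, tolerating small indents, and then sorts by column and by y within each column. The names `reading_order` and `column_gap` are hypothetical, and the 50-unit threshold is a guess that would need tuning per document.

```python
from bisect import bisect_right


def reading_order(blocks, column_gap=50.0):
    """Sort blocks column by column, top to bottom within each column.

    Assumes block["bbox"] == (x0, y0, x1, y1). column_gap is a hypothetical
    threshold: left edges closer together than this count as one column.
    """
    if not blocks:
        return []

    # Distinct left edges in ascending order.
    lefts = sorted({block["bbox"][0] for block in blocks})

    # A new column starts wherever consecutive left edges jump by more than
    # column_gap; smaller jumps (e.g. indented headings) stay in the column.
    column_starts = [lefts[0]]
    for prev, cur in zip(lefts, lefts[1:]):
        if cur - prev > column_gap:
            column_starts.append(cur)

    def column_of(block):
        # Index of the last column start that is <= the block's left edge.
        return bisect_right(column_starts, block["bbox"][0]) - 1

    # Column first, then y (top to bottom), then x as a tie-breaker.
    return sorted(
        blocks,
        key=lambda b: (column_of(b), b["bbox"][1], b["bbox"][0]),
    )


# Made-up blocks mimicking the layout described above: headings at x0=48,
# body text at x0=40, right column at x0=320.
blocks = [
    {"bbox": (40, 120, 280, 200), "text": "Education details"},
    {"bbox": (48, 20, 280, 60), "text": "Name, Email, Address"},
    {"bbox": (320, 30, 560, 150), "text": "Right column"},
    {"bbox": (48, 90, 280, 110), "text": "Education"},
    {"bbox": (40, 250, 280, 330), "text": "Experience details"},
    {"bbox": (48, 220, 280, 240), "text": "Experience"},
]

for block in reading_order(blocks):
    print(block["text"])
```

With this grouping the slightly indented header and section names stay in the left column, and the right column comes after it instead of being interleaved. A block that spans the full page width would be assigned to the left column by its left edge, which may or may not be what is wanted.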