New Problem #748

Godlikemandyy · 2022-10-17T03:09:08Z

Godlikemandyy
Oct 17, 2022

@jsvine I have a new situation.
I solved the previous problem by adjusting y_tolerance, but I encountered the following problem with the new document.
The document format is as follows:

These documents use a two-column (left/right) layout. When I call extract_text() with the default y_tolerance, text ends up on the wrong lines. Some of the output looks like this:
` 第一章财务管理概论
　
使用会计云课堂或微信扫码快速做题对答案看解析
“ ” App “ ” 、、、
掌握解题思路开启轻松过关之旅
, 。
有利于企业资源的合理配置
一、单项选择题 C.
反映创造利润与投入资本之间的关系
D.

下列各项财务管理环节中与奖惩紧密
, 5. 下列各项措施中不能协调股东和经营
联系是贯彻责任制原则的要求也是 ,
, , 者的利益冲突的是
构建激励与约束机制的关键环节的是 (　　 )。
通过市场约束经营者
A.
(　　 )。通过债权人约束经营者
财务决策财务控制 B.
A. B.
给予经营者一定的股票期权
财务分析财务评价 C.
C. D.
解聘经营者
D.
`
When I set y_tolerance=7, the extracted text is as follows:

In other words, increasing y_tolerance fixes some cases where text that belongs on one line is split across two, but it also merges text from different lines into a single line, which scrambles the content.
I have tried several y_tolerance values, but none of them resolves the problem. How can I extract the text in roughly the same layout as the document? I would really appreciate a reply, thank you!

Godlikemandyy · 2022-10-17T03:11:44Z

Godlikemandyy
Oct 17, 2022
Author

Personally, I think the two-column layout makes the spacing between rows small, so increasing y_tolerance distorts the layout of the extracted content, but I don't know how to fix it.
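
The trade-off described here can be illustrated with a toy model. The sketch below is a simplified stand-in for tolerance-based line grouping, not pdfplumber's actual implementation; the function name `cluster_tops` and the y-positions are made up for illustration. It assumes two columns whose lines are 12 units apart, with the right column shifted down by 6 units.

```python
def cluster_tops(tops, tolerance):
    """Group y-positions into lines.

    A value joins the current group when it lies within `tolerance`
    of the previous (sorted) value; otherwise it starts a new group.
    """
    groups = []
    for top in sorted(tops):
        if groups and top - groups[-1][-1] <= tolerance:
            groups[-1].append(top)
        else:
            groups.append([top])
    return groups


# Hypothetical y-positions: same line spacing, different vertical offset.
left_column = [100, 112, 124]
right_column = [106, 118, 130]
whole_page = left_column + right_column

# Small tolerance: left and right lines interleave as six separate lines.
small = cluster_tops(whole_page, tolerance=3)

# Large tolerance: neighbouring values chain together into one giant line.
large = cluster_tops(whole_page, tolerance=7)

# One column on its own: three clean lines with either tolerance.
single = cluster_tops(left_column, tolerance=7)

print(len(small), len(large), len(single))
```

In this model no single tolerance gives the right grouping for the whole page, while each column on its own groups cleanly.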

jsvine · 2022-10-25T21:47:48Z

jsvine
Oct 25, 2022
Maintainer

Hi @Godlikemandyy, I have converted this to a discussion rather than an issue, since this is specific to a particular PDF. I believe what is happening is that the y-positions have a different offset in the right vs. left columns. For that reason, I'd suggest using Page.crop(...) to split the PDF in half (one half for each column), and then extracting the text from each half individually.
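
A minimal sketch of that suggestion follows. It assumes the two columns meet at the horizontal midpoint of the page, which may not hold for every document, so the `split` fraction is adjustable. The helper names (`column_bboxes`, `extract_two_columns`, `extract_pdf`) are hypothetical; the only pdfplumber features relied on are `pdfplumber.open`, `Page.bbox`, `Page.crop`, and `Page.extract_text`.

```python
def column_bboxes(page_bbox, split=0.5):
    """Return (left_bbox, right_bbox) for a page bounding box.

    `page_bbox` is (x0, top, x1, bottom); `split` is the fraction of the
    page width where the columns are assumed to meet.
    """
    x0, top, x1, bottom = page_bbox
    mid = x0 + (x1 - x0) * split
    return (x0, top, mid, bottom), (mid, top, x1, bottom)


def extract_two_columns(page, split=0.5, **text_kwargs):
    """Extract the left column, then the right column, of one page.

    Extra keyword arguments (for example y_tolerance) are passed
    through to extract_text().
    """
    parts = []
    for bbox in column_bboxes(page.bbox, split):
        # extract_text() can return an empty result for a blank region.
        parts.append(page.crop(bbox).extract_text(**text_kwargs) or "")
    return "\n".join(parts)


def extract_pdf(path, split=0.5, **text_kwargs):
    """Apply extract_two_columns() to every page of the PDF at `path`."""
    import pdfplumber  # imported here so the helpers above stay dependency-free

    with pdfplumber.open(path) as pdf:
        return [extract_two_columns(p, split, **text_kwargs) for p in pdf.pages]
```

Usage would look like `pages = extract_pdf("document.pdf", y_tolerance=3)`, where the file name is a placeholder. If the gutter between the columns is not at the midpoint, adjusting `split` (or deriving the boundary from the word positions) is likely necessary.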
