New Problem #748
Replies: 2 comments
-
Personally, I think the left/right two-column layout is what makes the spacing between rows so small. Increasing y_tolerance changes the layout of the extracted content, but I don't know how to fix it.
-
Hi @Godlikemandyy, I have converted this to a discussion rather than an issue, since this is specific to a particular PDF. I believe what is happening is that the y-positions have a different offset in the right vs. left columns. For that reason, I'd suggest using |
-
@jsvine I have a new situation.
I solved the previous problem by adjusting y_tolerance, but I encountered the following problem with the new document.
The document format is as follows:
These documents are laid out in left and right columns. When I used extract_text() with the default y_tolerance, text ended up on the wrong lines, and some of the output looked like this:
` 第一章 财务管理概论
使用 会计云课堂 或 微信 扫码快速做题 对答案 看解析
“ ” App “ ” 、 、 、
掌握解题思路 开启轻松过关之旅
, 。
有利于企业资源的合理配置
一、 单项选择题 C.
反映创造利润与投入资本之间的关系
D.
, 5. 下列各项措施中 不能协调股东和经营
联系 是贯彻责任制原则的要求 也是 ,
, , 者的利益冲突的是
构建激励与约束机制的关键环节的是 ( )。
通过市场约束经营者
A.
( )。 通过债权人约束经营者
财务决策 财务控制 B.
A. B.
给予经营者一定的股票期权
财务分析 财务评价 C.
C. D.
解聘经营者
D.
`
When I set the parameter y_tolerance=7, the extracted text is as follows:
In other words, when I increase y_tolerance, text that was wrongly split across two lines is merged back into one line, but text from genuinely different lines is also merged into the same line, which scrambles the content.
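To make that trade-off concrete, here is a minimal sketch in plain Python. It is a simplified model of tolerance-based line grouping, not pdfplumber's actual implementation, and the word positions are made-up values: two columns whose rows are 10 units apart, with the right column shifted down by 4.

```python
# Simplified model (an assumption, not pdfplumber's code): words whose
# `top` values chain together within `y_tolerance` end up on one line.
def group_lines(words, y_tolerance):
    lines = []
    last_top = None
    for word in sorted(words, key=lambda w: w["top"]):
        # Start a new line when the vertical gap exceeds the tolerance.
        if last_top is None or word["top"] - last_top > y_tolerance:
            lines.append([])
        lines[-1].append(word)
        last_top = word["top"]
    # Within each line, order words from left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]

# Hypothetical positions for illustration only.
words = [
    {"text": "left-1", "x0": 50, "top": 100.0},
    {"text": "left-2", "x0": 50, "top": 110.0},
    {"text": "right-1", "x0": 320, "top": 104.0},
    {"text": "right-2", "x0": 320, "top": 114.0},
]

small = group_lines(words, y_tolerance=3)  # columns interleave line by line
large = group_lines(words, y_tolerance=7)  # everything chains into one line
print(small)
print(large)
```

With the small tolerance the two columns interleave on separate lines. With the larger one the gaps of 4 and 6 all fall within the tolerance, so every row chains into a single line. No single value separates the rows correctly as long as both columns are grouped together.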
I have tried several values of y_tolerance, but none of them resolves the problem. How can I extract the text in roughly the same layout as the document? I would really appreciate a reply, thank you!
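One possible direction, offered as a sketch rather than as the approach recommended in this thread, is to group lines separately for each column so that the vertical offset between columns no longer matters. The example below uses plain Python with made-up word positions; `split_x` is a hypothetical column boundary that would have to be chosen for the real page.

```python
# Simplified model (an assumption, not pdfplumber's code): group words
# into lines by chaining `top` values that lie within `y_tolerance`.
def group_lines(words, y_tolerance):
    lines = []
    last_top = None
    for word in sorted(words, key=lambda w: w["top"]):
        if last_top is None or word["top"] - last_top > y_tolerance:
            lines.append([])
        lines[-1].append(word)
        last_top = word["top"]
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]

def lines_by_column(words, split_x, y_tolerance):
    # Split on the hypothetical boundary `split_x`, then group each
    # column on its own and read the left column before the right one.
    left = [w for w in words if w["x0"] < split_x]
    right = [w for w in words if w["x0"] >= split_x]
    return group_lines(left, y_tolerance) + group_lines(right, y_tolerance)

# Hypothetical positions: rows 10 apart, right column shifted down by 4.
words = [
    {"text": "left-1", "x0": 50, "top": 100.0},
    {"text": "left-2", "x0": 50, "top": 110.0},
    {"text": "right-1", "x0": 320, "top": 104.0},
    {"text": "right-2", "x0": 320, "top": 114.0},
]

result = lines_by_column(words, split_x=300, y_tolerance=3)
print(result)
```

In pdfplumber itself, a rough equivalent would be to crop each half of the page and extract it separately, for example `page.crop((0, 0, page.width / 2, page.height)).extract_text()` for the left column and the matching bounding box for the right. Using the page midpoint as the boundary is an assumption that may not hold for every document.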