From 237c062de4f0d96c355e5f9b68d3781a5f2e4f7d Mon Sep 17 00:00:00 2001
From: myhloli
Date: Wed, 16 Oct 2024 10:16:49 +0800
Subject: [PATCH] docs: enhance document parsing capabilities

- Improve reading order with model-based sorting
- Add list recognition within text
- Implement table of contents recognition
- Support table recognition
- Enhance code block and geometric shape recognition
- Address known issues in both English and Chinese READMEs
---
 README.md       | 14 ++++++++------
 README_zh-CN.md | 12 +++++++-----
 2 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 3e4fd746..6d296423 100644
--- a/README.md
+++ b/README.md
@@ -339,19 +339,21 @@ TODO
 
 # TODO
 
-- [x] Semantic-based reading order
-- [ ] List recognition within the text
+- [x] Model-based reading order
+- [x] List recognition within the text
 - [ ] Code block recognition within the text
-- [ ] Table of contents recognition
+- [x] Table of contents recognition
 - [x] Table recognition
 - [ ] [Chemical formula recognition](docs/chemical_knowledge_introduction/introduction.pdf)
 - [ ] Geometric shape recognition
 
 # Known Issues
 
-- Reading order is segmented based on rules, which can cause disordered sequences in some cases
-- Vertical text is not supported
-- Lists, code blocks, and table of contents are not yet supported in the layout model
+- Reading order is based on the model's sorting of text distribution in space, which may become disordered under extremely complex layouts.
+- Vertical text is not supported.
+- Tables of contents and lists are recognized through rules; a few uncommon list formats may not be identified.
+- Only one level of headings is supported; hierarchical heading levels are currently not supported.
+- Code blocks are not yet supported in the layout model.
 - Comic books, art books, elementary school textbooks, and exercise books are not well-parsed yet
 - Enabling OCR may produce better results in PDFs with a high density of formulas
 - If you are processing PDFs with a large number of formulas, it is strongly recommended to enable the OCR function. When using PyMuPDF to extract text, overlapping text lines can occur, leading to inaccurate formula insertion positions.
diff --git a/README_zh-CN.md b/README_zh-CN.md
index 3eef92e4..9e42ca9f 100644
--- a/README_zh-CN.md
+++ b/README_zh-CN.md
@@ -341,19 +341,21 @@ TODO
 
 # TODO
 
-- [x] 基于语义的阅读顺序
-- [ ] 正文中列表识别
+- [x] 基于模型的阅读顺序
+- [x] 正文中列表识别
 - [ ] 正文中代码块识别
-- [ ] 目录识别
+- [x] 目录识别
 - [x] 表格识别
 - [ ] [化学式识别](docs/chemical_knowledge_introduction/introduction.pdf)
 - [ ] 几何图形识别
 
 # Known Issues
 
-- 阅读顺序基于规则的分割,在一些情况下会乱序
+- 阅读顺序基于模型对文本在空间中的分布进行排序,在极端复杂的排版下可能会乱序
 - 不支持竖排文字
-- 列表、代码块、目录在layout模型里还没有支持
+- 目录和列表通过规则进行识别,少部分不常见的列表形式可能无法识别
+- 标题只有一级,目前不支持标题分级
+- 代码块在layout模型里还没有支持
 - 漫画书、艺术图册、小学教材、习题尚不能很好解析
 - 在一些公式密集的PDF上强制启用OCR效果会更好
 - 如果您要处理包含大量公式的pdf,强烈建议开启OCR功能。使用pymuPDF提取文字的时候会出现文本行互相重叠的情况导致公式插入位置不准确。