
Converts PDFs (mainly Chinese ones) to Markdown using the fitz package and PaddleOCR.


InsaneGe/pdf2md


Types of PDFs

PDF files fall into two types: text-based (for example, generated by word processing software) and image-based (for example, generated by scanning paper documents).

  • Text-based PDF: contains real text information. Each character is encoded and has a definite position, font, and other attributes, so you can access and manipulate the text data directly.
  • Image-based PDF: generated by scanning paper documents and saved as images. It has no independent text information, so you need OCR to extract text from the images.
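A minimal sketch of how the two types could be told apart with fitz, assuming a page counts as text-based when its text layer has at least a few non-whitespace characters. The function names and the `min_chars` threshold are hypothetical and are not taken from this repository.

```python
def classify_page(text, min_chars=20):
    """Return 'text' if the extracted text layer looks usable, else 'image'.

    min_chars is an assumed threshold, not a value used by this project.
    """
    visible = "".join(text.split())  # drop all whitespace before counting
    return "text" if len(visible) >= min_chars else "image"


def classify_pdf(path, min_chars=20):
    """Classify every page of the PDF at `path` (requires PyMuPDF)."""
    import fitz  # imported lazily so classify_page works without PyMuPDF

    with fitz.open(path) as doc:
        return [classify_page(page.get_text(), min_chars) for page in doc]
```

Pages classified as `image` are the ones that would need OCR. A scanned page can still carry a short text layer (for example a watermark), so the threshold may need tuning for your documents.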

Dependencies

pip install -r requirements.txt

If you have a GPU, you can install the GPU version of PaddlePaddle instead. The right build depends on your environment; see https://www.paddlepaddle.org.cn/documentation/docs/zh/install/pip/windows-pip.html

pip install paddlepaddle-gpu==2.6.1 -i https://mirror.baidu.com/pypi/simple

Run

python start.py -h

  • -type: file, dir, or photo
  • -f: path to a single PDF file
  • -d: path to a directory of PDFs
  • -p: path to a single photo
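The options above could be parsed roughly as follows. This is an assumed sketch built only from the option list, not the actual contents of start.py; `build_parser` and `resolve_input` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented options (assumed shape)."""
    parser = argparse.ArgumentParser(description="Convert PDFs or photos to Markdown")
    parser.add_argument("-type", choices=["file", "dir", "photo"], required=True,
                        help="kind of input: file, dir, or photo")
    parser.add_argument("-f", help="path to a single PDF file")
    parser.add_argument("-d", help="path to a directory of PDFs")
    parser.add_argument("-p", help="path to a single photo")
    return parser


def resolve_input(args):
    """Return the path that matches -type, or raise if it was not given."""
    path = {"file": args.f, "dir": args.d, "photo": args.p}[args.type]
    if path is None:
        raise ValueError("missing path option for -type " + args.type)
    return path
```

With a parser like this, an invocation such as `python start.py -type file -f ./sample.pdf` would resolve to `./sample.pdf`. Check `python start.py -h` for the real behaviour.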

Limits

It is suited to Chinese PDFs with a single-column layout and horizontal typography, without formulas or matrices.

Why Chinese? Because PaddleOCR recognizes Chinese better than other languages.

If you want to convert English PDFs to Markdown, I recommend marker. If a paid service is acceptable, I recommend mathpix.

Time

Depending on what a page contains, parsing takes roughly 7-12 seconds per page.
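As a rough back-of-the-envelope estimate, the per-page range can be scaled to a whole document. The helper below is a hypothetical illustration; it assumes pages are processed one after another and that the 7-12 second range holds on your hardware.

```python
def estimate_seconds(pages, low=7.0, high=12.0):
    """Return (best_case, worst_case) total seconds for `pages` pages."""
    return pages * low, pages * high


best, worst = estimate_seconds(100)
print(f"100 pages: about {best / 60:.0f} to {worst / 60:.0f} minutes")
```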

Disadvantages

  • It can't handle PDFs with vertical typography like the one below. In my testing, mathpix also did a poor job of recognizing this type of PDF.

origin2

  • PaddleOCR recognizes low-clarity PDFs incompletely.

For example, the red circle marks the area that PaddleOCR failed to recognize:

origin1

The result of parsing the page above:

res1

  • PPStructure only supports English and Chinese. The layout labels it recognizes for Chinese are text, title, figure, figure_caption, table, table_caption, header, footer, reference, and equation. These labels only classify page regions; it does not recognize the contents of mathematical formulas, matrices, and so on.

You can see this in the PPStructure source code:

'PP-StructureV2': {
            'layout': {
                'en': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt'
                },
                'ch': {
                    'url':
                    'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar',
                    'dict_path':
                    'ppocr/utils/dict/layout_dict/layout_cdla_dict.txt'
                }
            }

The content of layout_cdla_dict.txt is as follows:

text
title
figure
figure_caption
table
table_caption
header
footer
reference
equation
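To illustrate what these labels mean for the Markdown output, here is a minimal sketch that maps labelled regions to Markdown. It assumes each region has already been flattened to a dict with a `type` label and a `text` string. That shape, the function name, and the choice to skip headers and footers are hypothetical and not taken from this repository or from the PPStructure API.

```python
CDLA_LABELS = (
    "text", "title", "figure", "figure_caption", "table",
    "table_caption", "header", "footer", "reference", "equation",
)
SKIPPED = {"header", "footer"}  # assumed: page furniture is dropped
UNSUPPORTED = {"equation"}      # located by layout analysis, content not recognized


def regions_to_markdown(regions):
    """Join labelled regions into Markdown, flagging unsupported ones."""
    parts = []
    for region in regions:
        label = str(region.get("type", "")).lower()
        text = str(region.get("text", "")).strip()
        if label not in CDLA_LABELS:
            raise ValueError("unknown layout label: " + repr(label))
        if label in SKIPPED:
            continue
        if label in UNSUPPORTED:
            parts.append("<!-- equation region: content not recognized -->")
        elif label == "title":
            parts.append("## " + text)
        elif text:
            parts.append(text)
    return "\n\n".join(parts)
```

An equation region ends up as a placeholder comment, which reflects the limitation described above: the layout model can locate the region, but its contents are not turned into usable formula markup.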
