diff --git a/.readthedocs.yaml b/.readthedocs.yaml new file mode 100644 index 00000000..974d040f --- /dev/null +++ b/.readthedocs.yaml @@ -0,0 +1,16 @@ +version: 2 + +build: + os: ubuntu-22.04 + tools: + python: "3.10" + +formats: + - epub + +python: + install: + - requirements: docs/zh_cn/requirements.txt + +sphinx: + configuration: docs/zh_cn/conf.py diff --git a/docs/README_Ubuntu_CUDA_Acceleration_en_US.md b/docs/README_Ubuntu_CUDA_Acceleration_en_US.md deleted file mode 100644 index 3c456d79..00000000 --- a/docs/README_Ubuntu_CUDA_Acceleration_en_US.md +++ /dev/null @@ -1,101 +0,0 @@ - -# Ubuntu 22.04 LTS - -### 1. Check if NVIDIA Drivers Are Installed - ```sh - nvidia-smi - ``` - If you see information similar to the following, it means that the NVIDIA drivers are already installed, and you can skip Step 2. - ```plaintext - +---------------------------------------------------------------------------------------+ - | NVIDIA-SMI 537.34 Driver Version: 537.34 CUDA Version: 12.2 | - |-----------------------------------------+----------------------+----------------------+ - | GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC | - | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | - | | | MIG M. | - |=========================================+======================+======================| - | 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:01:00.0 On | N/A | - | 0% 51C P8 12W / 200W | 1489MiB / 8192MiB | 5% Default | - | | | N/A | - +-----------------------------------------+----------------------+----------------------+ - ``` - -### 2. Install the Driver - If no driver is installed, use the following command: - ```sh - sudo apt-get update - sudo apt-get install nvidia-driver-545 - ``` - Install the proprietary driver and restart your computer after installation. - ```sh - reboot - ``` - -### 3. Install Anaconda - If Anaconda is already installed, skip this step. 
- ```sh - wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh - bash Anaconda3-2024.06-1-Linux-x86_64.sh - ``` - In the final step, enter `yes`, close the terminal, and reopen it. - -### 4. Create an Environment Using Conda - Specify Python version 3.10. - ```sh - conda create -n MinerU python=3.10 - conda activate MinerU - ``` - -### 5. Install Applications - ```sh - pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com - ``` -❗ After installation, make sure to check the version of `magic-pdf` using the following command: - ```sh - magic-pdf --version - ``` - If the version number is less than 0.7.0, please report the issue. - -### 6. Download Models - Refer to detailed instructions on [how to download model files](how_to_download_models_en.md). - -### 7. Understand the Location of the Configuration File - -After completing the [6. Download Models](#6-download-models) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path. -You can find the `magic-pdf.json` file in your user directory. -> The user directory for Linux is "/home/username". - -### 8. First Run - Download a sample file from the repository and test it. - ```sh - wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf - magic-pdf -p small_ocr.pdf - ``` - -### 9. Test CUDA Acceleration - -If your graphics card has at least **8GB** of VRAM, follow these steps to test CUDA acceleration: - -> ❗ 8GB of VRAM is barely enough to run this application, so close all other programs that use VRAM to ensure the full 8GB is free while it runs. - -1. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file located in your home directory. - ```json - { - "device-mode": "cuda" - } - ``` -2. Test CUDA acceleration with the following command: - ```sh - magic-pdf -p small_ocr.pdf - ``` - -### 10. 
Enable CUDA Acceleration for OCR - -1. Download `paddlepaddle-gpu`. Installation will automatically enable OCR acceleration. - ```sh - python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ - ``` -2. Test OCR acceleration with the following command: - ```sh - magic-pdf -p small_ocr.pdf - ``` diff --git a/docs/README_Windows_CUDA_Acceleration_en_US.md b/docs/README_Windows_CUDA_Acceleration_en_US.md deleted file mode 100644 index d39c6c3b..00000000 --- a/docs/README_Windows_CUDA_Acceleration_en_US.md +++ /dev/null @@ -1,82 +0,0 @@ -# Windows 10/11 - -### 1. Install CUDA and cuDNN -Required versions: CUDA 11.8 + cuDNN 8.7.0 - - CUDA 11.8: https://developer.nvidia.com/cuda-11-8-0-download-archive - - cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x: https://developer.nvidia.com/rdp/cudnn-archive - -### 2. Install Anaconda - If Anaconda is already installed, you can skip this step. - -Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86_64.exe - -### 3. Create an Environment Using Conda - Python version must be 3.10. - ``` - conda create -n MinerU python=3.10 - conda activate MinerU - ``` - -### 4. Install Applications - ``` - pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com - ``` - >❗️After installation, verify the version of `magic-pdf`: - > ```bash - > magic-pdf --version - > ``` - > If the version number is less than 0.7.0, please report it in the issues section. - -### 5. Download Models - Refer to detailed instructions on [how to download model files](how_to_download_models_en.md). - -### 6. Understand the Location of the Configuration File - -After completing the [5. Download Models](#5-download-models) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path. -You can find the `magic-pdf.json` file in your user directory. -> The user directory for Windows is "C:/Users/username". 
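The lookup described above (a `magic-pdf.json` in the user directory carrying a `"device-mode"` key) can be sketched in Python. This is a minimal illustration, not part of `magic-pdf` itself, which reads the file internally; the `read_device_mode` helper and its `config_dir` parameter are assumptions made for the example.

```python
import json
import os


def read_device_mode(config_dir=None, config_name="magic-pdf.json"):
    """Return the configured device mode, or None if the config is absent.

    Hypothetical helper for illustration only. config_dir defaults to the
    user directory described above ("C:/Users/username" on Windows,
    "/home/username" on Linux).
    """
    config_dir = config_dir or os.path.expanduser("~")
    config_path = os.path.join(config_dir, config_name)
    if not os.path.exists(config_path):
        # The file is generated only after the model download step has run.
        return None
    with open(config_path, encoding="utf-8") as f:
        return json.load(f).get("device-mode", "cpu")
```

Switching between CPU and CUDA parsing is then just a matter of editing that one key, as the steps below do.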
- -### 7. First Run - Download a sample file from the repository and test it. - ```powershell - wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf - magic-pdf -p small_ocr.pdf - ``` - -### 8. Test CUDA Acceleration - If your graphics card has at least 8GB of VRAM, follow these steps to test CUDA-accelerated parsing performance. - -> ❗ 8GB of VRAM is barely enough to run this application, so close all other programs that use VRAM to ensure the full 8GB is free while it runs. - - 1. **Reinstall torch and torchvision** with CUDA support. - ``` - pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118 - ``` - >❗️Ensure the following versions are specified in the command: - >``` - > torch==2.3.1 torchvision==0.18.1 - >``` - >These are the highest versions we support. Installing higher versions without pinning them will cause the program to fail. - 2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory. - - ```json - { - "device-mode": "cuda" - } - ``` - 3. **Run the following command to test CUDA acceleration**: - - ``` - magic-pdf -p small_ocr.pdf - ``` - -### 9. Enable CUDA Acceleration for OCR - -1. **Download paddlepaddle-gpu**, which will automatically enable OCR acceleration upon installation. - ``` - pip install paddlepaddle-gpu==2.6.1 - ``` -2. 
**Run the following command to test OCR acceleration**: - ``` - magic-pdf -p small_ocr.pdf - ``` diff --git a/docs/en/.readthedocs.yaml b/docs/en/.readthedocs.yaml new file mode 100644 index 00000000..7e3312e4 --- /dev/null +++ b/docs/en/.readthedocs.yaml @@ -0,0 +1,16 @@ +version: 2 + +build: + os: ubuntu-22.04 + tools: + python: "3.10" + +formats: + - epub + +python: + install: + - requirements: docs/requirements.txt + +sphinx: + configuration: docs/en/conf.py diff --git a/docs/en/Makefile b/docs/en/Makefile new file mode 100644 index 00000000..d4bb2cbb --- /dev/null +++ b/docs/en/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = . +BUILDDIR = _build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/images/MinerU-logo.png b/docs/en/_static/image/logo.png similarity index 100% rename from docs/images/MinerU-logo.png rename to docs/en/_static/image/logo.png diff --git a/docs/en/conf.py b/docs/en/conf.py new file mode 100644 index 00000000..c3fc7f2f --- /dev/null +++ b/docs/en/conf.py @@ -0,0 +1,122 @@ +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. 
For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. + +import os +import subprocess +import sys + +from sphinx.ext import autodoc + + +def install(package): + subprocess.check_call([sys.executable, '-m', 'pip', 'install', package]) + + +requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt')) +if os.path.exists(requirements_path): + with open(requirements_path) as f: + packages = f.readlines() + for package in packages: + install(package.strip()) + +sys.path.insert(0, os.path.abspath('../..')) + +# -- Project information ----------------------------------------------------- + +project = 'MinerU' +copyright = '2024, MinerU Contributors' +author = 'OpenDataLab' + +# The full version, including alpha/beta/rc tags +version_file = '../../magic_pdf/libs/version.py' +with open(version_file) as f: + exec(compile(f.read(), version_file, 'exec')) +__version__ = locals()['__version__'] +# The short X.Y version +version = __version__ +# The full version, including alpha/beta/rc tags +release = __version__ + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + 'sphinx.ext.napoleon', + 'sphinx.ext.viewcode', + 'sphinx.ext.intersphinx', + 'sphinx_copybutton', + 'sphinx.ext.autodoc', + 'sphinx.ext.autosummary', + 'myst_parser', + 'sphinxarg.ext', +] + +# Add any paths that contain templates here, relative to this directory. 
+templates_path = ['_templates'] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. +exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] + +# Exclude the prompt "$" when copying code +copybutton_prompt_text = r'\$ ' +copybutton_prompt_is_regexp = True + +language = 'en' + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = 'sphinx_book_theme' +html_logo = '_static/image/logo.png' +html_theme_options = { + 'path_to_docs': 'docs/en', + 'repository_url': 'https://github.com/opendatalab/MinerU', + 'use_repository_button': True, +} +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +# html_static_path = ['_static'] + +# Mock out external dependencies here. +autodoc_mock_imports = [ + 'cpuinfo', + 'torch', + 'transformers', + 'psutil', + 'prometheus_client', + 'sentencepiece', + 'vllm.cuda_utils', + 'vllm._C', + 'numpy', + 'tqdm', +] + + +class MockedClassDocumenter(autodoc.ClassDocumenter): + """Remove note about base class when a class is derived from object.""" + + def add_line(self, line: str, source: str, *lineno: int) -> None: + if line == ' Bases: :py:class:`object`': + return + super().add_line(line, source, *lineno) + + +autodoc.ClassDocumenter = MockedClassDocumenter + +navigation_with_keys = False diff --git a/docs/en/index.rst b/docs/en/index.rst new file mode 100644 index 00000000..d275dde7 --- /dev/null +++ b/docs/en/index.rst @@ -0,0 +1,26 @@ +.. xtuner documentation master file, created by + sphinx-quickstart on Tue Jan 9 16:33:06 2024. 
+ You can adapt this file completely to your liking, but it should at least + contain the root `toctree` directive. + +Welcome to the MinerU Documentation +============================================== + +.. figure:: ./_static/image/logo.png + :align: center + :alt: mineru + :class: no-scaled-link + +.. raw:: html + +
+   <p style="text-align:center">
+   <strong>A one-stop, open-source, high-quality data extraction tool</strong>
+   </p>
+
+   <p style="text-align:center">
+   <a href="https://github.com/opendatalab/MinerU">Star</a>
+   <a href="https://github.com/opendatalab/MinerU/subscribe">Watch</a>
+   <a href="https://github.com/opendatalab/MinerU/fork">Fork</a>
+   </p>
diff --git a/docs/en/make.bat b/docs/en/make.bat new file mode 100644 index 00000000..954237b9 --- /dev/null +++ b/docs/en/make.bat @@ -0,0 +1,35 @@ +@ECHO OFF + +pushd %~dp0 + +REM Command file for Sphinx documentation + +if "%SPHINXBUILD%" == "" ( + set SPHINXBUILD=sphinx-build +) +set SOURCEDIR=. +set BUILDDIR=_build + +%SPHINXBUILD% >NUL 2>NUL +if errorlevel 9009 ( + echo. + echo.The 'sphinx-build' command was not found. Make sure you have Sphinx + echo.installed, then set the SPHINXBUILD environment variable to point + echo.to the full path of the 'sphinx-build' executable. Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.https://www.sphinx-doc.org/ + exit /b 1 +) + +if "%1" == "" goto help + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% + +:end +popd diff --git a/docs/requirements.txt b/docs/requirements.txt new file mode 100644 index 00000000..ec2c6032 --- /dev/null +++ b/docs/requirements.txt @@ -0,0 +1,6 @@ +myst-parser +sphinx +sphinx-argparse +sphinx-book-theme +sphinx-copybutton +sphinx_rtd_theme diff --git a/docs/zh_cn/.readthedocs.yaml b/docs/zh_cn/.readthedocs.yaml new file mode 100644 index 00000000..1f93a4d7 --- /dev/null +++ b/docs/zh_cn/.readthedocs.yaml @@ -0,0 +1,16 @@ +version: 2 + +build: + os: ubuntu-22.04 + tools: + python: "3.10" + +formats: + - epub + +python: + install: + - requirements: docs/requirements.txt + +sphinx: + configuration: docs/zh_cn/conf.py diff --git a/docs/zh_cn/Makefile b/docs/zh_cn/Makefile new file mode 100644 index 00000000..d4bb2cbb --- /dev/null +++ b/docs/zh_cn/Makefile @@ -0,0 +1,20 @@ +# Minimal makefile for Sphinx documentation +# + +# You can set these variables from the command line, and also +# from the environment for the first two. +SPHINXOPTS ?= +SPHINXBUILD ?= sphinx-build +SOURCEDIR = . 
+BUILDDIR = _build + +# Put it first so that "make" without argument is like "make help". +help: + @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) + +.PHONY: help Makefile + +# Catch-all target: route all unknown targets to Sphinx using the new +# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). +%: Makefile + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) diff --git a/docs/zh_cn/_static/image/logo.png b/docs/zh_cn/_static/image/logo.png new file mode 100644 index 00000000..2e6fdf3a Binary files /dev/null and b/docs/zh_cn/_static/image/logo.png differ diff --git a/docs/zh_cn/conf.py b/docs/zh_cn/conf.py new file mode 100644 index 00000000..4eccd497 --- /dev/null +++ b/docs/zh_cn/conf.py @@ -0,0 +1,122 @@ +# Configuration file for the Sphinx documentation builder. +# +# This file only contains a selection of the most common options. For a full +# list see the documentation: +# https://www.sphinx-doc.org/en/master/usage/configuration.html + +# -- Path setup -------------------------------------------------------------- + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. 
+ +import os +import subprocess +import sys + +from sphinx.ext import autodoc + + +def install(package): + subprocess.check_call([sys.executable, '-m', 'pip', 'install', package]) + + +requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt')) +if os.path.exists(requirements_path): + with open(requirements_path) as f: + packages = f.readlines() + for package in packages: + install(package.strip()) + +sys.path.insert(0, os.path.abspath('../..')) + +# -- Project information ----------------------------------------------------- + +project = 'MinerU' +copyright = '2024, OpenDataLab' +author = 'MinerU Contributors' + +# The full version, including alpha/beta/rc tags +version_file = '../../magic_pdf/libs/version.py' +with open(version_file) as f: + exec(compile(f.read(), version_file, 'exec')) +__version__ = locals()['__version__'] +# The short X.Y version +version = __version__ +# The full version, including alpha/beta/rc tags +release = __version__ + +# -- General configuration --------------------------------------------------- + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + 'sphinx.ext.napoleon', + 'sphinx.ext.viewcode', + 'sphinx.ext.intersphinx', + 'sphinx_copybutton', + 'sphinx.ext.autodoc', + 'sphinx.ext.autosummary', + 'myst_parser', + 'sphinxarg.ext', +] + +# Add any paths that contain templates here, relative to this directory. +templates_path = ['_templates'] + +# List of patterns, relative to source directory, that match files and +# directories to ignore when looking for source files. +# This pattern also affects html_static_path and html_extra_path. 
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store'] + +# Exclude the prompt "$" when copying code +copybutton_prompt_text = r'\$ ' +copybutton_prompt_is_regexp = True + +language = 'zh_CN' + +# -- Options for HTML output ------------------------------------------------- + +# The theme to use for HTML and HTML Help pages. See the documentation for +# a list of builtin themes. +# +html_theme = 'sphinx_book_theme' +html_logo = '_static/image/logo.png' +html_theme_options = { + 'path_to_docs': 'docs/zh_cn', + 'repository_url': 'https://github.com/opendatalab/MinerU', + 'use_repository_button': True, +} +# Add any paths that contain custom static files (such as style sheets) here, +# relative to this directory. They are copied after the builtin static files, +# so a file named "default.css" will overwrite the builtin "default.css". +# html_static_path = ['_static'] + +# Mock out external dependencies here. +autodoc_mock_imports = [ + 'cpuinfo', + 'torch', + 'transformers', + 'psutil', + 'prometheus_client', + 'sentencepiece', + 'vllm.cuda_utils', + 'vllm._C', + 'numpy', + 'tqdm', +] + + +class MockedClassDocumenter(autodoc.ClassDocumenter): + """Remove note about base class when a class is derived from object.""" + + def add_line(self, line: str, source: str, *lineno: int) -> None: + if line == ' Bases: :py:class:`object`': + return + super().add_line(line, source, *lineno) + + +autodoc.ClassDocumenter = MockedClassDocumenter + +navigation_with_keys = False diff --git a/docs/zh_cn/index.rst b/docs/zh_cn/index.rst new file mode 100644 index 00000000..7b86aeb0 --- /dev/null +++ b/docs/zh_cn/index.rst @@ -0,0 +1,26 @@ +.. xtuner documentation master file, created by + sphinx-quickstart on Tue Jan 9 16:33:06 2024. + You can adapt this file completely to your liking, but it should at least + contain the root `toctree` directive. + +欢迎来到 MinerU 的中文文档 +============================================== + +.. 
figure:: ./_static/image/logo.png + :align: center + :alt: mineru + :class: no-scaled-link + +.. raw:: html + +
+   <p style="text-align:center">
+   <strong>一站式开源高质量数据提取工具</strong>
+   </p>
+
+   <p style="text-align:center">
+   <a href="https://github.com/opendatalab/MinerU">Star</a>
+   <a href="https://github.com/opendatalab/MinerU/subscribe">Watch</a>
+   <a href="https://github.com/opendatalab/MinerU/fork">Fork</a>
+   </p>
diff --git a/docs/zh_cn/make.bat b/docs/zh_cn/make.bat new file mode 100644 index 00000000..954237b9 --- /dev/null +++ b/docs/zh_cn/make.bat @@ -0,0 +1,35 @@ +@ECHO OFF + +pushd %~dp0 + +REM Command file for Sphinx documentation + +if "%SPHINXBUILD%" == "" ( + set SPHINXBUILD=sphinx-build +) +set SOURCEDIR=. +set BUILDDIR=_build + +%SPHINXBUILD% >NUL 2>NUL +if errorlevel 9009 ( + echo. + echo.The 'sphinx-build' command was not found. Make sure you have Sphinx + echo.installed, then set the SPHINXBUILD environment variable to point + echo.to the full path of the 'sphinx-build' executable. Alternatively you + echo.may add the Sphinx directory to PATH. + echo. + echo.If you don't have Sphinx installed, grab it from + echo.https://www.sphinx-doc.org/ + exit /b 1 +) + +if "%1" == "" goto help + +%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% +goto end + +:help +%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O% + +:end +popd diff --git a/magic_pdf/dict2md/ocr_mkcontent.py b/magic_pdf/dict2md/ocr_mkcontent.py index 21c9d7a2..f438f6e6 100644 --- a/magic_pdf/dict2md/ocr_mkcontent.py +++ b/magic_pdf/dict2md/ocr_mkcontent.py @@ -8,6 +8,7 @@ from magic_pdf.libs.MakeContentConfig import DropMode, MakeMode from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char from magic_pdf.libs.ocr_content_type import BlockType, ContentType +from magic_pdf.para.para_split_v3 import ListLineTag def __is_hyphen_at_line_end(line): @@ -124,7 +125,7 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout, for para_block in paras_of_layout: para_text = '' para_type = para_block['type'] - if para_type == BlockType.Text: + if para_type in [BlockType.Text, BlockType.List, BlockType.Index]: para_text = merge_para_with_text(para_block, parse_type=parse_type, lang=lang) elif para_type == BlockType.Title: para_text = f'# {merge_para_with_text(para_block, parse_type=parse_type, lang=lang)}' @@ -177,22 +178,26 @@ def 
ocr_mk_markdown_with_para_core_v2(paras_of_layout, return page_markdown -def merge_para_with_text(para_block, parse_type="auto", lang=None): - - def detect_language(text): - en_pattern = r'[a-zA-Z]+' - en_matches = re.findall(en_pattern, text) - en_length = sum(len(match) for match in en_matches) - if len(text) > 0: - if en_length / len(text) >= 0.5: - return 'en' - else: - return 'unknown' +def detect_language(text): + en_pattern = r'[a-zA-Z]+' + en_matches = re.findall(en_pattern, text) + en_length = sum(len(match) for match in en_matches) + if len(text) > 0: + if en_length / len(text) >= 0.5: + return 'en' else: - return 'empty' + return 'unknown' + else: + return 'empty' + +def merge_para_with_text(para_block, parse_type="auto", lang=None): para_text = '' - for line in para_block['lines']: + for i, line in enumerate(para_block['lines']): + + if i >= 1 and line.get(ListLineTag.IS_LIST_START_LINE, False): + para_text += ' \n' + line_text = '' line_lang = '' for span in line['spans']: diff --git a/magic_pdf/libs/draw_bbox.py b/magic_pdf/libs/draw_bbox.py index 36265fb2..225280f1 100644 --- a/magic_pdf/libs/draw_bbox.py +++ b/magic_pdf/libs/draw_bbox.py @@ -75,6 +75,8 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename): titles_list = [] texts_list = [] interequations_list = [] + lists_list = [] + indexs_list = [] for page in pdf_info: page_dropped_list = [] @@ -83,6 +85,8 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename): titles = [] texts = [] interequations = [] + lists = [] + indexs = [] for dropped_bbox in page['discarded_blocks']: page_dropped_list.append(dropped_bbox['bbox']) @@ -115,6 +119,11 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename): texts.append(bbox) elif block['type'] == BlockType.InterlineEquation: interequations.append(bbox) + elif block['type'] == BlockType.List: + lists.append(bbox) + elif block['type'] == BlockType.Index: + indexs.append(bbox) + tables_list.append(tables) 
tables_body_list.append(tables_body) tables_caption_list.append(tables_caption) @@ -126,6 +135,8 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename): titles_list.append(titles) texts_list.append(texts) interequations_list.append(interequations) + lists_list.append(lists) + indexs_list.append(indexs) layout_bbox_list = [] @@ -160,6 +171,8 @@ def draw_layout_bbox(pdf_info, pdf_bytes, out_path, filename): draw_bbox_without_number(i, texts_list, page, [153, 0, 76], True) draw_bbox_without_number(i, interequations_list, page, [0, 255, 0], True) + draw_bbox_without_number(i, lists_list, page, [40, 169, 92], True) + draw_bbox_without_number(i, indexs_list, page, [40, 169, 92], True) draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False, draw_bbox=False) @@ -224,6 +237,8 @@ def get_span_info(span): BlockType.Text, BlockType.Title, BlockType.InterlineEquation, + BlockType.List, + BlockType.Index, ]: for line in block['lines']: for span in line['spans']: diff --git a/magic_pdf/libs/ocr_content_type.py b/magic_pdf/libs/ocr_content_type.py index 749c16f9..30d88cfd 100644 --- a/magic_pdf/libs/ocr_content_type.py +++ b/magic_pdf/libs/ocr_content_type.py @@ -20,6 +20,8 @@ class BlockType: InterlineEquation = 'interline_equation' Footnote = 'footnote' Discarded = 'discarded' + List = 'list' + Index = 'index' class CategoryId: diff --git a/magic_pdf/model/doc_analyze_by_custom_model.py b/magic_pdf/model/doc_analyze_by_custom_model.py index dcd18d7c..3fbbea61 100644 --- a/magic_pdf/model/doc_analyze_by_custom_model.py +++ b/magic_pdf/model/doc_analyze_by_custom_model.py @@ -4,6 +4,7 @@ import numpy as np from loguru import logger +from magic_pdf.libs.clean_memory import clean_memory from magic_pdf.libs.config_reader import get_local_models_dir, get_device, get_table_recog_config from magic_pdf.model.model_list import MODEL import magic_pdf.model as model_config @@ -23,7 +24,7 @@ def remove_duplicates_dicts(lst): return unique_dicts -def 
load_images_from_pdf(pdf_bytes: bytes, dpi=200) -> list: +def load_images_from_pdf(pdf_bytes: bytes, dpi=200, start_page_id=0, end_page_id=None) -> list: try: from PIL import Image except ImportError: @@ -32,18 +33,28 @@ def load_images_from_pdf(pdf_bytes: bytes, dpi=200) -> list: images = [] with fitz.open("pdf", pdf_bytes) as doc: + pdf_page_num = doc.page_count + end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else pdf_page_num - 1 + if end_page_id > pdf_page_num - 1: + logger.warning("end_page_id is out of range, use images length") + end_page_id = pdf_page_num - 1 + for index in range(0, doc.page_count): - page = doc[index] - mat = fitz.Matrix(dpi / 72, dpi / 72) - pm = page.get_pixmap(matrix=mat, alpha=False) + if start_page_id <= index <= end_page_id: + page = doc[index] + mat = fitz.Matrix(dpi / 72, dpi / 72) + pm = page.get_pixmap(matrix=mat, alpha=False) + + # If the width or height exceeds 9000 after scaling, do not scale further. + if pm.width > 9000 or pm.height > 9000: + pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False) - # If the width or height exceeds 9000 after scaling, do not scale further. 
- if pm.width > 9000 or pm.height > 9000: - pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False) + img = Image.frombytes("RGB", (pm.width, pm.height), pm.samples) + img = np.array(img) + img_dict = {"img": img, "width": pm.width, "height": pm.height} + else: + img_dict = {"img": [], "width": 0, "height": 0} - img = Image.frombytes("RGB", (pm.width, pm.height), pm.samples) - img = np.array(img) - img_dict = {"img": img, "width": pm.width, "height": pm.height} images.append(img_dict) return images @@ -111,14 +122,14 @@ def doc_analyze(pdf_bytes: bytes, ocr: bool = False, show_log: bool = False, model_manager = ModelSingleton() custom_model = model_manager.get_model(ocr, show_log, lang) - images = load_images_from_pdf(pdf_bytes) - - # end_page_id = end_page_id if end_page_id else len(images) - 1 - end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(images) - 1 + with fitz.open("pdf", pdf_bytes) as doc: + pdf_page_num = doc.page_count + end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else pdf_page_num - 1 + if end_page_id > pdf_page_num - 1: + logger.warning("end_page_id is out of range, use images length") + end_page_id = pdf_page_num - 1 - if end_page_id > len(images) - 1: - logger.warning("end_page_id is out of range, use images length") - end_page_id = len(images) - 1 + images = load_images_from_pdf(pdf_bytes, start_page_id=start_page_id, end_page_id=end_page_id) model_json = [] doc_analyze_start = time.time() @@ -135,6 +146,11 @@ def doc_analyze(pdf_bytes: bytes, ocr: bool = False, show_log: bool = False, page_dict = {"layout_dets": result, "page_info": page_info} model_json.append(page_dict) + gc_start = time.time() + clean_memory() + gc_time = round(time.time() - gc_start, 2) + logger.info(f"gc time: {gc_time}") + doc_analyze_time = round(time.time() - doc_analyze_start, 2) doc_analyze_speed = round( (end_page_id + 1 - start_page_id) / doc_analyze_time, 2) logger.info(f"doc analyze time: 
{round(time.time() - doc_analyze_start, 2)}," diff --git a/magic_pdf/model/pdf_extract_kit.py b/magic_pdf/model/pdf_extract_kit.py index 1235a0a8..bca9b987 100644 --- a/magic_pdf/model/pdf_extract_kit.py +++ b/magic_pdf/model/pdf_extract_kit.py @@ -340,7 +340,7 @@ def __call__(self, image): if torch.cuda.is_available(): properties = torch.cuda.get_device_properties(self.device) total_memory = properties.total_memory / (1024 ** 3) # 将字节转换为 GB - if total_memory <= 8: + if total_memory <= 10: gc_start = time.time() clean_memory() gc_time = round(time.time() - gc_start, 2) diff --git a/magic_pdf/para/para_split_v3.py b/magic_pdf/para/para_split_v3.py new file mode 100644 index 00000000..058b0343 --- /dev/null +++ b/magic_pdf/para/para_split_v3.py @@ -0,0 +1,283 @@ +import copy + +from loguru import logger + +from magic_pdf.libs.Constants import LINES_DELETED, CROSS_PAGE +from magic_pdf.libs.ocr_content_type import BlockType, ContentType + +LINE_STOP_FLAG = ('.', '!', '?', '。', '!', '?', ')', ')', '"', '”', ':', ':', ';', ';') +LIST_END_FLAG = ('.', '。', ';', ';') + + +class ListLineTag: + IS_LIST_START_LINE = "is_list_start_line" + IS_LIST_END_LINE = "is_list_end_line" + + +def __process_blocks(blocks): + # 对所有block预处理 + # 1.通过title和interline_equation将block分组 + # 2.bbox边界根据line信息重置 + + result = [] + current_group = [] + + for i in range(len(blocks)): + current_block = blocks[i] + + # 如果当前块是 text 类型 + if current_block['type'] == 'text': + current_block["bbox_fs"] = copy.deepcopy(current_block["bbox"]) + if 'lines' in current_block and len(current_block["lines"]) > 0: + current_block['bbox_fs'] = [min([line['bbox'][0] for line in current_block['lines']]), + min([line['bbox'][1] for line in current_block['lines']]), + max([line['bbox'][2] for line in current_block['lines']]), + max([line['bbox'][3] for line in current_block['lines']])] + current_group.append(current_block) + + # 检查下一个块是否存在 + if i + 1 < len(blocks): + next_block = blocks[i + 1] + # 如果下一个块不是 text 类型且是 title 
或 interline_equation 类型 + if next_block['type'] in ['title', 'interline_equation']: + result.append(current_group) + current_group = [] + + # 处理最后一个 group + if current_group: + result.append(current_group) + + return result + + +def __is_list_or_index_block(block): + # 一个block如果是list block 应该同时满足以下特征 + # 1.block内有多个line 2.block 内有多个line左侧顶格写 3.block内有多个line 右侧不顶格(狗牙状) + # 1.block内有多个line 2.block 内有多个line左侧顶格写 3.多个line以endflag结尾 + # 1.block内有多个line 2.block 内有多个line左侧顶格写 3.block内有多个line 左侧不顶格 + + # index block 是一种特殊的list block + # 一个block如果是index block 应该同时满足以下特征 + # 1.block内有多个line 2.block 内有多个line两侧均顶格写 3.line的开头或者结尾均为数字 + if len(block['lines']) >= 3: + first_line = block['lines'][0] + line_height = first_line['bbox'][3] - first_line['bbox'][1] + block_weight = block['bbox_fs'][2] - block['bbox_fs'][0] + + left_close_num = 0 + left_not_close_num = 0 + right_not_close_num = 0 + right_close_num = 0 + lines_text_list = [] + + multiple_para_flag = False + last_line = block['lines'][-1] + # 如果首行左边不顶格而右边顶格,末行左边顶格而右边不顶格 (第一行可能可以右边不顶格) + if (first_line['bbox'][0] - block['bbox_fs'][0] > line_height / 2 and + # block['bbox_fs'][2] - first_line['bbox'][2] < line_height and + abs(last_line['bbox'][0] - block['bbox_fs'][0]) < line_height / 2 and + block['bbox_fs'][2] - last_line['bbox'][2] > line_height + ): + multiple_para_flag = True + + for line in block['lines']: + + line_text = "" + + for span in line['spans']: + span_type = span['type'] + if span_type == ContentType.Text: + line_text += span['content'].strip() + + lines_text_list.append(line_text) + + # 计算line左侧顶格数量是否大于2,是否顶格用abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height/2 来判断 + if abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height / 2: + left_close_num += 1 + elif line['bbox'][0] - block['bbox_fs'][0] > line_height: + # logger.info(f"{line_text}, {block['bbox_fs']}, {line['bbox']}") + left_not_close_num += 1 + + # 计算右侧是否顶格 + if abs(block['bbox_fs'][2] - line['bbox'][2]) < line_height: + right_close_num 
+= 1 + else: + # When the right side is not flush, check whether there is a real gap; as a rule of thumb, 0.3 of the block width is used as the threshold + closed_area = 0.3 * block_weight + # closed_area = 5 * line_height + if block['bbox_fs'][2] - line['bbox'][2] > closed_area: + right_not_close_num += 1 + + # Check whether more than 80% of the entries in lines_text_list end with LIST_END_FLAG + line_end_flag = False + # Check whether more than 80% of the entries in lines_text_list start with a digit or end with a digit + line_num_flag = False + num_start_count = 0 + num_end_count = 0 + flag_end_count = 0 + if len(lines_text_list) > 0: + for line_text in lines_text_list: + if len(line_text) > 0: + if line_text[-1] in LIST_END_FLAG: + flag_end_count += 1 + if line_text[0].isdigit(): + num_start_count += 1 + if line_text[-1].isdigit(): + num_end_count += 1 + + if flag_end_count / len(lines_text_list) >= 0.8: + line_end_flag = True + + if num_start_count / len(lines_text_list) >= 0.8 or num_end_count / len(lines_text_list) >= 0.8: + line_num_flag = True + + # Some tables of contents are not flush on the right; for now a block counts as an index if either the left or the right side is fully flush and the digit rule holds + if ((left_close_num/len(block['lines']) >= 0.8 or right_close_num/len(block['lines']) >= 0.8) + and line_num_flag + ): + for line in block['lines']: + line[ListLineTag.IS_LIST_START_LINE] = True + return BlockType.Index + + elif left_close_num >= 2 and ( + right_not_close_num >= 2 or line_end_flag or left_not_close_num >= 2) and not multiple_para_flag: + # Handle a special un-indented list where every line is flush left; use the right-side gap to decide whether a line ends an item + if left_close_num / len(block['lines']) > 0.9: + # Case: a list of short one-line items, all flush left + if flag_end_count == 0 and right_close_num / len(block['lines']) < 0.5: + for line in block['lines']: + if abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height / 2: + line[ListLineTag.IS_LIST_START_LINE] = True + # Case: most line items carry an end flag; split items by that flag + elif line_end_flag: + for i, line in enumerate(block['lines']): + if lines_text_list[i][-1] in LIST_END_FLAG: + line[ListLineTag.IS_LIST_END_LINE] = True + if i + 1 < len(block['lines']): + block['lines'][i+1][ListLineTag.IS_LIST_START_LINE] = True + # Case: line items have almost no end flags and no indentation; use the right-side gap to find the item ends + else: + line_start_flag = 
False + for i, line in enumerate(block['lines']): + if line_start_flag: + line[ListLineTag.IS_LIST_START_LINE] = True + line_start_flag = False + elif abs(block['bbox_fs'][2] - line['bbox'][2]) > line_height: + line[ListLineTag.IS_LIST_END_LINE] = True + line_start_flag = True + # A special indented ordered list: start lines are not flush left and begin with a digit; end lines end with an end flag and their count matches the start lines + elif num_start_count >= 2 and num_start_count == flag_end_count: # keep it simple for now and ignore the not-flush-left case + for i, line in enumerate(block['lines']): + if lines_text_list[i][0].isdigit(): + line[ListLineTag.IS_LIST_START_LINE] = True + if lines_text_list[i][-1] in LIST_END_FLAG: + line[ListLineTag.IS_LIST_END_LINE] = True + else: + # Normal handling for indented lists + for line in block['lines']: + if abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height / 2: + line[ListLineTag.IS_LIST_START_LINE] = True + if abs(block['bbox_fs'][2] - line['bbox'][2]) > line_height: + line[ListLineTag.IS_LIST_END_LINE] = True + + return BlockType.List + else: + return BlockType.Text + else: + return BlockType.Text + + +def __merge_2_text_blocks(block1, block2): + if len(block1['lines']) > 0: + first_line = block1['lines'][0] + line_height = first_line['bbox'][3] - first_line['bbox'][1] + block1_weight = block1['bbox'][2] - block1['bbox'][0] + block2_weight = block2['bbox'][2] - block2['bbox'][0] + min_block_weight = min(block1_weight, block2_weight) + if abs(block1['bbox_fs'][0] - first_line['bbox'][0]) < line_height / 2: + last_line = block2['lines'][-1] + if len(last_line['spans']) > 0: + last_span = last_line['spans'][-1] + line_height = last_line['bbox'][3] - last_line['bbox'][1] + if (abs(block2['bbox_fs'][2] - last_line['bbox'][2]) < line_height and + not last_span['content'].endswith(LINE_STOP_FLAG) and + # Do not merge if the two block widths differ by more than a factor of 2 + abs(block1_weight - block2_weight) < min_block_weight + ): + if block1['page_num'] != block2['page_num']: + for line in block1['lines']: + for span in line['spans']: + span[CROSS_PAGE] = True + 
block2['lines'].extend(block1['lines']) + block1['lines'] = [] + block1[LINES_DELETED] = True + + return block1, block2 + + +def __merge_2_list_blocks(block1, block2): + if block1['page_num'] != block2['page_num']: + for line in block1['lines']: + for span in line['spans']: + span[CROSS_PAGE] = True + block2['lines'].extend(block1['lines']) + block1['lines'] = [] + block1[LINES_DELETED] = True + + return block1, block2 + + +def __para_merge_page(blocks): + page_text_blocks_groups = __process_blocks(blocks) + for text_blocks_group in page_text_blocks_groups: + + if len(text_blocks_group) > 0: + # Before merging, first decide for every block whether it is a list or index block + for block in text_blocks_group: + block_type = __is_list_or_index_block(block) + block['type'] = block_type + # logger.info(f"{block['type']}:{block}") + + if len(text_blocks_group) > 1: + # Iterate in reverse order + for i in range(len(text_blocks_group) - 1, -1, -1): + current_block = text_blocks_group[i] + + # Check whether a previous block exists + if i - 1 >= 0: + prev_block = text_blocks_group[i - 1] + + if current_block['type'] == 'text' and prev_block['type'] == 'text': + __merge_2_text_blocks(current_block, prev_block) + elif ( + (current_block['type'] == BlockType.List and prev_block['type'] == BlockType.List) or + (current_block['type'] == BlockType.Index and prev_block['type'] == BlockType.Index) + ): + __merge_2_list_blocks(current_block, prev_block) + + else: + continue + + +def para_split(pdf_info_dict, debug_mode=False): + all_blocks = [] + for page_num, page in pdf_info_dict.items(): + blocks = copy.deepcopy(page['preproc_blocks']) + for block in blocks: + block['page_num'] = page_num + all_blocks.extend(blocks) + + __para_merge_page(all_blocks) + for page_num, page in pdf_info_dict.items(): + page['para_blocks'] = [] + for block in all_blocks: + if block['page_num'] == page_num: + page['para_blocks'].append(block) + + +if __name__ == '__main__': + input_blocks = [] + # Call the grouping function on an example input + groups = __process_blocks(input_blocks) + for group_index, group in enumerate(groups): 
+ print(f"Group {group_index}: {group}") diff --git a/magic_pdf/pdf_parse_union_core_v2.py b/magic_pdf/pdf_parse_union_core_v2.py index 1fd7604b..7f01bd50 100644 --- a/magic_pdf/pdf_parse_union_core_v2.py +++ b/magic_pdf/pdf_parse_union_core_v2.py @@ -17,6 +17,7 @@ from magic_pdf.libs.local_math import float_equal from magic_pdf.libs.ocr_content_type import ContentType from magic_pdf.model.magic_model import MagicModel +from magic_pdf.para.para_split_v3 import para_split from magic_pdf.pre_proc.citationmarker_remove import remove_citation_marker from magic_pdf.pre_proc.construct_page_dict import ocr_construct_page_component_v2 from magic_pdf.pre_proc.cut_image import ocr_cut_image_and_table @@ -359,7 +360,7 @@ def parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter, need_drop, drop_reason) '''将span填入blocks中''' - block_with_spans, spans = fill_spans_in_blocks(all_bboxes, spans, 0.3) + block_with_spans, spans = fill_spans_in_blocks(all_bboxes, spans, 0.5) '''对block进行fix操作''' fix_blocks = fix_block_spans(block_with_spans, img_blocks, table_blocks) @@ -435,9 +436,7 @@ def pdf_parse_union(pdf_bytes, pdf_info_dict[f"page_{page_id}"] = page_info """分段""" - # para_split(pdf_info_dict, debug_mode=debug_mode) - for page_num, page in pdf_info_dict.items(): - page['para_blocks'] = page['preproc_blocks'] + para_split(pdf_info_dict, debug_mode=debug_mode) """dict转list""" pdf_info_list = dict_to_list(pdf_info_dict) diff --git a/magic_pdf/pre_proc/ocr_detect_all_bboxes.py b/magic_pdf/pre_proc/ocr_detect_all_bboxes.py index 9767030b..8725b884 100644 --- a/magic_pdf/pre_proc/ocr_detect_all_bboxes.py +++ b/magic_pdf/pre_proc/ocr_detect_all_bboxes.py @@ -108,7 +108,7 @@ def ocr_prepare_bboxes_for_layout_split_v2(img_blocks, table_blocks, discarded_b all_bboxes = remove_overlaps_min_blocks(all_bboxes) all_discarded_blocks = remove_overlaps_min_blocks(all_discarded_blocks) '''将剩余的bbox做分离处理,防止后面分layout时出错''' - # all_bboxes, drop_reasons = 
remove_overlap_between_bbox_for_block(all_bboxes) + all_bboxes, drop_reasons = remove_overlap_between_bbox_for_block(all_bboxes) return all_bboxes, all_discarded_blocks diff --git a/magic_pdf/pre_proc/ocr_dict_merge.py b/magic_pdf/pre_proc/ocr_dict_merge.py index d9a24319..69c4982f 100644 --- a/magic_pdf/pre_proc/ocr_dict_merge.py +++ b/magic_pdf/pre_proc/ocr_dict_merge.py @@ -49,8 +49,7 @@ def merge_spans_to_line(spans): continue # 如果当前的span与当前行的最后一个span在y轴上重叠,则添加到当前行 - if __is_overlaps_y_exceeds_threshold(span['bbox'], - current_line[-1]['bbox']): + if __is_overlaps_y_exceeds_threshold(span['bbox'], current_line[-1]['bbox'], 0.6): current_line.append(span) else: # 否则,开始新行 diff --git a/docs/FAQ_en_us.md b/old_docs/FAQ_en_us.md similarity index 96% rename from docs/FAQ_en_us.md rename to old_docs/FAQ_en_us.md index b177e8fc..21ca861b 100644 --- a/docs/FAQ_en_us.md +++ b/old_docs/FAQ_en_us.md @@ -11,7 +11,7 @@ pip install magic-pdf[full] ### 2. Encountering the error `pickle.UnpicklingError: invalid load key, 'v'.` during use -This might be due to an incomplete download of the model file. You can try re-downloading the model file and then try again. +This might be due to an incomplete download of the model file. You can try re-downloading the model file and then try again. Reference: https://github.com/opendatalab/MinerU/issues/143 ### 3. Where should the model files be downloaded and how should the `/models-dir` configuration be set? @@ -24,7 +24,7 @@ The path for the model files is configured in "magic-pdf.json". just like: } ``` -This path is an absolute path, not a relative path. You can obtain the absolute path in the models directory using the "pwd" command. +This path is an absolute path, not a relative path. You can obtain the absolute path in the models directory using the "pwd" command. Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874 ### 4. 
Encountered the error `ImportError: libGL.so.1: cannot open shared object file: No such file or directory` in Ubuntu 22.04 on WSL2 @@ -38,17 +38,22 @@ sudo apt-get install libgl1-mesa-glx Reference: https://github.com/opendatalab/MinerU/issues/388 ### 5. Encountered error `ModuleNotFoundError: No module named 'fairscale'` + You need to uninstall the module and reinstall it: + ```bash pip uninstall fairscale pip install fairscale ``` + Reference: https://github.com/opendatalab/MinerU/issues/411 ### 6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled. The compatibility of cuda11 with new graphics cards is poor, and the CUDA version used by Paddle needs to be upgraded. + ```bash pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ ``` + Reference: https://github.com/opendatalab/MinerU/issues/558 diff --git a/docs/FAQ_zh_cn.md b/old_docs/FAQ_zh_cn.md similarity index 92% rename from docs/FAQ_zh_cn.md rename to old_docs/FAQ_zh_cn.md index 1fe51551..667abfa8 100644 --- a/docs/FAQ_zh_cn.md +++ b/old_docs/FAQ_zh_cn.md @@ -1,9 +1,10 @@ # 常见问题解答 -### 1.在较新版本的mac上使用命令安装pip install magic-pdf[full] zsh: no matches found: magic-pdf[full] +### 1.在较新版本的mac上使用命令安装pip install magic-pdf\[full\] zsh: no matches found: magic-pdf\[full\] 在 macOS 上,默认的 shell 从 Bash 切换到了 Z shell,而 Z shell 对于某些类型的字符串匹配有特殊的处理逻辑,这可能导致no matches found错误。 可以通过在命令行禁用globbing特性,再尝试运行安装命令 + ```bash setopt no_nomatch pip install magic-pdf[full] @@ -11,41 +12,50 @@ pip install magic-pdf[full] ### 2.使用过程中遇到_pickle.UnpicklingError: invalid load key, 'v'.错误 -可能是由于模型文件未下载完整导致,可尝试重新下载模型文件后再试 +可能是由于模型文件未下载完整导致,可尝试重新下载模型文件后再试 参考:https://github.com/opendatalab/MinerU/issues/143 ### 3.模型文件应该下载到哪里/models-dir的配置应该怎么填 模型文件的路径输入是在"magic-pdf.json"中通过 + ```json { "models-dir": "/tmp/models" } ``` + 进行配置的。 -这个路径是绝对路径而不是相对路径,绝对路径的获取可在models目录中通过命令 "pwd" 获取。 +这个路径是绝对路径而不是相对路径,绝对路径的获取可在models目录中通过命令 "pwd" 获取。 
参考:https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874 ### 4.在WSL2的Ubuntu22.04中遇到报错`ImportError: libGL.so.1: cannot open shared object file: No such file or directory` WSL2的Ubuntu22.04中缺少`libgl`库,可通过以下命令安装`libgl`库解决: + ```bash sudo apt-get install libgl1-mesa-glx ``` + 参考:https://github.com/opendatalab/MinerU/issues/388 ### 5.遇到报错 `ModuleNotFoundError: No module named 'fairscale'` + 需要卸载该模块并重新安装 + ```bash pip uninstall fairscale pip install fairscale ``` + 参考:https://github.com/opendatalab/MinerU/issues/411 ### 6.在部分较新的设备如H100上,使用CUDA加速OCR时解析出的文字乱码。 cuda11对新显卡的兼容性不好,需要升级paddle使用的cuda版本 + ```bash pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/ ``` + 参考:https://github.com/opendatalab/MinerU/issues/558 diff --git a/old_docs/README_Ubuntu_CUDA_Acceleration_en_US.md b/old_docs/README_Ubuntu_CUDA_Acceleration_en_US.md new file mode 100644 index 00000000..23adfea4 --- /dev/null +++ b/old_docs/README_Ubuntu_CUDA_Acceleration_en_US.md @@ -0,0 +1,120 @@ +# Ubuntu 22.04 LTS + +### 1. Check if NVIDIA Drivers Are Installed + +```sh +nvidia-smi +``` + +If you see information similar to the following, it means that the NVIDIA drivers are already installed, and you can skip Step 2. + +```plaintext ++---------------------------------------------------------------------------------------+ +| NVIDIA-SMI 537.34 Driver Version: 537.34 CUDA Version: 12.2 | +|-----------------------------------------+----------------------+----------------------+ +| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|=========================================+======================+======================| +| 0 NVIDIA GeForce RTX 3060 Ti WDDM | 00000000:01:00.0 On | N/A | +| 0% 51C P8 12W / 200W | 1489MiB / 8192MiB | 5% Default | +| | | N/A | ++-----------------------------------------+----------------------+----------------------+ +``` + +### 2. 
Install the Driver + +If no driver is installed, use the following command: + +```sh +sudo apt-get update +sudo apt-get install nvidia-driver-545 +``` + +Install the proprietary driver and restart your computer after installation. + +```sh +reboot +``` + +### 3. Install Anaconda + +If Anaconda is already installed, skip this step. + +```sh +wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh +bash Anaconda3-2024.06-1-Linux-x86_64.sh +``` + +In the final step, enter `yes`, close the terminal, and reopen it. + +### 4. Create an Environment Using Conda + +Specify Python version 3.10. + +```sh +conda create -n MinerU python=3.10 +conda activate MinerU +``` + +### 5. Install Applications + +```sh +pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com +``` + +❗ After installation, make sure to check the version of `magic-pdf` using the following command: + +```sh +magic-pdf --version +``` + +If the version number is less than 0.7.0, please report the issue. + +### 6. Download Models + +Refer to detailed instructions on [how to download model files](how_to_download_models_en.md). + +## 7. Understand the Location of the Configuration File + +After completing the [6. Download Models](#6-download-models) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path. +You can find the `magic-pdf.json` file in your user directory. + +> The user directory for Linux is "/home/username". + +### 8. First Run + +Download a sample file from the repository and test it. + +```sh +wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf +magic-pdf -p small_ocr.pdf +``` + +### 9. 
Test CUDA Acceleration + +If your graphics card has at least **8GB** of VRAM, follow these steps to test CUDA acceleration: + +> ❗ Due to the extremely limited nature of 8GB VRAM for running this application, you need to close all other programs using VRAM to ensure that 8GB of VRAM is available when running this application. + +1. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file located in your home directory. + ```json + { + "device-mode": "cuda" + } + ``` +2. Test CUDA acceleration with the following command: + ```sh + magic-pdf -p small_ocr.pdf + ``` + +### 10. Enable CUDA Acceleration for OCR + +1. Download `paddlepaddle-gpu`. Installation will automatically enable OCR acceleration. + ```sh + python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ + ``` +2. Test OCR acceleration with the following command: + ```sh + magic-pdf -p small_ocr.pdf + ``` diff --git a/docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md b/old_docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md similarity index 92% rename from docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md rename to old_docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md index dfca4062..ebef3255 100644 --- a/docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md +++ b/old_docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md @@ -1,10 +1,13 @@ # Ubuntu 22.04 LTS ## 1. 检测是否已安装nvidia驱动 + ```bash -nvidia-smi +nvidia-smi ``` + 如果看到类似如下的信息,说明已经安装了nvidia驱动,可以跳过步骤2 + ``` +---------------------------------------------------------------------------------------+ | NVIDIA-SMI 537.34 Driver Version: 537.34 CUDA Version: 12.2 | @@ -18,78 +21,110 @@ nvidia-smi | | | N/A | +-----------------------------------------+----------------------+----------------------+ ``` + ## 2. 安装驱动 + 如没有驱动,则通过如下命令 + ```bash sudo apt-get update sudo apt-get install nvidia-driver-545 ``` + 安装专有驱动,安装完成后,重启电脑 + ```bash reboot ``` + ## 3. 
安装anaconda + 如果已安装conda,可以跳过本步骤 + ```bash wget -U NoSuchBrowser/1.0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh bash Anaconda3-2024.06-1-Linux-x86_64.sh ``` + 最后一步输入yes,关闭终端重新打开 + ## 4. 使用conda 创建环境 + 需指定python版本为3.10 + ```bash conda create -n MinerU python=3.10 conda activate MinerU ``` + ## 5. 安装应用 + ```bash pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple ``` + > ❗️下载完成后,务必通过以下命令确认magic-pdf的版本是否正确 -> +> > ```bash > magic-pdf --version ->``` +> ``` +> > 如果版本号小于0.7.0,请到issue中向我们反馈 ## 6. 下载模型 + 详细参考 [如何下载模型文件](how_to_download_models_zh_cn.md) ## 7. 了解配置文件存放的位置 + 完成[6.下载模型](#6-下载模型)步骤后,脚本会自动生成用户目录下的magic-pdf.json文件,并自动配置默认模型路径。 -您可在【用户目录】下找到magic-pdf.json文件。 +您可在【用户目录】下找到magic-pdf.json文件。 + > linux用户目录为 "/home/用户名" ## 8. 第一次运行 + 从仓库中下载样本文件,并测试 + ```bash wget https://gitee.com/myhloli/MinerU/raw/master/demo/small_ocr.pdf magic-pdf -p small_ocr.pdf ``` + ## 9. 测试CUDA加速 + 如果您的显卡显存大于等于 **8GB** ,可以进行以下流程,测试CUDA解析加速效果 ->❗️因8GB显存运行本应用非常极限,需要关闭所有其他正在使用显存的程序以确保本应用运行时有足额8GB显存可用。 + +> ❗️因8GB显存运行本应用非常极限,需要关闭所有其他正在使用显存的程序以确保本应用运行时有足额8GB显存可用。 **1.修改【用户目录】中配置文件magic-pdf.json中"device-mode"的值** + ```json { "device-mode":"cuda" } ``` + **2.运行以下命令测试cuda加速效果** + ```bash magic-pdf -p small_ocr.pdf ``` + > 提示:CUDA加速是否生效可以根据log中输出的各个阶段cost耗时来简单判断,通常情况下,`layout detection cost` 和 `mfr time` 应提速10倍以上。 ## 10. 
为ocr开启cuda加速 **1.下载paddlepaddle-gpu, 安装完成后会自动开启ocr加速** + ```bash python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ ``` + **2.运行以下命令测试ocr加速效果** + ```bash magic-pdf -p small_ocr.pdf ``` + > 提示:CUDA加速是否生效可以根据log中输出的各个阶段cost耗时来简单判断,通常情况下,`ocr cost`应提速10倍以上。 diff --git a/old_docs/README_Windows_CUDA_Acceleration_en_US.md b/old_docs/README_Windows_CUDA_Acceleration_en_US.md new file mode 100644 index 00000000..c170cc08 --- /dev/null +++ b/old_docs/README_Windows_CUDA_Acceleration_en_US.md @@ -0,0 +1,102 @@ +# Windows 10/11 + +### 1. Install CUDA and cuDNN + +Required versions: CUDA 11.8 + cuDNN 8.7.0 + +- CUDA 11.8: https://developer.nvidia.com/cuda-11-8-0-download-archive +- cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x: https://developer.nvidia.com/rdp/cudnn-archive + +### 2. Install Anaconda + +If Anaconda is already installed, you can skip this step. + +Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86_64.exe + +### 3. Create an Environment Using Conda + +Python version must be 3.10. + +``` +conda create -n MinerU python=3.10 +conda activate MinerU +``` + +### 4. Install Applications + +``` +pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com +``` + +> ❗️After installation, verify the version of `magic-pdf`: +> +> ```bash +> magic-pdf --version +> ``` +> +> If the version number is less than 0.7.0, please report it in the issues section. + +### 5. Download Models + +Refer to detailed instructions on [how to download model files](how_to_download_models_en.md). + +### 6. Understand the Location of the Configuration File + +After completing the [5. Download Models](#5-download-models) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path. +You can find the `magic-pdf.json` file in your 【user directory】 . + +> The user directory for Windows is "C:/Users/username". + +### 7. 
First Run + +Download a sample file from the repository and test it. + +```powershell + wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf + magic-pdf -p small_ocr.pdf +``` + +### 8. Test CUDA Acceleration + +If your graphics card has at least 8GB of VRAM, follow these steps to test CUDA-accelerated parsing performance. + +> ❗ Due to the extremely limited nature of 8GB VRAM for running this application, you need to close all other programs using VRAM to ensure that 8GB of VRAM is available when running this application. + +1. **Overwrite the installation of torch and torchvision** supporting CUDA. + + ``` + pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118 + ``` + + > ❗️Ensure the following versions are specified in the command: + > + > ``` + > torch==2.3.1 torchvision==0.18.1 + > ``` + > + > These are the highest versions we support. Installing higher versions without specifying them will cause the program to fail. + +2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory. + + ```json + { + "device-mode": "cuda" + } + ``` + +3. **Run the following command to test CUDA acceleration**: + + ``` + magic-pdf -p small_ocr.pdf + ``` + +### 9. Enable CUDA Acceleration for OCR + +1. **Download paddlepaddle-gpu**, which will automatically enable OCR acceleration upon installation. + ``` + pip install paddlepaddle-gpu==2.6.1 + ``` +2. 
**Run the following command to test OCR acceleration**: + ``` + magic-pdf -p small_ocr.pdf + ``` diff --git a/docs/README_Windows_CUDA_Acceleration_zh_CN.md b/old_docs/README_Windows_CUDA_Acceleration_zh_CN.md similarity index 92% rename from docs/README_Windows_CUDA_Acceleration_zh_CN.md rename to old_docs/README_Windows_CUDA_Acceleration_zh_CN.md index 53161371..8d2457a6 100644 --- a/docs/README_Windows_CUDA_Acceleration_zh_CN.md +++ b/old_docs/README_Windows_CUDA_Acceleration_zh_CN.md @@ -3,82 +3,108 @@ ## 1. 安装cuda和cuDNN 需要安装的版本 CUDA 11.8 + cuDNN 8.7.0 + - CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive - cuDNN v8.7.0 (November 28th, 2022), for CUDA 11.x https://developer.nvidia.com/rdp/cudnn-archive ## 2. 安装anaconda + 如果已安装conda,可以跳过本步骤 下载链接: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Windows-x86_64.exe ## 3. 使用conda 创建环境 + 需指定python版本为3.10 + ```bash conda create -n MinerU python=3.10 conda activate MinerU ``` + ## 4. 安装应用 + ```bash pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple ``` + > ❗️下载完成后,务必通过以下命令确认magic-pdf的版本是否正确 -> +> > ```bash > magic-pdf --version ->``` +> ``` +> > 如果版本号小于0.7.0,请到issue中向我们反馈 ## 5. 下载模型 + 详细参考 [如何下载模型文件](how_to_download_models_zh_cn.md) ## 6. 了解配置文件存放的位置 + 完成[5.下载模型](#5-下载模型)步骤后,脚本会自动生成用户目录下的magic-pdf.json文件,并自动配置默认模型路径。 您可在【用户目录】下找到magic-pdf.json文件。 + > windows用户目录为 "C:/Users/用户名" ## 7. 第一次运行 + 从仓库中下载样本文件,并测试 + ```powershell wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf magic-pdf -p small_ocr.pdf ``` ## 8. 
测试CUDA加速 + 如果您的显卡显存大于等于 **8GB** ,可以进行以下流程,测试CUDA解析加速效果 ->❗️因8GB显存运行本应用非常极限,需要关闭所有其他正在使用显存的程序以确保本应用运行时有足额8GB显存可用。 + +> ❗️因8GB显存运行本应用非常极限,需要关闭所有其他正在使用显存的程序以确保本应用运行时有足额8GB显存可用。 **1.覆盖安装支持cuda的torch和torchvision** + ```bash pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118 ``` + > ❗️务必在命令中指定以下版本 +> > ```bash -> torch==2.3.1 torchvision==0.18.1 +> torch==2.3.1 torchvision==0.18.1 > ``` +> > 这是我们支持的最高版本,如果不指定版本会自动安装更高版本导致程序无法运行 **2.修改【用户目录】中配置文件magic-pdf.json中"device-mode"的值** + ```json { "device-mode":"cuda" } ``` + **3.运行以下命令测试cuda加速效果** + ```bash magic-pdf -p small_ocr.pdf ``` + > 提示:CUDA加速是否生效可以根据log中输出的各个阶段的耗时来简单判断,通常情况下,`layout detection time` 和 `mfr time` 应提速10倍以上。 ## 9. 为ocr开启cuda加速 **1.下载paddlepaddle-gpu, 安装完成后会自动开启ocr加速** + ```bash pip install paddlepaddle-gpu==2.6.1 ``` + **2.运行以下命令测试ocr加速效果** + ```bash magic-pdf -p small_ocr.pdf ``` -> 提示:CUDA加速是否生效可以根据log中输出的各个阶段cost耗时来简单判断,通常情况下,`ocr time`应提速10倍以上。 +> 提示:CUDA加速是否生效可以根据log中输出的各个阶段cost耗时来简单判断,通常情况下,`ocr time`应提速10倍以上。 diff --git a/docs/chemical_knowledge_introduction/introduction.pdf b/old_docs/chemical_knowledge_introduction/introduction.pdf similarity index 100% rename from docs/chemical_knowledge_introduction/introduction.pdf rename to old_docs/chemical_knowledge_introduction/introduction.pdf diff --git a/docs/chemical_knowledge_introduction/introduction.xmind b/old_docs/chemical_knowledge_introduction/introduction.xmind similarity index 100% rename from docs/chemical_knowledge_introduction/introduction.xmind rename to old_docs/chemical_knowledge_introduction/introduction.xmind diff --git a/docs/download_models.py b/old_docs/download_models.py similarity index 78% rename from docs/download_models.py rename to old_docs/download_models.py index 7541bdd2..7f116a0c 100644 --- a/docs/download_models.py +++ b/old_docs/download_models.py @@ -1,6 +1,7 @@ +import json import os + import requests -import json from modelscope import snapshot_download @@ 
-27,13 +28,13 @@ def download_and_modify_json(url, local_filename, modifications): if __name__ == '__main__': model_dir = snapshot_download('opendatalab/PDF-Extract-Kit') layoutreader_model_dir = snapshot_download('ppaanngggg/layoutreader') - model_dir = model_dir + "/models" - print(f"model_dir is: {model_dir}") - print(f"layoutreader_model_dir is: {layoutreader_model_dir}") + model_dir = model_dir + '/models' + print(f'model_dir is: {model_dir}') + print(f'layoutreader_model_dir is: {layoutreader_model_dir}') json_url = 'https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json' - config_file_name = "magic-pdf.json" - home_dir = os.path.expanduser("~") + config_file_name = 'magic-pdf.json' + home_dir = os.path.expanduser('~') config_file = os.path.join(home_dir, config_file_name) json_mods = { @@ -42,5 +43,4 @@ def download_and_modify_json(url, local_filename, modifications): } download_and_modify_json(json_url, config_file, json_mods) - print(f"The configuration file has been configured successfully, the path is: {config_file}") - + print(f'The configuration file has been configured successfully, the path is: {config_file}') diff --git a/docs/download_models_hf.py b/old_docs/download_models_hf.py similarity index 78% rename from docs/download_models_hf.py rename to old_docs/download_models_hf.py index 8bd06901..915f1a24 100644 --- a/docs/download_models_hf.py +++ b/old_docs/download_models_hf.py @@ -1,6 +1,7 @@ +import json import os + import requests -import json from huggingface_hub import snapshot_download @@ -27,13 +28,13 @@ def download_and_modify_json(url, local_filename, modifications): if __name__ == '__main__': model_dir = snapshot_download('opendatalab/PDF-Extract-Kit') layoutreader_model_dir = snapshot_download('hantian/layoutreader') - model_dir = model_dir + "/models" - print(f"model_dir is: {model_dir}") - print(f"layoutreader_model_dir is: {layoutreader_model_dir}") + model_dir = model_dir + '/models' + print(f'model_dir is: 
{model_dir}') + print(f'layoutreader_model_dir is: {layoutreader_model_dir}') json_url = 'https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json' - config_file_name = "magic-pdf.json" - home_dir = os.path.expanduser("~") + config_file_name = 'magic-pdf.json' + home_dir = os.path.expanduser('~') config_file = os.path.join(home_dir, config_file_name) json_mods = { @@ -42,5 +43,4 @@ def download_and_modify_json(url, local_filename, modifications): } download_and_modify_json(json_url, config_file, json_mods) - print(f"The configuration file has been configured successfully, the path is: {config_file}") - + print(f'The configuration file has been configured successfully, the path is: {config_file}') diff --git a/docs/how_to_download_models_en.md b/old_docs/how_to_download_models_en.md similarity index 87% rename from docs/how_to_download_models_en.md rename to old_docs/how_to_download_models_en.md index 438e4bc8..359afdea 100644 --- a/docs/how_to_download_models_en.md +++ b/old_docs/how_to_download_models_en.md @@ -1,15 +1,17 @@ Model downloads are divided into initial downloads and updates to the model directory. Please refer to the corresponding documentation for instructions on how to proceed. - # Initial download of model files ### 1. Download the Model from Hugging Face + Use a Python Script to Download Model Files from Hugging Face + ```bash pip install huggingface_hub wget https://github.com/opendatalab/MinerU/raw/master/docs/download_models_hf.py -O download_models_hf.py python download_models_hf.py ``` + The Python script will automatically download the model files and configure the model directory in the configuration file. The configuration file can be found in the user directory, with the filename `magic-pdf.json`. @@ -18,7 +20,7 @@ The configuration file can be found in the user directory, with the filename `ma ## 1. 
Models downloaded via Git LFS ->Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended. +> Due to feedback from some users that downloading model files using git lfs was incomplete or resulted in corrupted model files, this method is no longer recommended. If you previously downloaded model files via git lfs, you can navigate to the previous download directory and use the `git pull` command to update the model. diff --git a/docs/how_to_download_models_zh_cn.md b/old_docs/how_to_download_models_zh_cn.md similarity index 84% rename from docs/how_to_download_models_zh_cn.md rename to old_docs/how_to_download_models_zh_cn.md index dc4be844..6b0db501 100644 --- a/docs/how_to_download_models_zh_cn.md +++ b/old_docs/how_to_download_models_zh_cn.md @@ -10,7 +10,6 @@
pip install huggingface_hub
 wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models_hf.py -O download_models_hf.py
 python download_models_hf.py
-

python脚本执行完毕后,会输出模型下载目录
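The download scripts referenced here (moved to `old_docs/` elsewhere in this diff) end by fetching `magic-pdf.template.json` and patching the downloaded model paths into `magic-pdf.json` in the home directory. Below is a minimal, network-free sketch of that config-patching step; the helper name and the concrete paths are illustrative, while the key names (`models-dir`, `layoutreader-model-dir`) come from the diff:

```python
import json
import os
import tempfile


def modify_json_config(template_path, out_path, modifications):
    # Load the template config, patch in the model paths, write the result.
    with open(template_path, encoding='utf-8') as f:
        config = json.load(f)
    config.update(modifications)
    with open(out_path, 'w', encoding='utf-8') as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
    return config


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for magic-pdf.template.json, which the real scripts fetch over HTTP.
    template_path = os.path.join(tmp, 'magic-pdf.template.json')
    with open(template_path, 'w', encoding='utf-8') as f:
        json.dump({'models-dir': '/tmp/models', 'device-mode': 'cpu'}, f)

    config = modify_json_config(
        template_path,
        os.path.join(tmp, 'magic-pdf.json'),
        {
            'models-dir': '/home/user/models',
            'layoutreader-model-dir': '/home/user/layoutreader',
        },
    )
    print(config['models-dir'])  # → /home/user/models
```

The real scripts additionally resolve the model directories with `snapshot_download` and fetch the template with `requests`; only the patch-and-write step is reproduced here.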

## 方法二:从 ModelScope 下载模型 @@ -22,26 +21,27 @@ pip install modelscope wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py -O download_models.py python download_models.py ``` + python脚本会自动下载模型文件并配置好配置文件中的模型目录 配置文件可以在用户目录中找到,文件名为`magic-pdf.json` -> windows的用户目录为 "C:\\Users\\用户名", linux用户目录为 "/home/用户名", macOS用户目录为 "/Users/用户名" +> windows的用户目录为 "C:\\Users\\用户名", linux用户目录为 "/home/用户名", macOS用户目录为 "/Users/用户名" # 此前下载过模型,如何更新 ## 1. 通过git lfs下载过模型 ->由于部分用户反馈通过git lfs下载模型文件遇到下载不全和模型文件损坏情况,现已不推荐使用该方式下载。 +> 由于部分用户反馈通过git lfs下载模型文件遇到下载不全和模型文件损坏情况,现已不推荐使用该方式下载。 如此前通过 git lfs 下载过模型文件,可以进入到之前的下载目录中,通过`git pull`命令更新模型。 > 0.9.x及以后版本由于新增layout排序模型,且该模型和此前的模型不在同一仓库,不能通过`git pull`命令更新,需要单独下载。 -> ->``` ->from modelscope import snapshot_download ->snapshot_download('ppaanngggg/layoutreader') ->``` +> +> ``` +> from modelscope import snapshot_download +> snapshot_download('ppaanngggg/layoutreader') +> ``` ## 2. 通过 Hugging Face 或 Model Scope 下载过模型 diff --git a/docs/images/MinerU-logo-hq.png b/old_docs/images/MinerU-logo-hq.png similarity index 100% rename from docs/images/MinerU-logo-hq.png rename to old_docs/images/MinerU-logo-hq.png diff --git a/old_docs/images/MinerU-logo.png b/old_docs/images/MinerU-logo.png new file mode 100644 index 00000000..2e6fdf3a Binary files /dev/null and b/old_docs/images/MinerU-logo.png differ diff --git a/docs/images/datalab_logo.png b/old_docs/images/datalab_logo.png similarity index 100% rename from docs/images/datalab_logo.png rename to old_docs/images/datalab_logo.png diff --git a/docs/images/flowchart_en.png b/old_docs/images/flowchart_en.png similarity index 100% rename from docs/images/flowchart_en.png rename to old_docs/images/flowchart_en.png diff --git a/docs/images/flowchart_zh_cn.png b/old_docs/images/flowchart_zh_cn.png similarity index 100% rename from docs/images/flowchart_zh_cn.png rename to old_docs/images/flowchart_zh_cn.png diff --git a/docs/images/layout_example.png b/old_docs/images/layout_example.png similarity 
index 100% rename from docs/images/layout_example.png rename to old_docs/images/layout_example.png diff --git a/docs/images/poly.png b/old_docs/images/poly.png similarity index 100% rename from docs/images/poly.png rename to old_docs/images/poly.png diff --git a/docs/images/project_panorama_en.png b/old_docs/images/project_panorama_en.png similarity index 100% rename from docs/images/project_panorama_en.png rename to old_docs/images/project_panorama_en.png diff --git a/docs/images/project_panorama_zh_cn.png b/old_docs/images/project_panorama_zh_cn.png similarity index 100% rename from docs/images/project_panorama_zh_cn.png rename to old_docs/images/project_panorama_zh_cn.png diff --git a/docs/images/spans_example.png b/old_docs/images/spans_example.png similarity index 100% rename from docs/images/spans_example.png rename to old_docs/images/spans_example.png diff --git a/docs/images/web_demo_1.png b/old_docs/images/web_demo_1.png similarity index 100% rename from docs/images/web_demo_1.png rename to old_docs/images/web_demo_1.png diff --git a/docs/output_file_en_us.md b/old_docs/output_file_en_us.md similarity index 100% rename from docs/output_file_en_us.md rename to old_docs/output_file_en_us.md diff --git a/docs/output_file_zh_cn.md b/old_docs/output_file_zh_cn.md similarity index 100% rename from docs/output_file_zh_cn.md rename to old_docs/output_file_zh_cn.md
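As an overview of the new paragraph-splitting pipeline added above: `para_split` deep-copies each page's `preproc_blocks`, `__process_blocks` cuts the block sequence into groups at titles and display equations, and merging then happens only inside a group. The grouping step, whose tail is visible at the top of this diff, can be sketched roughly as follows (a simplified reconstruction for illustration, not the exact implementation):

```python
def group_text_blocks(blocks):
    """Split blocks into groups of consecutive text blocks.

    A group is closed whenever the next block is a 'title' or an
    'interline_equation', so later paragraph merging never crosses
    a heading or a displayed formula.
    """
    result = []
    current_group = []
    for i, block in enumerate(blocks):
        if block['type'] == 'text':
            current_group.append(block)
        # Close the group when the next block is a title or display equation.
        next_block = blocks[i + 1] if i + 1 < len(blocks) else None
        if next_block and next_block['type'] in ('title', 'interline_equation'):
            if current_group:
                result.append(current_group)
                current_group = []
    if current_group:
        result.append(current_group)
    return result


blocks = [
    {'type': 'text', 'id': 0},
    {'type': 'text', 'id': 1},
    {'type': 'title', 'id': 2},
    {'type': 'text', 'id': 3},
]
print([[b['id'] for b in g] for g in group_text_blocks(blocks)])  # → [[0, 1], [3]]
```

Cutting groups at titles and interline equations is what keeps `__merge_2_text_blocks` from gluing a paragraph onto the heading or formula above it.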