Description of the bug
Processing a file from the command line fails with the following output:
import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2025-02-14 10:00:32.114 | INFO | magic_pdf.data.dataset:__init__:156 - lang: None
2025-02-14 10:00:45.054 | INFO | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 10, cid_chars_radio: 0.0
2025-02-14 10:00:45.055 | WARNING | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: True, by_invalid_chars: True
2025-02-14 10:00:45.056 | INFO | magic_pdf.model.pdf_extract_kit:__init__:78 - DocAnalysis init, this may take some times, layout_model: doclayout_yolo, apply_formula: True, apply_ocr: True, apply_table: True, table_model: rapid_table, lang: None
2025-02-14 10:00:45.056 | INFO | magic_pdf.model.pdf_extract_kit:__init__:99 - using device: cuda
2025-02-14 10:00:45.056 | INFO | magic_pdf.model.pdf_extract_kit:__init__:103 - using models_dir: /root/.cache/modelscope/hub/OpenDataLab/PDF-Extract-Kit-1___0/models
CustomVisionEncoderDecoderModel init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
CustomMBartForCausalLM init
CustomMBartDecoder init
2025-02-14 10:00:55,092 - DownloadModel - DEBUG: /root/data/conda_envs/mineru110/lib/python3.10/site-packages/rapid_table/models/slanet-plus.onnx already exists
[2025-02-14 10:00:55,092] [ DEBUG] download_model.py:34 - /root/data/conda_envs/mineru110/lib/python3.10/site-packages/rapid_table/models/slanet-plus.onnx already exists
2025-02-14 10:00:57.491 | INFO | magic_pdf.model.pdf_extract_kit:__init__:181 - DocAnalysis init done!
2025-02-14 10:00:57.492 | INFO | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:141 - model init cost: 12.437031745910645
2025-02-14 10:00:57.492 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:193 - gpu_memory: 24 GB, batch_ratio: 16
2025-02-14 10:01:01.440 | INFO | magic_pdf.model.batch_analyze:__call__:74 - layout time: 1.99, image num: 22
2025-02-14 10:01:02.956 | INFO | magic_pdf.model.batch_analyze:__call__:85 - mfd time: 1.52, image num: 22
2025-02-14 10:01:10.189 | INFO | magic_pdf.model.batch_analyze:__call__:100 - mfr time: 7.23, image num: 104
2025-02-14 10:01:21.334 | INFO | magic_pdf.model.batch_analyze:__call__:193 - ocr time: 11.01, image num: 340
2025-02-14 10:01:21.335 | INFO | magic_pdf.model.batch_analyze:__call__:197 - table time: 0.0, image num: 0
2025-02-14 10:01:22.604 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:247 - gc time: 0.3
2025-02-14 10:01:22.604 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:251 - doc analyze time: 25.11, speed: 0.88 pages/second
2025-02-14 10:01:22.954 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 0, last_page_cost_time: 0.0
2025-02-14 10:01:23.832 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 1, last_page_cost_time: 0.88
2025-02-14 10:01:23.884 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 2, last_page_cost_time: 0.05
2025-02-14 10:01:23.936 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 3, last_page_cost_time: 0.05
2025-02-14 10:01:23.987 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 4, last_page_cost_time: 0.05
2025-02-14 10:01:24.039 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 5, last_page_cost_time: 0.05
2025-02-14 10:01:24.087 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 6, last_page_cost_time: 0.05
2025-02-14 10:01:24.148 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 7, last_page_cost_time: 0.06
2025-02-14 10:01:24.262 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 8, last_page_cost_time: 0.11
2025-02-14 10:01:24.308 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 9, last_page_cost_time: 0.05
2025-02-14 10:01:24.463 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 10, last_page_cost_time: 0.15
2025-02-14 10:01:24.511 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 11, last_page_cost_time: 0.05
2025-02-14 10:01:24.562 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 12, last_page_cost_time: 0.05
2025-02-14 10:01:24.610 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 13, last_page_cost_time: 0.05
2025-02-14 10:01:24.660 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 14, last_page_cost_time: 0.05
2025-02-14 10:01:24.714 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 15, last_page_cost_time: 0.05
2025-02-14 10:01:24.762 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 16, last_page_cost_time: 0.05
2025-02-14 10:01:24.899 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 17, last_page_cost_time: 0.14
2025-02-14 10:01:24.960 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 18, last_page_cost_time: 0.06
2025-02-14 10:01:25.019 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 19, last_page_cost_time: 0.06
2025-02-14 10:01:25.098 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 20, last_page_cost_time: 0.08
2025-02-14 10:01:25.155 | INFO | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 21, last_page_cost_time: 0.06
2025-02-14 10:01:25.257 | ERROR | magic_pdf.tools.cli:parse_doc:130 - 'content'
Traceback (most recent call last):
  File "/root/data/conda_envs/mineru110/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │ │ └ <Command cli>
    │ └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
    │ │ │ └ {}
    │ │ └ ()
    │ └ <function BaseCommand.main at 0x7f805d537490>
    └ <Command cli>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
    │ │ └ <click.core.Context object at 0x7f805d903310>
    │ └ <function Command.invoke at 0x7f805d537f40>
    └ <Command cli>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    │ │ │ │ │ └ {'path': '/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique...
    │ │ │ │ └ <click.core.Context object at 0x7f805d903310>
    │ │ │ └ <function cli at 0x7f7de47fef80>
    │ │ └ <Command cli>
    │ └ <function Context.invoke at 0x7f805d536cb0>
    └ <click.core.Context object at 0x7f805d903310>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
    │ └ {'path': '/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique...
    └ ()
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 137, in cli
    parse_doc(Path(path))
    │ │ └ '/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique_ID_81632...
    │ └ <class 'pathlib.Path'>
    └ <function cli.<locals>.parse_doc at 0x7f805da03d90>
> File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 117, in parse_doc
    do_parse(
    └ <function do_parse at 0x7f7de47fe7a0>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 138, in do_parse
    pipe_result = infer_result.pipe_ocr_mode(
    │ └ <function InferenceResult.pipe_ocr_mode at 0x7f7de47fe440>
    └ <magic_pdf.operators.models.InferenceResult object at 0x7f7dd5f72200>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 144, in pipe_ocr_mode
    res = self.apply(
    │ └ <function InferenceResult.apply at 0x7f7de47fe320>
    └ <magic_pdf.operators.models.InferenceResult object at 0x7f7dd5f72200>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 70, in apply
    return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
    │ │ │ │ │ │ └ {'start_page_id': 0, 'end_page_id': None, 'debug_mode': True, 'lang': None}
    │ │ │ │ │ └ (<magic_pdf.data.dataset.PymuDocDataset object at 0x7f7de47e3f70>, <magic_pdf.data.data_reader_writer.filebase.FileBasedDataW...
    │ │ │ │ └ [{'layout_dets': [{'category_id': 2, 'poly': [176, 285, 762, 285, 762, 330, 176, 330], 'score': 0.817}, {'category_id': 0, 'p...
    │ │ │ └ <magic_pdf.operators.models.InferenceResult object at 0x7f7dd5f72200>
    │ │ └ <function deepcopy at 0x7f805d2cd6c0>
    │ └ <module 'copy' from '/root/data/conda_envs/mineru110/lib/python3.10/copy.py'>
    └ <function InferenceResult.pipe_ocr_mode.<locals>.proc at 0x7f7dd677a560>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 137, in proc
    res = pdf_parse_union(*args, **kwargs)
    │ │ └ {'start_page_id': 0, 'end_page_id': None, 'debug_mode': True, 'lang': None}
    │ └ ([{'layout_dets': [{'category_id': 2, 'poly': [176, 285, 762, 285, 762, 330, 176, 330], 'score': 0.817, 'bbox': [63, 102, 274...
    └ <function pdf_parse_union at 0x7f7de47fe050>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 951, in pdf_parse_union
    para_split(pdf_info_dict)
    │ └ {'page_0': {'preproc_blocks': [{'type': 'title', 'bbox': [131, 283, 463, 352], 'lines': [{'bbox': [206, 288, 386, 311], 'span...
    └ <function para_split at 0x7f7de4fa53f0>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/post_proc/para_split_v3.py", line 378, in para_split
    __para_merge_page(all_blocks)
    │ └ [{'type': 'title', 'bbox': [131, 283, 463, 352], 'lines': [{'bbox': [206, 288, 386, 311], 'spans': [{'bbox': [206, 288, 386, ...
    └ <function __para_merge_page at 0x7f7de4fa5360>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/post_proc/para_split_v3.py", line 355, in __para_merge_page
    __merge_2_text_blocks(current_block, prev_block)
    │ │ └ {'type': 'text', 'bbox': [66, 83, 490, 125], 'lines': [{'bbox': [65, 86, 487, 103], 'spans': [{'bbox': [65, 86, 487, 103], 's...
    │ └ {'type': 'text', 'bbox': [84, 126, 391, 143], 'lines': [{'bbox': [85, 129, 388, 142], 'spans': [{'bbox': [85, 129, 388, 142],...
    └ <function __merge_2_text_blocks at 0x7f7de4fa51b0>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/post_proc/para_split_v3.py", line 288, in __merge_2_text_blocks
    and not last_span['content'].endswith(LINE_STOP_FLAG)
    │ └ ('.', '!', '?', '。', '!', '?', ')', ')', '"', '”', ':', ':', ';', ';')
    └ {'bbox': [343, 104, 488, 131], 'score': 0.103, 'type': 'image', 'image_path': '6d15182824ff8aedbc961f54db71dd13e7aa7364783d1e...
KeyError: 'content'
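The final frame of the traceback can be reproduced in isolation. The sketch below is a minimal illustration, not code from magic_pdf: `last_span` copies the span printed in the traceback (its `image_path` is truncated there, so a placeholder stands in), `LINE_STOP_FLAG` is abbreviated, and the `.get` guard at the end is a hypothetical workaround rather than the project's actual fix.

```python
# Abbreviated copy of the tuple shown in the traceback.
LINE_STOP_FLAG = ('.', '!', '?', '。', ')', '"', '”', ':', ';')

# Span copied from the traceback; 'image_path' is truncated in the log,
# so a placeholder string is used here.
last_span = {
    'bbox': [343, 104, 488, 131],
    'score': 0.103,
    'type': 'image',
    'image_path': '<truncated>',
}

# The expression from para_split_v3.py line 288 assumes every span has 'content'.
try:
    last_span['content'].endswith(LINE_STOP_FLAG)
    raised = False
    missing_key = None
except KeyError as exc:
    raised = True
    missing_key = exc.args[0]

# Hypothetical guard: treat a span without 'content' as empty text.
ends_with_stop = last_span.get('content', '').endswith(LINE_STOP_FLAG)

print(raised, missing_key, ends_with_stop)
```

An image span carries `image_path` instead of `content`, so the lookup raises before `endswith` is ever called.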
How to reproduce the bug
Run the file from the command line:
magic-pdf -p "/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique_ID_81632_unique_第三部分.pdf" -o "/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO" -m auto
The error shown above appears after running this command.
Version: 1.1.0
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
1.0.x
Device mode
cuda
I also cannot upload the file because it is confidential. Having looked at its contents, though, text extraction should be possible.
At first glance, this looks like it is caused by an image span inside a text block, but I need the PDF file to debug it. Could you send it to me privately?
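To illustrate the suspected cause, here is a small sketch that scans a list of blocks for text blocks containing spans without a `content` key. The helper name `find_non_text_spans` and the sample data are hypothetical; the nesting (block → `lines` → `spans`) is assumed from the shapes printed in the traceback, not taken from the magic_pdf source.

```python
def find_non_text_spans(blocks):
    """Return (block_idx, line_idx, span_idx, span_type) for every span
    inside a 'text' block that has no 'content' key."""
    hits = []
    for b_idx, block in enumerate(blocks):
        if block.get('type') != 'text':
            continue
        for l_idx, line in enumerate(block.get('lines', [])):
            for s_idx, span in enumerate(line.get('spans', [])):
                if 'content' not in span:
                    hits.append((b_idx, l_idx, s_idx, span.get('type')))
    return hits


# Illustrative data only: one text block whose second line holds an image span.
blocks = [
    {'type': 'text', 'bbox': [66, 83, 490, 125], 'lines': [
        {'bbox': [65, 86, 487, 103], 'spans': [
            {'bbox': [65, 86, 487, 103], 'type': 'text', 'content': 'example'},
        ]},
        {'bbox': [343, 104, 488, 131], 'spans': [
            {'bbox': [343, 104, 488, 131], 'score': 0.103,
             'type': 'image', 'image_path': '<truncated>'},
        ]},
    ]},
]

print(find_non_text_spans(blocks))  # [(0, 1, 0, 'image')]
```

A report of indices and span types like this contains no document text, so it may be shareable even when the PDF itself is not.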
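Sorry, the file involves privacy and confidentiality concerns, so I cannot share it. If I run into the same problem with another file that has no such restrictions, I will send you that one. Would that work?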