Encountered a 'content' error when processing a file #1686

Open
ledi-1002 opened this issue Feb 14, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@ledi-1002

Description of the bug

Processing a file from the command line produces the following output:

import tensorrt_llm failed, if do not use tensorrt, ignore this message
import lmdeploy failed, if do not use lmdeploy, ignore this message
2025-02-14 10:00:32.114 | INFO     | magic_pdf.data.dataset:__init__:156 - lang: None
2025-02-14 10:00:45.054 | INFO     | magic_pdf.libs.pdf_check:detect_invalid_chars:57 - cid_count: 0, text_len: 10, cid_chars_radio: 0.0
2025-02-14 10:00:45.055 | WARNING  | magic_pdf.filter.pdf_classify_by_type:classify:334 - pdf is not classified by area and text_len, by_image_area: False, by_text: False, by_avg_words: False, by_img_num: True, by_text_layout: False, by_img_narrow_strips: True, by_invalid_chars: True
2025-02-14 10:00:45.056 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:78 - DocAnalysis init, this may take some times, layout_model: doclayout_yolo, apply_formula: True, apply_ocr: True, apply_table: True, table_model: rapid_table, lang: None
2025-02-14 10:00:45.056 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:99 - using device: cuda
2025-02-14 10:00:45.056 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:103 - using models_dir: /root/.cache/modelscope/hub/OpenDataLab/PDF-Extract-Kit-1___0/models
CustomVisionEncoderDecoderModel init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
VariableUnimerNetModel init
VariableUnimerNetPatchEmbeddings init
CustomMBartForCausalLM init
CustomMBartDecoder init
2025-02-14 10:00:55,092 - DownloadModel - DEBUG: /root/data/conda_envs/mineru110/lib/python3.10/site-packages/rapid_table/models/slanet-plus.onnx already exists
[2025-02-14 10:00:55,092] [   DEBUG] download_model.py:34 - /root/data/conda_envs/mineru110/lib/python3.10/site-packages/rapid_table/models/slanet-plus.onnx already exists
2025-02-14 10:00:57.491 | INFO     | magic_pdf.model.pdf_extract_kit:__init__:181 - DocAnalysis init done!
2025-02-14 10:00:57.492 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:custom_model_init:141 - model init cost: 12.437031745910645
2025-02-14 10:00:57.492 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:193 - gpu_memory: 24 GB, batch_ratio: 16
2025-02-14 10:01:01.440 | INFO     | magic_pdf.model.batch_analyze:__call__:74 - layout time: 1.99, image num: 22
2025-02-14 10:01:02.956 | INFO     | magic_pdf.model.batch_analyze:__call__:85 - mfd time: 1.52, image num: 22
2025-02-14 10:01:10.189 | INFO     | magic_pdf.model.batch_analyze:__call__:100 - mfr time: 7.23, image num: 104
2025-02-14 10:01:21.334 | INFO     | magic_pdf.model.batch_analyze:__call__:193 - ocr time: 11.01, image num: 340
2025-02-14 10:01:21.335 | INFO     | magic_pdf.model.batch_analyze:__call__:197 - table time: 0.0, image num: 0
2025-02-14 10:01:22.604 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:247 - gc time: 0.3
2025-02-14 10:01:22.604 | INFO     | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:251 - doc analyze time: 25.11, speed: 0.88 pages/second
2025-02-14 10:01:22.954 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 0, last_page_cost_time: 0.0
2025-02-14 10:01:23.832 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 1, last_page_cost_time: 0.88
2025-02-14 10:01:23.884 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 2, last_page_cost_time: 0.05
2025-02-14 10:01:23.936 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 3, last_page_cost_time: 0.05
2025-02-14 10:01:23.987 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 4, last_page_cost_time: 0.05
2025-02-14 10:01:24.039 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 5, last_page_cost_time: 0.05
2025-02-14 10:01:24.087 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 6, last_page_cost_time: 0.05
2025-02-14 10:01:24.148 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 7, last_page_cost_time: 0.06
2025-02-14 10:01:24.262 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 8, last_page_cost_time: 0.11
2025-02-14 10:01:24.308 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 9, last_page_cost_time: 0.05
2025-02-14 10:01:24.463 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 10, last_page_cost_time: 0.15
2025-02-14 10:01:24.511 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 11, last_page_cost_time: 0.05
2025-02-14 10:01:24.562 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 12, last_page_cost_time: 0.05
2025-02-14 10:01:24.610 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 13, last_page_cost_time: 0.05
2025-02-14 10:01:24.660 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 14, last_page_cost_time: 0.05
2025-02-14 10:01:24.714 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 15, last_page_cost_time: 0.05
2025-02-14 10:01:24.762 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 16, last_page_cost_time: 0.05
2025-02-14 10:01:24.899 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 17, last_page_cost_time: 0.14
2025-02-14 10:01:24.960 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 18, last_page_cost_time: 0.06
2025-02-14 10:01:25.019 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 19, last_page_cost_time: 0.06
2025-02-14 10:01:25.098 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 20, last_page_cost_time: 0.08
2025-02-14 10:01:25.155 | INFO     | magic_pdf.pdf_parse_union_core_v2:pdf_parse_union:931 - page_id: 21, last_page_cost_time: 0.06
2025-02-14 10:01:25.257 | ERROR    | magic_pdf.tools.cli:parse_doc:130 - 'content'
Traceback (most recent call last):

  File "/root/data/conda_envs/mineru110/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │   │    └ <Command cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7f805d537490>
           └ <Command cli>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7f805d903310>
         │    └ <function Command.invoke at 0x7f805d537f40>
         └ <Command cli>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': '/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique...
           │   │      │    │           └ <click.core.Context object at 0x7f805d903310>
           │   │      │    └ <function cli at 0x7f7de47fef80>
           │   │      └ <Command cli>
           │   └ <function Context.invoke at 0x7f805d536cb0>
           └ <click.core.Context object at 0x7f805d903310>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': '/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique...
                       └ ()
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 137, in cli
    parse_doc(Path(path))
    │         │    └ '/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique_ID_81632...
    │         └ <class 'pathlib.Path'>
    └ <function cli.<locals>.parse_doc at 0x7f805da03d90>
> File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 117, in parse_doc
    do_parse(
    └ <function do_parse at 0x7f7de47fe7a0>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 138, in do_parse
    pipe_result = infer_result.pipe_ocr_mode(
                  │            └ <function InferenceResult.pipe_ocr_mode at 0x7f7de47fe440>
                  └ <magic_pdf.operators.models.InferenceResult object at 0x7f7dd5f72200>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 144, in pipe_ocr_mode
    res = self.apply(
          │    └ <function InferenceResult.apply at 0x7f7de47fe320>
          └ <magic_pdf.operators.models.InferenceResult object at 0x7f7dd5f72200>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 70, in apply
    return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
           │    │    │        │    │             │       └ {'start_page_id': 0, 'end_page_id': None, 'debug_mode': True, 'lang': None}
           │    │    │        │    │             └ (<magic_pdf.data.dataset.PymuDocDataset object at 0x7f7de47e3f70>, <magic_pdf.data.data_reader_writer.filebase.FileBasedDataW...
           │    │    │        │    └ [{'layout_dets': [{'category_id': 2, 'poly': [176, 285, 762, 285, 762, 330, 176, 330], 'score': 0.817}, {'category_id': 0, 'p...
           │    │    │        └ <magic_pdf.operators.models.InferenceResult object at 0x7f7dd5f72200>
           │    │    └ <function deepcopy at 0x7f805d2cd6c0>
           │    └ <module 'copy' from '/root/data/conda_envs/mineru110/lib/python3.10/copy.py'>
           └ <function InferenceResult.pipe_ocr_mode.<locals>.proc at 0x7f7dd677a560>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/operators/models.py", line 137, in proc
    res = pdf_parse_union(*args, **kwargs)
          │                │       └ {'start_page_id': 0, 'end_page_id': None, 'debug_mode': True, 'lang': None}
          │                └ ([{'layout_dets': [{'category_id': 2, 'poly': [176, 285, 762, 285, 762, 330, 176, 330], 'score': 0.817, 'bbox': [63, 102, 274...
          └ <function pdf_parse_union at 0x7f7de47fe050>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 951, in pdf_parse_union
    para_split(pdf_info_dict)
    │          └ {'page_0': {'preproc_blocks': [{'type': 'title', 'bbox': [131, 283, 463, 352], 'lines': [{'bbox': [206, 288, 386, 311], 'span...
    └ <function para_split at 0x7f7de4fa53f0>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/post_proc/para_split_v3.py", line 378, in para_split
    __para_merge_page(all_blocks)
    │                 └ [{'type': 'title', 'bbox': [131, 283, 463, 352], 'lines': [{'bbox': [206, 288, 386, 311], 'spans': [{'bbox': [206, 288, 386, ...
    └ <function __para_merge_page at 0x7f7de4fa5360>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/post_proc/para_split_v3.py", line 355, in __para_merge_page
    __merge_2_text_blocks(current_block, prev_block)
    │                     │              └ {'type': 'text', 'bbox': [66, 83, 490, 125], 'lines': [{'bbox': [65, 86, 487, 103], 'spans': [{'bbox': [65, 86, 487, 103], 's...
    │                     └ {'type': 'text', 'bbox': [84, 126, 391, 143], 'lines': [{'bbox': [85, 129, 388, 142], 'spans': [{'bbox': [85, 129, 388, 142],...
    └ <function __merge_2_text_blocks at 0x7f7de4fa51b0>
  File "/root/data/conda_envs/mineru110/lib/python3.10/site-packages/magic_pdf/post_proc/para_split_v3.py", line 288, in __merge_2_text_blocks
    and not last_span['content'].endswith(LINE_STOP_FLAG)
            │                             └ ('.', '!', '?', '。', '!', '?', ')', ')', '"', '”', ':', ':', ';', ';')
            └ {'bbox': [343, 104, 488, 131], 'score': 0.103, 'type': 'image', 'image_path': '6d15182824ff8aedbc961f54db71dd13e7aa7364783d1e...

KeyError: 'content'
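
The last frame of the traceback shows that `last_span` is a dict with `'type': 'image'` and no `'content'` key. The failure mode can be reproduced in isolation with a minimal sketch. The span below is hand-made and modeled on the values visible in the traceback; the `image_path` value is a placeholder, `LINE_STOP_FLAG` is an abbreviated copy of the tuple shown above, and `ends_without_stop` is a hypothetical helper name, not a function from magic_pdf.

```python
# Abbreviated copy of the stop-character tuple shown in the traceback.
LINE_STOP_FLAG = ('.', '!', '?', '。', ':', ';')

# Hand-made span modeled on the last traceback frame.
# 'image_path' is a placeholder, not the real (truncated) value.
last_span = {
    'bbox': [343, 104, 488, 131],
    'score': 0.103,
    'type': 'image',
    'image_path': 'placeholder.jpg',
}


def ends_without_stop(span):
    # Mirrors the failing expression: it assumes every span has 'content'.
    return not span['content'].endswith(LINE_STOP_FLAG)


try:
    ends_without_stop(last_span)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))  # KeyError('content')
```

Image spans carry an `image_path` instead of `content`, so any code path that treats the last span of a text block as text will fail this way.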

How to reproduce the bug

Run the file from the command line:
magic-pdf -p "/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO/unique_ID_81632_unique_第三部分.pdf" -o "/root/data/jhr/mba_pdf_clean_copy/题库/历年考题/2012SRO复训/考题 在 ndnp-104802.gnpjvc.cgnpc.com.cn 上/培训组题库筛选/田湾2008SRO" -m auto

After running the command, the same log output and KeyError: 'content' traceback shown in the description above appear.

Version 1.1.0

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

1.0.x

Device mode

cuda

ledi-1002 added the bug label Feb 14, 2025
@ledi-1002
Author

I also cannot upload the file, because it is confidential. Having looked at its contents, though, text extraction should be possible.

@myhloli
Collaborator

myhloli commented Feb 14, 2025

At first glance, this looks like it is caused by an image span inside a text block, but I need the PDF file to debug it. Could you send it to me privately?
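
To check this hypothesis without sharing the PDF, the parsed blocks could be scanned for spans that lack a `'content'` key. The following is a minimal sketch, not the project's actual fix. It assumes the block layout visible in the traceback (`'type'`, `'lines'`, `'spans'`); `find_spans_without_content` and `ends_with_stop` are hypothetical helper names, and the sample data is hand-made. Treating a span without text as "does not end with a stop character" is also an assumption about the desired behavior.

```python
def find_spans_without_content(blocks):
    """Return (block_idx, line_idx, span_idx, span_type) for every span
    inside a 'text' block that has no 'content' key."""
    hits = []
    for b_idx, block in enumerate(blocks):
        if block.get('type') != 'text':
            continue
        for l_idx, line in enumerate(block.get('lines', [])):
            for s_idx, span in enumerate(line.get('spans', [])):
                if 'content' not in span:
                    hits.append((b_idx, l_idx, s_idx, span.get('type')))
    return hits


def ends_with_stop(span, stop_flags):
    """Defensive variant of the failing check: spans without string
    content (for example image spans) simply return False."""
    content = span.get('content')
    return isinstance(content, str) and content.endswith(stop_flags)


# Hand-made sample: a text block whose last span is an image.
sample_blocks = [
    {
        'type': 'text',
        'lines': [
            {
                'spans': [
                    {'type': 'text', 'content': 'abc'},
                    {'type': 'image', 'image_path': 'placeholder.jpg'},
                ]
            }
        ],
    },
    # Non-text blocks are skipped by the scan.
    {'type': 'image', 'lines': [{'spans': [{'type': 'image'}]}]},
]

hits = find_spans_without_content(sample_blocks)
print(hits)  # [(0, 0, 1, 'image')]
```

Running a scan like this on the intermediate parse result would report which text blocks contain image spans, which may be enough to confirm the diagnosis when the source file cannot be shared.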

@ledi-1002
Author

> At first glance, this looks like it is caused by an image span inside a text block, but I need the PDF file to debug it. Could you send it to me privately?

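Sorry, the file involves privacy and confidentiality concerns, so I cannot share it with you. If I run into the same problem with another file that has no such restrictions, I will send you that one. Would that be all right?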
