Commit

Merge pull request #743 from myhloli/para-split-v3
refactor(para_split_v3): merge list and index block detection
myhloli authored Oct 15, 2024
2 parents 702b6ac + 244b868 commit 0d83fb7
Showing 3 changed files with 111 additions and 82 deletions.
2 changes: 2 additions & 0 deletions magic_pdf/libs/draw_bbox.py
@@ -237,6 +237,8 @@ def get_span_info(span):
     BlockType.Text,
     BlockType.Title,
     BlockType.InterlineEquation,
+    BlockType.List,
+    BlockType.Index,
 ]:
     for line in block['lines']:
         for span in line['spans']:
189 changes: 108 additions & 81 deletions magic_pdf/para/para_split_v3.py

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion magic_pdf/pdf_parse_union_core_v2.py
@@ -360,7 +360,7 @@ def parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter,
 need_drop, drop_reason)

 '''Fill spans into blocks'''
-block_with_spans, spans = fill_spans_in_blocks(all_bboxes, spans, 0.3)
+block_with_spans, spans = fill_spans_in_blocks(all_bboxes, spans, 0.5)

 '''Apply fix operations to the blocks'''
 fix_blocks = fix_block_spans(block_with_spans, img_blocks, table_blocks)
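
Note on the threshold change: the diff does not show the body of `fill_spans_in_blocks`, so the following is only a sketch of what raising the third argument from 0.3 to 0.5 would do, assuming that argument is a minimum overlap ratio (the share of a span's area that falls inside a block). The helpers `overlap_ratio` and `assign_spans` are hypothetical names used for illustration and are not part of `magic_pdf`.

```python
def overlap_ratio(span_bbox, block_bbox):
    """Return the fraction of the span's area that lies inside the block.

    Boxes are (x0, y0, x1, y1) tuples. Hypothetical helper, not magic_pdf API.
    """
    sx0, sy0, sx1, sy1 = span_bbox
    bx0, by0, bx1, by1 = block_bbox
    span_area = (sx1 - sx0) * (sy1 - sy0)
    if span_area <= 0:
        return 0.0
    # Intersection rectangle of the two boxes.
    ix0, iy0 = max(sx0, bx0), max(sy0, by0)
    ix1, iy1 = min(sx1, bx1), min(sy1, by1)
    if ix1 <= ix0 or iy1 <= iy0:
        return 0.0
    return (ix1 - ix0) * (iy1 - iy0) / span_area


def assign_spans(blocks, spans, threshold):
    """Attach each span to the first block it overlaps by more than threshold.

    Returns (assigned, leftover): a dict of block index -> spans, and the
    spans that matched no block. Assumed behaviour, for illustration only.
    """
    assigned = {i: [] for i in range(len(blocks))}
    leftover = []
    for span in spans:
        for i, block in enumerate(blocks):
            if overlap_ratio(span, block) > threshold:
                assigned[i].append(span)
                break
        else:
            leftover.append(span)
    return assigned, leftover


blocks = [(0, 0, 100, 100)]
spans = [(84, 0, 124, 10)]  # 40% of this span lies inside the block

loose, _ = assign_spans(blocks, spans, 0.3)
strict, unmatched = assign_spans(blocks, spans, 0.5)
print(len(loose[0]), len(strict[0]), len(unmatched))
```

Under that assumption, a span that straddles a block boundary with 40% of its area inside is kept at 0.3 but left unassigned at 0.5, so the stricter value reduces the chance of a span being pulled into a neighbouring block.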