feat:add layoutreader to sort blocks #672

Merged
merged 20 commits into from
Sep 30, 2024

Collaborator

@myhloli myhloli commented Sep 27, 2024


Implement a new function `draw_layout_sort_bbox` in `draw_bbox.py` to visualize the
layout sorting results using the `LayoutLMv3ForTokenClassification` model. This function
predicts the order of layout elements and draws them in the sorted sequence on the PDF pages.
- Added CUDA cache clearing after layoutreader prediction to free up GPU memory.
- Modified the bbox sorting logic to sort text and title blocks separately.
- Adjusted drawing colors for better distinction in debug visualizations.
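The cache-clearing step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper name `free_gpu_memory` is hypothetical, and the guarded import only exists so the snippet also runs on machines without PyTorch.

```python
import gc


def free_gpu_memory() -> bool:
    """Release cached GPU memory after a prediction (hypothetical helper).

    Returns True if a CUDA cache was cleared, False otherwise.
    """
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch  # assumption: PyTorch is the backend, as in the PR
    except ImportError:
        return False
    if torch.cuda.is_available():
        # Return cached, currently unused blocks to the CUDA driver.
        torch.cuda.empty_cache()
        return True
    return False


cleared = free_gpu_memory()
```

Note that `torch.cuda.empty_cache()` only releases memory that PyTorch has cached but is no longer using; tensors that are still referenced stay allocated.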
…lace the heuristic-based block ordering algorithm with LayoutLMv3 model predictions to
improve the accuracy of block ordering on PDF pages. Additionally, refactor the span
handling during block filling to ensure spans are correctly assigned.

- Introduce LayoutLMv3ForTokenClassification from 'hantian/layoutreader' to predict block
  order.
- Implement span replacement strategy to use pymu spans for non-OCR content.
- Enhance cleanup process to free GPU memory more effectively after model use.
- Adjust block ordering logic to use median line index for text, title, and interline equation blocks.
- Refactor page parsing core logic for better maintainability.
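The median-line-index ordering described in the list above might look like the sketch below. The dictionary layout (`lines`, `index`, `type`) and the function name `order_blocks_by_median_line` are assumptions made for illustration; the sketch also assumes every block has at least one line that already carries a reading-order index from the model.

```python
from statistics import median


def order_blocks_by_median_line(blocks):
    """Give each block the median of its line indices, then sort by it."""
    for block in blocks:
        line_indices = [line["index"] for line in block["lines"]]
        # The median is less sensitive to a single mis-ordered line
        # than the first or last line index would be.
        block["index"] = median(line_indices)
    return sorted(blocks, key=lambda block: block["index"])


blocks = [
    {"type": "text", "lines": [{"index": 3}, {"index": 4}, {"index": 5}]},
    {"type": "title", "lines": [{"index": 0}]},
    {"type": "interline_equation", "lines": [{"index": 1}, {"index": 2}]},
]
ordered = order_blocks_by_median_line(blocks)
ordered_types = [block["type"] for block in ordered]
```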

BREAKING CHANGE: The integration of LayoutLMv3 changes the internal block handling and
ordering mechanism, which may affect downstream systems relying on the previous
implementation. Be sure to test thoroughly before deployment.
Add a new function `draw_line_sort_bbox` to visualize the sorting of lines on each page.
This includes indexing lines and handling both text and non-text elements such as tables
and images for better content organization.

Also, comment out GPU-related code for flexibility and remove overlaps in bounding box
detection, which improves the accuracy of layout splitting.
Remove debug code related to layout bbox visualization and adjust drawing functions to
support optional line sorting bboxes. This change includes the removal of `draw_layout_bbox`
function and updates to `draw_bbox_with_number` to support variable line width for bbox drawing.
Introduce an additional argument `draw_bbox` in the `draw_bbox_with_number` function to
enable toggling the drawing of bounding boxes on or off. When set to `False`, no bounding
box will be drawn, allowing for situations where only text
…treader

# Conflicts:
#	magic_pdf/libs/draw_bbox.py
Refactor the draw bbox functions by removing unused imports and simplifying the
code logic for drawing layout and line sorting bounding boxes. Adjust the debug
configuration to enable content list dumping and disable markdown making mode.
…hin class

Refactored model initialization to be handled by a singleton class to ensure that model
instances are reused across calls, avoiding redundant initializations. Removed logger
information that was commented out and ensured consistency in logging behavior.
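A singleton that caches model instances, as described above, could be sketched like this. The class name `ModelSingleton`, the `get_model` method, and the loader callback are hypothetical; a counting dummy loader stands in for the real model construction.

```python
class ModelSingleton:
    """Process-wide holder that builds each model at most once (sketch)."""

    _instance = None
    _models = {}

    def __new__(cls):
        # Always hand back the same object, creating it on first use.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, name, loader):
        # Run the (possibly expensive) loader only on the first request.
        if name not in self._models:
            self._models[name] = loader()
        return self._models[name]


load_calls = []


def load_dummy_model():
    # Stand-in for real model loading; records how often it runs.
    load_calls.append(1)
    return object()


first = ModelSingleton().get_model("layoutreader", load_dummy_model)
second = ModelSingleton().get_model("layoutreader", load_dummy_model)
```

Both calls return the same object, and the loader runs only once, which is the reuse behavior the commit message describes.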
Introduce torch and transformers libraries to support new ML features. Ensure version compatibility by adding torch version within the range 2.2.2 to 2.3.1 and include the necessary transformers library.
…awing

Removed legacy commented-out code related to layout_bbox_list from draw_bbox.py, which
was used for diagnostic purposes and was no longer necessary. This change streamlines
the codebase and clarifies the drawing process of bounding boxes on PDF pages. The update
also adjusts the order of operations slightly for improved readability without altering
the functionality.
…xing

Removed redundant sorting of lines by model and optimized calculation of block
indexes by using a single pass through the sorted lines. This change simplifies the
code and potentially improves performance by reducing the number of sorting operations and unnecessary iterations over blocks without lines.
…ible devices

Blocks without lines are now correctly indexed even when they contain textual content rendered
as images. The sorting logic has been updated to accommodate this scenario. Additionally, the
LayoutLMv3 model initialization has been enhanced to utilize bfloat16 precision on devices that
support it, offering potential performance benefits on supported hardware.
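Choosing bfloat16 only where the hardware supports it can be sketched as below. The function name `pick_model_dtype` is hypothetical and this is not the PR's actual initialization code; the import is guarded so the snippet also runs where PyTorch is not installed, in which case it returns `None`.

```python
def pick_model_dtype():
    """Return torch.bfloat16 on capable CUDA devices, else torch.float32."""
    try:
        import torch  # assumption: the model is loaded through PyTorch
    except ImportError:
        return None
    # bfloat16 is only worthwhile when the GPU supports it natively.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float32


dtype = pick_model_dtype()
```

The returned dtype would then be applied when moving the model to the device, for example with `model.to(dtype)` on a loaded `torch.nn.Module`.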
…kage structure

Adapt import statements in `pdf_parse_union_core_v2.py` to reflect the updated package structure, changing from the `magic_pdf.v3.helpers` module to the `magic_pdf.model.v3`
module. This ensures compatibility with the revised directory layout.
Update import statements in `pdf_parse_union_core_v2.py` to directly import
`prepare_inputs`, `boxes2inputs`, and `parse_logits` from `magic_pdf.model.v3.helpers`
instead of from `magic_pdf.model.v3`. This change streamlines the imports, making the
code more readable and maintaining a cleaner approach to modular design.
The clean_memory function has been removed from pdf_parse_union_core_v2.py due to it not being used.
This change streamlines the code and prevents potential confusion regarding its purpose.
- Insert lines into blocks based on median line height
- Calculate block index using line indices median
- Remove virtual line information for table and image blocks
- Enhance line sorting algorithm for different block types
- Add line height calculation function
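One way to read the first and last items in the list above is sketched here: compute the median height of known lines, then cut a block without lines into evenly sized virtual lines of roughly that height. The function names and the `(x0, y0, x1, y1)` bbox convention are assumptions for illustration, not the PR's implementation.

```python
from statistics import median


def line_height(bbox):
    """Height of an (x0, y0, x1, y1) bounding box."""
    return bbox[3] - bbox[1]


def split_block_into_lines(block_bbox, median_height):
    """Cut a block into virtual lines of about the median line height."""
    x0, y0, x1, y1 = block_bbox
    block_height = y1 - y0
    if median_height <= 0 or block_height <= 0:
        return [block_bbox]
    # Use at least one line, and as many as fit at the median height.
    count = max(1, round(block_height / median_height))
    step = block_height / count
    return [(x0, y0 + i * step, x1, y0 + (i + 1) * step) for i in range(count)]


known_lines = [(0, 0, 100, 10), (0, 12, 100, 24), (0, 26, 100, 37)]
median_height = median(line_height(bbox) for bbox in known_lines)
virtual_lines = split_block_into_lines((0, 0, 100, 44), median_height)
```

With line heights of 10, 12, and 11 the median is 11, so a block 44 units tall is split into four virtual lines of height 11 each.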
@myhloli myhloli merged commit bcbee13 into opendatalab:dev Sep 30, 2024
3 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Sep 30, 2024
@myhloli myhloli deleted the add-layoutreader branch October 9, 2024 03:02