Describe the bug
When running partition on a two-column PDF, text extraction puts characters in the wrong position.
To Reproduce
two_col.pdf
Provide a code snippet that reproduces the issue.
from unstructured.partition.auto import partition
elements = partition("two_col.pdf", strategy="fast")
text attribute of elements[2] = '1. Exchange of Information. The parties agree to exchange Confidential Information for the purpose of (the evaluating a potential business "Purpose") in accordance with this Agreement.'
text attribute of elements[3] = 'relationship'
Actual text from the PDF = '1.Exchange of Information. The parties agree to exchange Confidential Information for the purpose of evaluating a potential business relationship (the "Purpose") in accordance with this Agreement.'
two_col.json
Expected behavior
Extracted text matches the actual text
Screenshots
Environment Info
Please run python scripts/collect_env.py and paste the output here.
OS version: macOS-14.5-arm64-arm-64bit
Python version: 3.9.6
unstructured version: 0.14.9
unstructured-inference version: 0.7.36
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.41
magic file from /usr/share/file/magic
LibreOffice version: ==> libreoffice: 24.2.4
Looks like this is a problem with the underlying use of the pdfminer library. The data returned by the get_text() method of the pdfminer.layout.LTTextBoxHorizontal object in pdf.py is wrong.
This appears to be related to the document being justified, which leaves larger spaces between words. The issue appears to come from the implementation of find_neighbors in the pdfminer layout code. To some extent this can be controlled by the LAParams initialized in init_pdfminer. Other libraries such as PyPDF and (Java) PDFBox handle this document with no issue or special configuration.
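To illustrate how justified spacing can interact with LAParams, here is a minimal pure-Python sketch of a horizontal grouping rule of the kind pdfminer applies: two neighbouring characters stay on the same line only while the gap between them is smaller than char_margin times the character width (char_margin defaults to 2.0 in pdfminer.six). This is a simplified model, not pdfminer's code; Glyph, make_glyphs and group_into_lines are hypothetical names, and the widths and gaps are made-up values. It does not model find_neighbors or line_margin, so it may not be the exact mechanism behind this report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One character with its horizontal extent (hypothetical toy type)."""
    text: str
    x0: float
    x1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0


def make_glyphs(words: List[str], word_gap: float, char_width: float = 5.0) -> List[Glyph]:
    """Lay words out left to right with a fixed gap between words."""
    glyphs = []
    x = 0.0
    for word in words:
        for ch in word:
            glyphs.append(Glyph(ch, x, x + char_width))
            x += char_width
        x += word_gap
    return glyphs


def group_into_lines(glyphs: List[Glyph], char_margin: float) -> List[str]:
    """Split the run wherever the gap reaches char_margin * character width.

    Smaller positive gaps become a single space (a stand-in for word_margin).
    """
    lines = []
    current = ""
    prev = None
    for g in glyphs:
        if prev is not None:
            gap = g.x0 - prev.x1
            if gap >= max(prev.width, g.width) * char_margin:
                lines.append(current)  # gap too wide: start a new fragment
                current = ""
            elif gap > 0:
                current += " "
        current += g.text
        prev = g
    if current:
        lines.append(current)
    return lines


words = ["potential", "business", "relationship"]
normal = make_glyphs(words, word_gap=3.0)      # ordinary word spacing
justified = make_glyphs(words, word_gap=12.0)  # stretched by justification

print(group_into_lines(normal, char_margin=2.0))
print(group_into_lines(justified, char_margin=2.0))
print(group_into_lines(justified, char_margin=3.0))
```

With the stretched gaps the default margin breaks the line into separate fragments, which a later grouping step is then free to reorder; a larger char_margin keeps the line together. In pdfminer.six the corresponding knob is the char_margin argument of pdfminer.layout.LAParams, though the value that suits a given document has to be found by experiment.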