Trying to Extract Stylized Text from PDF #679
mphox-phoxdev
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
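Hi @mphox-phoxdev, it looks like this stems from the PDF using different font sizes to represent capital letters, which throws off the alignment of capital and lowercase characters. This tweak, adding a larger `y_tolerance`, compensates for that:

```python
import pdfplumber

pdf = pdfplumber.open("pdfs/help.pdf")
# Drop duplicated characters before extracting text
page = pdf.pages[0].dedupe_chars(tolerance=1)
# A larger y_tolerance lets characters at slightly different heights share a line
print(page.extract_text(y_tolerance=4))
```

Result: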
You'll notice that the tweak causes problems for the left side of the PDF. If that matters for your use case, you might want to parse each side separately, using …
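Here is a minimal sketch of parsing each side separately. It assumes that cropping the page with pdfplumber's `Page.crop` is a reasonable way to do it, which the text above does not confirm. The helper names `split_bboxes` and `extract_sides` and the 50/50 `split_ratio` are hypothetical, so adjust the split point to wherever the two columns actually divide in your PDF.

```python
def split_bboxes(page_bbox, split_ratio=0.5):
    """Split a page bounding box (x0, top, x1, bottom) into left and right halves.

    split_ratio is a hypothetical knob: the fraction of the page width
    at which the left side ends and the right side begins.
    """
    x0, top, x1, bottom = page_bbox
    split_x = x0 + (x1 - x0) * split_ratio
    left = (x0, top, split_x, bottom)
    right = (split_x, top, x1, bottom)
    return left, right


def extract_sides(page, split_ratio=0.5, left_kwargs=None, right_kwargs=None):
    """Crop a pdfplumber page into two sides and extract text from each.

    Each side gets its own extract_text keyword arguments, so the larger
    y_tolerance can be applied to one side only.
    """
    left_bbox, right_bbox = split_bboxes(page.bbox, split_ratio)
    left_text = page.crop(left_bbox).extract_text(**(left_kwargs or {}))
    right_text = page.crop(right_bbox).extract_text(**(right_kwargs or {}))
    return left_text, right_text


# Example usage (requires pdfplumber and the PDF from this thread):
#
#   import pdfplumber
#   with pdfplumber.open("pdfs/help.pdf") as pdf:
#       page = pdf.pages[0].dedupe_chars(tolerance=1)
#       left, right = extract_sides(page, right_kwargs={"y_tolerance": 4})
#       print(right)
```

With this arrangement the left side keeps the default tolerances, and only the right side, where the Scenario Tags block sits, uses `y_tolerance=4`.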
Hello, I'm trying to extract and parse the lines containing "Shattered Sanctuaries", "Radiant Oath", and "Verdant Wheel" from the block of text under Scenario Tags on the right.
When I run an extract-text operation, I only see the line with Verdant Wheel. What should I do differently?
returns