Trying to Extract Stylized Text from PDF #679

mphox-phoxdev · 2022-07-03T07:15:42Z


Hello, I'm trying to extract and parse the lines containing "Shattered Sanctuaries", "Radiant Oath", and "Verdant Wheel" from the block of text under Scenario Tags on the right.

When I run an extract text operation, I only see the line with Verdant Wheel. What should I do differently?

    # plumber_file is assumed to be a PDF already opened with pdfplumber.open(...)
    second_page = plumber_file.pages[1]
    second_text = second_page.dedupe_chars(tolerance=1).extract_text()
    for line in second_text.split("\n"):
        print(line)

returns

#####removed intentionally#####
LEVELS: 5˜8
PLAY TIME: 4˜5 HOURS
AUTHOR
 Matt Duval
DEVELOPMENT LEAD
 Mike KimmelADDITIONAL DEVELOPMENT
 Linda Zayas-Palmer
EDITING LEAD
 Solomon St. JohnEDITORS
 Simone D. Sallé and Solomon St. John
COVER ARTISTS
 Raphael Madureira and Maurice Risulmi
INTERIOR ARTISTS
 Josef Kuc˜era, Raphael Madureira, and Matias Tapia
CARTOGRAPHER
 Jason EngleART DIRECTION
 Tony Barnett
GRAPHIC DESIGN
 Justin LucasDEVELOPMENT MANAGER
 Linda Zayas-Palmer
ORGANIZED PLAY COORDINATOR
 Alex Speidel
CREATIVE DIRECTOR
 James JacobsDIRECTOR OF COMMUNITY
 Tonya Woldridge
DIRECTOR OF GAME DEVELOPMENT
 Adam Daigle
PUBLISHER Erik MonaAdventure
  
.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 3Appendix 1: Level 5Œ6 Encounters 
.  .  .  .  .  .  .  .  .  .  .  .  .  .  . 20Appendix 2: Level 7Œ8 Encounters 
  .  .  .  .  .  .  .  .  .  .  .  .  .  . 24Appendix 3: Game Aids   .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 29Organized Play
  
.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  . 35Campaign Home Page: path˚ndersociety
.clubBooks: Path˜nder Core Rulebook
, Path˜nder Bestiary, Path˜nder Gamemastery Guide, and 
Lost Omens Grand Bazaar
Maps: Path˜nder Flip-Mat: Bigger Island
Online Resource:
 Path˚nder Reference Document at 
paizo
.com/prd
Scenario tags provide additional information about an adventure™s contents. For more 
information on scenario tags, see the 
Guide to Organized Play: Path˜nder Society
 at 
http://
www.organizedplayfoundation
.org/paizo/guides/
.˜˚˛˝˙ˆˇ˛

˝
˘ˇ˝
˝

The night hag Aslynn has trapped the psychic leader of the Onyx Alliance, Sarnia Blakros, in 
a dream prison while she advances her own schemes. Sarnia disrupts the Path˚nders™ sleep 
in the Grand Lodge with mysterious calls for help in the form of frightening dreams that 
take place on an island in a sea of dust. Unsure of the source of their shared unquiet sleep, 
the Society dispatches agents into the realm from their dreams to ˚nd out what™s happening 
and put a stop to it.paizo.com #36949176, Michael Phox <[email protected]>, Jun 30, 2022paizo.com #36949176, Michael Phox <[email protected]>, Jun 30, 2022215238622152386221523862901095901095901095
901095 paizo.com #36949176, Michael Phox <[email protected]>, Jun 30, 2022 901095
Dreams of a Dustbound Isle
Table of Contents
AUTHOR 
Matt Duval Adventure �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  � 3
DEVELOPMENT LEAD 
Mike Kimmel Appendix 1: Level 5–6 Encounters �  �  �  �  �  �  �  �  �  �  �  �  �  �  � 20
ADDITIONAL DEVELOPMENT 
Linda Zayas-Palmer
Appendix 2: Level 7–8 Encounters   �  �  �  �  �  �  �  �  �  �  �  �  �  � 24
EDITING LEAD 
Solomon St. John
Appendix 3: Game Aids   �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  � 29
EDITORS 
Simone D. Sallé and Solomon St. John
Organized Play  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  �  � 35
COVER ARTISTS 
Raphael Madureira and Maurice Risulmi
INTERIOR ARTISTS 
Josef Kucě ra, Raphael Madureira, and Matias Tapia GM Resources
CARTOGRAPHER  Campaign Home Page: pathfindersociety�club
Jason Engle
Books: Pathfinder Core Rulebook, Pathfinder Bestiary, Pathfinder Gamemastery Guide, and 
ART DIRECTION  Lost Omens Grand Bazaar
Tony Barnett
Maps: Pathfinder Flip-Mat: Bigger Island
GRAPHIC DESIGN  Online Resource: Pathfinder Reference Document at paizo�com/prd
Justin Lucas
DEVELOPMENT MANAGER 
Linda Zayas-Palmer Scenario Tags
ORGANIZED PLAY COORDINATOR  Scenario tags provide additional information about an adventure’s contents. For more 
Alex Speidel
information on scenario tags, see the Guide to Organized Play: Pathfinder Society at http://
CREATIVE DIRECTOR  www�organizedplayfoundation�org/paizo/guides/.
James Jacobs
M  (S  S )
DIRECTOR OF COMMUNITY  etaplot hattered anctuarieS
Tonya Woldridge
DIRECTOR OF GAME DEVELOPMENT  F  (r  o )
action adiant ath
Adam Daigle
21523862 EPrUikB MLIoSnHaER  Faction (Verdant Wheel) 901095
verdant wheel) 90109 faction not present
Summary
The night hag Aslynn has trapped the psychic leader of the Onyx Alliance, Sarnia Blakros, in 
a dream prison while she advances her own schemes. Sarnia disrupts the Pathfinders’ sleep 
in the Grand Lodge with mysterious calls for help in the form of frightening dreams that 
take place on an island in a sea of dust. Unsure of the source of their shared unquiet sleep, 
HOW TO PLAY
the Society dispatches agents into the realm from their dreams to find out what’s happening 
and put a stop to it.
PLAY TIME: 4–5 HOURS
LEVELS: 5–8
PLAYERS: 3–6
Paizo Inc.
7120 185th Ave NE, Ste 120
Redmond, WA 98052-0577
paizo.com
#####removed intentionally#####
[help.pdf](https://github.com/jsvine/pdfplumber/files/9034242/help.pdf)

jsvine · 2022-07-11T22:26:48Z

Maintainer

Hi @mphox-phoxdev, it looks like this stems from the PDF using different font sizes to represent capital letters, which throws off the vertical alignment of capital and lowercase characters, so they end up on separate lines. Passing y_tolerance=4 to the text extraction (the default is 3) seems to help:

    import pdfplumber

    pdf = pdfplumber.open("pdfs/help.pdf")
    page = pdf.pages[0].dedupe_chars(tolerance=1)
    # A larger y_tolerance keeps the differently sized capitals and lowercase letters on one line
    print(page.extract_text(y_tolerance=4))

Result:

[...]
DIRECTOR OF COMMUNITY  MeTaploT (ShaTTeRed SanCTuaRieS)
Tonya Woldridge
DIRECTOR OF GAME DEVELOPMENT  FaCTion (RadianT oaTh)
Adam Daigle
21523862 EPrUikB MLIoSnHaER  FaCTion (VeRdanT Wheel) 901095
[...]
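To illustrate why the tolerance matters, here is a simplified sketch of grouping characters into lines by their vertical position. It is not pdfplumber's actual implementation, and the coordinates are made-up values chosen to mimic capitals set in a larger font than the lowercase letters; group_into_lines is a hypothetical helper.

```python
def group_into_lines(chars, y_tolerance):
    """Cluster characters into lines by their "top" coordinate.

    A character starts a new line when its top is more than
    y_tolerance below the previous character's top.
    """
    lines = []
    last_top = None
    for char in sorted(chars, key=lambda c: c["top"]):
        if last_top is None or char["top"] - last_top > y_tolerance:
            lines.append([])
        lines[-1].append(char)
        last_top = char["top"]
    # Within each line, restore left-to-right reading order.
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]


# Hypothetical coordinates: the capitals sit 3.5 units higher than the
# lowercase letters because they are drawn in a larger font.
chars = [
    {"text": "M", "x0": 0.0, "top": 100.0},
    {"text": "e", "x0": 8.0, "top": 103.5},
    {"text": "T", "x0": 14.0, "top": 100.0},
    {"text": "a", "x0": 22.0, "top": 103.5},
]

print(group_into_lines(chars, y_tolerance=3))  # capitals and lowercase split apart
print(group_into_lines(chars, y_tolerance=4))  # everything lands on one line
```

With a tolerance of 3 the capitals and lowercase letters fall into separate lines, much like the "M  (S  S )" and "etaplot hattered anctuarieS" lines in the original output; with 4 they merge.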

You'll notice that the tweak causes problems for the left side of the PDF. If that's a problem, you might want to parse each side separately, using page.crop(...).
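A rough sketch of that per-side approach, assuming the credits column takes up roughly the left third of the page. The 0.35 ratio, the helper names split_bbox and extract_columns, and the file path are illustrative assumptions, not values measured from this PDF; you would need to adjust the split point by inspecting the page.

```python
def split_bbox(bbox, ratio):
    """Split an (x0, top, x1, bottom) box into left and right boxes.

    ratio is the fraction of the width given to the left box.
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio must be between 0 and 1")
    x0, top, x1, bottom = bbox
    split_x = x0 + (x1 - x0) * ratio
    return (x0, top, split_x, bottom), (split_x, top, x1, bottom)


def extract_columns(path, page_index=0, ratio=0.35):
    """Extract the left and right sides of a page separately."""
    # Third-party dependency, imported here so split_bbox stays usable without it.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index].dedupe_chars(tolerance=1)
        left_box, right_box = split_bbox(page.bbox, ratio)
        # Default tolerance for the credits column on the left.
        left_text = page.crop(left_box).extract_text()
        # Looser tolerance only where the mixed-size capitals appear.
        right_text = page.crop(right_box).extract_text(y_tolerance=4)
    return left_text, right_text
```

Calling extract_columns("pdfs/help.pdf") would return the two sides as separate strings, so the larger y_tolerance no longer affects the left column.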
