pdfplumber failed to open a specific Chinese financial report #537

aaron792 · 2021-11-11T13:46:54Z

I am using pdfplumber to batch-process the financial reports of Chinese listed companies. For most reports, pdfplumber works and gives good results. However, for a small number of PDF files, pdfplumber cannot open them and raises the following error:

Traceback (most recent call last):
  File "C:\Users\aaron792\PycharmProjects\test\test5.py", line 23, in <module>
    doc = PDFDocument(parser)
  File "C:\Users\aaron792\PycharmProjects\test\venv\lib\site-packages\pdfminer\pdfdocument.py", line 554, in __init__
    xref.load(parser)
  File "C:\Users\aaron792\PycharmProjects\test\venv\lib\site-packages\pdfminer\pdfdocument.py", line 177, in load
    (_, obj) = parser.nextobject()
  File "C:\Users\aaron792\PycharmProjects\test\venv\lib\site-packages\pdfminer\psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Type', /'Font', /'Subtype', /'Type0', /'Encoding', /'Identity-H', /'DescendantFonts', [PDFObjRef:190], /'BaseFont', /b'FNTSBS+\xd0\xc2', b'', /'ToUnicode', PDFObjRef:274]

I suspect the problem is in the pdfminer package, because processing the same file with pdfminer directly raises the same error. When I switched to PDFBox (in Java), the file opened correctly and its text was extracted without problems. However, PDFBox cannot extract tables, so it does not fully meet my needs. I am reporting this error in the hope that it helps improve pdfplumber. Overall, pdfplumber is a great tool. Thank you, jsvine!
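For anyone hitting the same error in a batch run, here is a minimal sketch of how the unreadable files could be set aside so the rest of the batch continues. The helper name `split_readable` and the `opener` parameter are hypothetical, introduced only for illustration; the only pdfplumber call assumed is `pdfplumber.open`, and the broad `except Exception` is a deliberate simplification rather than a recommendation.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def _open_with_pdfplumber(path: str):
    # Imported lazily so the helper below can be exercised without
    # pdfplumber installed (for example with a fake opener in a test).
    import pdfplumber

    return pdfplumber.open(path)


def split_readable(
    paths: Iterable[str],
    opener: Callable = _open_with_pdfplumber,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (readable, failed); failed maps each path to its error text."""
    readable: List[str] = []
    failed: Dict[str, str] = {}
    for path in paths:
        try:
            # Opening is enough to trigger the parse error reported above.
            with opener(path):
                pass
        except Exception as exc:  # broad on purpose: we only want to log and skip
            failed[path] = f"{type(exc).__name__}: {exc}"
        else:
            readable.append(path)
    return readable, failed
```

The `failed` mapping can then be written to a log so the problematic reports can be handled with another tool.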
000691亚太实业2014年年度报告.PDF

mkl-public · 2021-11-11T19:24:25Z

It looks like pdfminer has problems parsing PDF name objects containing certain special characters.

The dictionary in question is

<< 
   /Type /Font
   /Subtype /Type0
   /Encoding /Identity-H
   /DescendantFonts [7 0 R ]
   /BaseFont /FNTSBS+ËÎ
   /ToUnicode 305 0 R
>>

but the name /FNTSBS+ËÎ is parsed as two tokens /b'FNTSBS+\xd0\xc2', b''.
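As a rough illustration of the expected tokenization, the sketch below reads one name object as raw bytes, following the PDF rule that a name runs until the next whitespace or delimiter character and that `#xx` is a hex escape. This is not pdfminer's code; the function `read_name` is hypothetical, and the sample input assumes the name bytes in the file are the ones quoted in this comment.

```python
import re
from typing import Tuple

# A name is "/" followed by any bytes except PDF whitespace and delimiters.
_NAME = re.compile(rb"/([^\x00\t\n\x0c\r ()<>\[\]{}/%]*)")
# Inside a name, "#xx" stands for the byte with hex value xx.
_HEX = re.compile(rb"#([0-9A-Fa-f]{2})")


def read_name(data: bytes, pos: int = 0) -> Tuple[bytes, int]:
    """Return (name_bytes, end_offset) for the name object starting at pos."""
    match = _NAME.match(data, pos)
    if match is None:
        raise ValueError("no name object at this position")
    # Decode #xx escapes, but never interpret the bytes as text.
    name = _HEX.sub(lambda h: bytes([int(h.group(1), 16)]), match.group(1))
    return name, match.end()


sample = b"/FNTSBS+\xcb\xce\n/ToUnicode 305 0 R"
name, end = read_name(sample)
print(name)  # one token, high bytes left untouched
```

Keeping the name as `bytes` avoids any decision about character encoding at the tokenizer level.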

("ËÎ" is an ANSI interpretation of the two bytes 0xCB, 0xCE. That pdfminer turns them into \xd0\xc2 seems to indicate that it tries to interpret the name bytes according to some mixed(?) multibyte encoding like UTF-8. Doing so is completely wrong.)
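The byte-level point can be checked with Python's standard codecs alone. In this small sketch, "ANSI" is assumed to mean Windows-1252, and the GBK decoding at the end is an extra assumption based only on the report being Chinese; neither is confirmed by the file itself.

```python
raw = b"\xcb\xce"  # the two bytes quoted in the comment above

# "ANSI" is taken here to mean Windows-1252 (an assumption).
ansi_view = raw.decode("cp1252")


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Assumption: a Chinese document might use GBK for font names.
gbk_view = raw.decode("gbk")

print(repr(ansi_view), is_valid_utf8(raw), repr(gbk_view))
```

Neither 0xCB 0xCE nor the \xd0\xc2 pair from the error message is valid UTF-8, which is consistent with the view that name bytes should be passed through untouched rather than decoded.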
