pdfplumber failed to open a specific Chinese financial report #537
aaron792
started this conversation in
Ask for help with specific PDFs
Replies: 1 comment
-
It looks like pdfminer has problems parsing PDF name objects containing certain special characters. The dictionary in question is
but the name ("ËÎ" is an ANSI interpretation of the two bytes 0xCB, 0xCE. That pdfminer turns these into \xd0\xc2 seems to indicate that it tries to interpret the name bytes according to some mixed(?) multibyte encoding like UTF-8. Doing so is completely wrong.)
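A small illustration of the byte-level point, not taken from the original reply: PDF names are byte sequences, so how they display depends entirely on the encoding a reader chooses to apply. The sketch below uses only Python's built-in codecs. The GBK step is an assumption made for illustration, on the grounds that the file is a Chinese report; the thread does not say which encoding the PDF producer used.

```python
# The two raw bytes of the PDF name discussed above.
raw = bytes([0xCB, 0xCE])

# Read as Windows-1252 ("ANSI"), each byte maps to exactly one character.
ansi_text = raw.decode("cp1252")
print(ansi_text == "ËÎ")  # True

# Read as UTF-8, the pair is invalid: 0xCB opens a two-byte sequence,
# but 0xCE is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_valid = True
except UnicodeDecodeError:
    utf8_valid = False
print(utf8_valid)  # False

# Assumption (not stated in the thread): the producer wrote the name in GBK.
# Under that guess, the same two bytes form a single Chinese character.
gbk_text = raw.decode("gbk")
print(gbk_text == "宋")  # True
```

Whichever interpretation is intended for display, a parser that wants to look the name up again has to keep the original bytes unchanged.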
-
I am using pdfplumber to batch-process the financial reports of Chinese listed companies. For most reports, pdfplumber works and gives good results. However, a small number of PDF files cannot be opened at all, and pdfplumber raises the following error:
I suspect the problem is in the pdfminer package, because processing the same file with pdfminer directly raises the same error. When I switched to pdfbox (Java), the file opened correctly and its text was extracted cleanly. However, pdfbox cannot extract tables, so it does not fully meet my needs. I am reporting this error in the hope that it helps improve pdfplumber. Overall, pdfplumber is a great tool. Thank you, jsvine!
000691亚太实业2014年年度报告.PDF