This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

When extracting PDF To XML, Last </pages> tag omitted #229

Open
jimitkr opened this issue Jul 14, 2018 · 1 comment

Comments

jimitkr commented Jul 14, 2018

When converting the attached PDF file to XML with the code below, the output should end with a closing </pages> tag. That tag is omitted.

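from io import BytesIO

from pdfminer.converter import XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

# downloaded_pdf_file is the path to the input PDF (defined elsewhere)
resource_mgr = PDFResourceManager()
retstr = BytesIO()  # in-memory buffer that receives the XML output
codec = 'utf-8'
laparams = LAParams()
device = XMLConverter(resource_mgr, retstr, codec=codec, laparams=laparams)
maxpages = 0  # 0 means no page limit
caching = True
pagenos = set()  # empty set means all pages
infile_pdf_fp = open(downloaded_pdf_file, 'rb')
interpreter = PDFPageInterpreter(resource_mgr, device)
for page in PDFPage.get_pages(infile_pdf_fp, pagenos, maxpages=maxpages, password='', caching=caching, check_extractable=True):
    interpreter.process_page(page)

data = retstr.getvalue()
device.close()
retstr.close()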

pdf.pdf

Last 5 lines of extracted XML:
</textgroup>
</textgroup>
</textgroup>
</layout>
</page>

This happens with every PDF I have tried. The problem does not show up when using pdf2txt.py.
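One possible explanation, offered as an assumption rather than a confirmed diagnosis: if XMLConverter only writes the closing </pages> tag from its close() method, then reading retstr.getvalue() before calling device.close() would capture the output without that tag. The sketch below models this with a hypothetical stand-in class (FooterOnCloseWriter and render are illustrative names, not part of pdfminer) so the ordering effect can be seen without a PDF.

```python
from io import BytesIO


class FooterOnCloseWriter:
    """Hypothetical stand-in: writes the opening tag on creation and the
    closing tag only when close() is called. Not pdfminer's actual code."""

    def __init__(self, outfp):
        self.outfp = outfp
        self.outfp.write(b"<pages>\n")

    def write_page(self, page_id):
        self.outfp.write(b'<page id="%d"></page>\n' % page_id)

    def close(self):
        # The footer is emitted here, so it exists only after close().
        self.outfp.write(b"</pages>\n")


def render(close_first):
    """Return the buffer contents, reading either after or before close()."""
    buf = BytesIO()
    writer = FooterOnCloseWriter(buf)
    writer.write_page(1)
    if close_first:
        writer.close()
        data = buf.getvalue()
    else:
        data = buf.getvalue()  # snapshot taken before the footer is written
        writer.close()
    buf.close()
    return data


print(render(close_first=True).decode())
print(render(close_first=False).decode())
```

If the assumption holds for the snippet in this report, calling device.close() before retstr.getvalue() (and closing retstr last) would be worth trying.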

@shubhamsaket1993

I am facing the same issue with pdfminer. It works well for small PDFs.
When I try to parse the output with lxml.etree, it fails with a missing-tag error:
lxml.etree.XMLSyntaxError: Premature end of data in tag pages line 2, line 1594542, column 1
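A quick way to check whether extracted output is complete before handing it to a parser downstream is to attempt a parse and catch the error. This is a minimal sketch using the standard library's xml.etree.ElementTree rather than lxml; is_well_formed is a hypothetical helper name, and the sample byte strings are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if xml_bytes parses as a complete XML document."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        # Raised for truncated input, e.g. a missing closing root tag.
        return False
    return True


complete = b"<pages><page id='1'></page></pages>"
truncated = b"<pages><page id='1'></page>"  # closing </pages> missing

print(is_well_formed(complete))
print(is_well_formed(truncated))
```

For very large outputs like the one in the traceback above, this loads the whole document into memory, so it is only meant as a diagnostic check.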

side2k pushed a commit to side2k/pdfminer that referenced this issue Jul 14, 2019