When extracting PDF To XML, Last </pages> tag omitted #229

jimitkr · 2018-07-14T20:56:01Z

When converting the attached PDF file to XML using the code below, the output should end with a closing </pages> tag. That tag is omitted.

resource_mgr = PDFResourceManager()
retstr = BytesIO()
codec = 'utf-8'
laparams = LAParams()
device = XMLConverter(resource_mgr, retstr, codec=codec, laparams=laparams)
maxpages = 0
caching = True
pagenos=set()
infile_pdf_fp = file(downloaded_pdf_file, 'rb')
interpreter = PDFPageInterpreter(resource_mgr, device)
for page in PDFPage.get_pages(infile_pdf_fp, pagenos, maxpages=maxpages, password='', caching=caching, check_extractable=True):
    interpreter.process_page(page)

data = retstr.getvalue()
device.close()
retstr.close()

pdf.pdf

Last 5 lines of extracted xml:
</textgroup>
</textgroup>
</textgroup>
</layout>
</page>

This happens with every single PDF. The problem doesn't show up when using pdf2txt.py.
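
One possible explanation, offered as a sketch rather than a confirmed diagnosis: if the converter only writes the closing </pages> footer when close() is called, then the snippet above reads retstr.getvalue() before device.close() and so captures the output without the footer. The toy class below is a hypothetical stand-in (ToyXMLConverter is not part of pdfminer) that mimics that assumed behaviour, so the effect of the call order can be seen without a PDF.

```python
from io import BytesIO


class ToyXMLConverter:
    """Hypothetical stand-in, NOT pdfminer's XMLConverter.

    Assumption: the <pages> header is written on construction and the
    </pages> footer is written only when close() is called.
    """

    def __init__(self, outfp, codec="utf-8"):
        self.outfp = outfp
        self.codec = codec
        self._write("<pages>\n")

    def _write(self, text):
        self.outfp.write(text.encode(self.codec))

    def process_page(self, page_id):
        self._write('<page id="%d"></page>\n' % page_id)

    def close(self):
        # The footer only reaches the buffer here.
        self._write("</pages>\n")


def convert(read_before_close):
    buf = BytesIO()
    device = ToyXMLConverter(buf)
    for page_id in (1, 2):
        device.process_page(page_id)
    if read_before_close:
        data = buf.getvalue()   # same order as the snippet in the report
        device.close()
    else:
        device.close()          # let the device write its footer first
        data = buf.getvalue()
    buf.close()
    return data.decode("utf-8")


early = convert(read_before_close=True)
late = convert(read_before_close=False)
print(early.rstrip().endswith("</pages>"))  # False
print(late.rstrip().endswith("</pages>"))   # True
```

If the same assumption holds for the real converter, moving data = retstr.getvalue() to after device.close() (and before retstr.close()) would be the thing to try.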


shubhamsaket1993 · 2019-02-14T12:29:02Z

I am facing the same issue with pdfminer, although it works well for small PDFs.
When I try to parse the output with lxml's etree, it fails with a missing-tag error:
lxml.etree.XMLSyntaxError: Premature end of data in tag pages line 2, line 1594542, column 1
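
The error above is what a parser reports when the root <pages> element is never closed. As a rough sketch, the check below uses the standard library's xml.etree.ElementTree (chosen here instead of lxml only to avoid a dependency). The helper name is_well_formed and the sample strings are made up for illustration and do not come from the attached PDF.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Illustrative output missing its closing root tag, as in the report.
truncated = '<pages>\n<page id="1"></page>\n'
# The same output with the footer present.
complete = truncated + "</pages>\n"

print(is_well_formed(truncated))  # False
print(is_well_formed(complete))   # True
```

A check like this can be run on the extracted XML before handing it to downstream code, so that a missing footer is caught early.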

side2k pushed a commit to side2k/pdfminer that referenced this issue Jul 14, 2019

name2unicode(): handle hexadecimal constants for unicode glyphs

c4c0a36

fixes euske#183, euske#229
