Syntax error while parsing object key (pdf with Chinese characters) #351

Open
danielkorzekwa opened this issue Nov 15, 2024 · 0 comments
Labels
bug Something isn't working

Bug

Converting a PDF that contains Chinese characters fails with a RuntimeError: "syntax error while parsing object key".

...
ArxivService.extract_text_from_pdf(self, arxiv_id)
41 source = self._get_pdf_filepath(arxiv_id)
42 converter = DocumentConverter()
---> 43 result = converter.convert(source)
44 pdf_text = result.document.export_to_markdown()
...
miniconda3/envs/bfs/lib/python3.12/site-packages/docling/backend/docling_parse_backend.py#line=24, in DoclingParsePageBackend.__init__(self, parser, document_hash, page_no, page_obj)
21 def __init__(
22 self, parser: pdf_parser_v1, document_hash: str, page_no: int, page_obj: PdfPage
23 ):
24 self._ppage = page_obj
---> 25 parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
27 self.valid = "pages" in parsed_page
28 if self.valid:

RuntimeError: [json.exception.parse_error.101] parse error at line 13, column 36: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"/PVOXJK+FlexiFontBZ?<U+001F>'; expected string literal
https://arxiv.org/pdf/2410.06488
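
For context, the message complains about a raw U+001F control character inside a JSON object key (here, what looks like an embedded font name). JSON requires control characters below U+0020 to be escaped inside strings. The sketch below only illustrates that rule with Python's standard json module; it is not docling-parse's code, and the font name is copied from the error message purely as sample data.

```python
import json

# Sample key copied from the error message; "\x1f" is the raw U+001F (unit separator).
font_name = "/PVOXJK+FlexiFontBZ?\x1f"

# JSON text assembled by hand, with the control character left unescaped in the key.
raw_json = '{"' + font_name + '": 1}'

try:
    json.loads(raw_json)
    raw_accepted = True
except json.JSONDecodeError as exc:
    # A strict parser rejects the raw control character.
    raw_accepted = False
    print("rejected:", exc.msg)

# A serializer that escapes control characters produces text that parses cleanly.
escaped_json = json.dumps({font_name: 1})
roundtrip = json.loads(escaped_json)
print(escaped_json)
```

If the same thing happens in the parser's own JSON output, escaping control characters in keys on the writing side would presumably avoid the error, but that is an assumption about the cause, not something confirmed here.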

Steps to reproduce

Convert this PDF with docling:
https://arxiv.org/pdf/2410.06488
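
A minimal reproduction sketch, based on the calls visible in the traceback. The helper name convert_to_markdown and the PDF_URL constant are made up for this example, and passing the URL directly as the source is an assumption; downloading the file first and passing a local path, as in the original code, should behave the same.

```python
# Hypothetical constant for this example: the PDF from this report.
PDF_URL = "https://arxiv.org/pdf/2410.06488"


def convert_to_markdown(source: str) -> str:
    """Convert a PDF to Markdown using the same calls as in the traceback."""
    # Imported inside the function so this file loads even without docling installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)  # the RuntimeError is raised from this call
    return result.document.export_to_markdown()
```

With the versions listed below, calling convert_to_markdown(PDF_URL) is expected to raise the RuntimeError shown above rather than return text.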

Docling version

pip list | grep docling
docling 2.4.2
docling-core 2.3.1
docling-ibm-models 2.0.3
docling-parse 2.0.3

Python version

Python 3.12.4

@danielkorzekwa danielkorzekwa added the bug Something isn't working label Nov 15, 2024