RuntimeError: [json.exception.parse_error.101] parse error at line 13, column 36: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"/PVOXJK+FlexiFontBZ?<U+001F>'; expected string literal https://arxiv.org/pdf/2410.06488
Bug
Converting a PDF that contains Chinese characters fails with a "syntax error while parsing object key" exception.
...
ArxivService.extract_text_from_pdf(self, arxiv_id)
41 source = self._get_pdf_filepath(arxiv_id)
42 converter = DocumentConverter()
---> 43 result = converter.convert(source)
44 pdf_text = result.document.export_to_markdown()
...
miniconda3/envs/bfs/lib/python3.12/site-packages/docling/backend/docling_parse_backend.py, in DoclingParsePageBackend.__init__(self, parser, document_hash, page_no, page_obj)
21 def __init__(
22 self, parser: pdf_parser_v1, document_hash: str, page_no: int, page_obj: PdfPage
23 ):
24 self._ppage = page_obj
---> 25 parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
27 self.valid = "pages" in parsed_page
28 if self.valid:
RuntimeError: [json.exception.parse_error.101] parse error at line 13, column 36: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"/PVOXJK+FlexiFontBZ?<U+001F>'; expected string literal
https://arxiv.org/pdf/2410.06488
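For anyone trying to make sense of the message: it says a font name used as a JSON object key contains a raw U+001F control character, which JSON does not allow unescaped inside a string. The sketch below reproduces that class of failure with Python's standard `json` module as a stand-in. It is not the parser that raised the error above, and the key is simply copied from the error text; `raw_json`, `escaped_json`, and the other variable names are made up for illustration.

```python
import json

# Font name key copied from the error message, with the raw U+001F
# (unit separator) control character at the end.
font_key = "/PVOXJK+FlexiFontBZ?\x1f"

# Hand-built JSON text with the control character left unescaped in the key.
raw_json = '{"' + font_key + '": 1}'

try:
    json.loads(raw_json)
    raw_ok = True
except json.JSONDecodeError as exc:
    # Strict JSON parsers reject unescaped control characters in strings.
    raw_ok = False
    print("rejected:", exc.msg)

# Serializing through json.dumps escapes the control character,
# so the resulting text parses and round-trips cleanly.
escaped_json = json.dumps({font_key: 1})
round_trip = json.loads(escaped_json)
print(escaped_json)
```

If this reading is right, the failure would come from the font name being written into JSON without escaping, rather than from the Chinese characters themselves, but that is an assumption based only on the error text.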
Steps to reproduce
Parse this pdf:
https://arxiv.org/pdf/2410.06488
Docling version
pip list |grep docling
docling 2.4.2
docling-core 2.3.1
docling-ibm-models 2.0.3
docling-parse 2.0.3
Python version
Python 3.12.4